* [RFC v3 00/13] vfs: hot data tracking
@ 2012-10-10 10:07 zwu.kernel
  2012-10-10 10:07 ` [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' zwu.kernel
                   ` (16 more replies)
  0 siblings, 17 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

NOTE:

  This patchset is being posted mainly to confirm that it is
heading in the right direction and to collect comments from
reviewers.
  For more information, please see
Documentation/filesystems/hot_tracking.txt.

TODO List:

 1.) Run scalability and performance tests.
 2.) Make some macros tunable:
       TIME_TO_KICK and HEAT_UPDATE_DELAY
 3.) Refactor hot_hash_is_aging()
       If you just made the timeout value a timespec and compared
     the _timespecs_, you would be doing a lot fewer conversions.
 4.) Clean up some unnecessary lock protection.
 5.) Add more comments explaining how the temperature is calculated
       How to "read" the avg read/write time (nanoseconds,
     microseconds, jiffies....??)
 6.) Make temperature updates more parallel.
 7.) Decide how to save file temperatures across umount so that
     they are preserved after a reboot.
 8.) Add one new ioctl interface to set the temperature value.
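
TODO item 3 suggests comparing timespecs directly instead of
converting them first. How hot_hash_is_aging() is actually written is
not shown in this part of the thread, so the following is only a
minimal userspace sketch of the idea. The names ts_cmp() and
ts_is_aging() are hypothetical and do not come from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define DEMO_NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper: returns <0, 0 or >0, like memcmp().
 * Both values must be normalized (0 <= tv_nsec < 1e9).
 */
int ts_cmp(const struct timespec *a, const struct timespec *b)
{
	assert(a->tv_nsec >= 0 && a->tv_nsec < DEMO_NSEC_PER_SEC);
	assert(b->tv_nsec >= 0 && b->tv_nsec < DEMO_NSEC_PER_SEC);

	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/* Hypothetical aging check: true when (now - last) >= timeout. */
bool ts_is_aging(const struct timespec *last, const struct timespec *now,
		 const struct timespec *timeout)
{
	struct timespec delta;

	/* Subtract field by field and borrow one second if needed. */
	delta.tv_sec = now->tv_sec - last->tv_sec;
	delta.tv_nsec = now->tv_nsec - last->tv_nsec;
	if (delta.tv_nsec < 0) {
		delta.tv_sec -= 1;
		delta.tv_nsec += DEMO_NSEC_PER_SEC;
	}

	return ts_cmp(&delta, timeout) >= 0;
}
```

The comparison needs only two integer compares per call; no
conversion to nanoseconds or jiffies takes place on the hot path.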

Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to
be very btrfs-specific.  I've taken their code and attempted to
make it more generic and integrate it at the VFS level.

Changelog from v2:
 1.) Converted from RB-trees to radix trees [Zhiyong, Dave Chinner]
 2.) Added a memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleaned up a lot of other issues [Dave Chinner]

v1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
                                        [Marco Stornelli, Dave Chinner]

Zhi Yong Wu (13):
  btrfs: add one new mount option '-o hot_track'
  vfs: introduce private radix tree structures
  vfs: Initialize and free main data structures
  vfs: add function for collecting raw access info
  vfs: add two map arrays
  vfs: add hooks to enable hot data tracking
  vfs: add function for updating map arrays
  vfs: add aging function for old map info
  vfs: add one wq to update map info periodically
  vfs: register one memory shrinker
  vfs: add 3 new ioctl interfaces
  vfs: add debugfs support
  vfs: add documentation

 Documentation/filesystems/00-INDEX         |    2 +
 Documentation/filesystems/hot_tracking.txt |  165 ++++
 fs/Makefile                                |    2 +-
 fs/btrfs/ctree.h                           |    1 +
 fs/btrfs/super.c                           |   15 +-
 fs/compat_ioctl.c                          |    9 +
 fs/direct-io.c                             |    8 +
 fs/hot_tracking.c                          | 1321 ++++++++++++++++++++++++++++
 fs/hot_tracking.h                          |  155 ++++
 fs/ioctl.c                                 |  122 +++
 include/linux/fs.h                         |    4 +
 include/linux/hot_tracking.h               |  123 +++
 mm/filemap.c                               |    7 +
 mm/page-writeback.c                        |   13 +
 mm/readahead.c                             |    7 +
 15 files changed, 1952 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.6.5



* [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
       [not found]   ` <5075632c.03cc440a.1b33.7805SMTPIN_ADDED@mx.google.com>
  2012-10-10 16:28   ` David Sterba
  2012-10-10 10:07 ` [RFC v3 02/13] vfs: introduce private radix tree structures zwu.kernel
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce a new mount option, '-o hot_track',
and add parsing support for it.
  Usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h |    1 +
 fs/btrfs/super.c |    7 ++++++-
 2 files changed, 7 insertions(+), 1 deletions(-)
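
The option handling above comes down to matching a token and setting
one bit in a mount-option word. A minimal userspace sketch of that
pattern follows. The DEMO_* names and demo_parse_options() are
hypothetical; the real patch uses the btrfs match_table_t machinery,
and this sketch treats "nouser" as just another token purely for
illustration.

```c
#include <assert.h>
#include <string.h>

/* Illustrative bit values; they do not match the BTRFS_MOUNT_* bits. */
#define DEMO_MOUNT_NOUSER	(1UL << 0)
#define DEMO_MOUNT_HOT_TRACK	(1UL << 1)

/* Same token-pasting style as btrfs_set_opt()/btrfs_test_opt(). */
#define demo_set_opt(o, opt)	((o) |= DEMO_MOUNT_##opt)
#define demo_test_opt(o, opt)	((o) & DEMO_MOUNT_##opt)

/*
 * Parse a comma-separated option string into *mount_opt.
 * Returns 0 on success, -1 on an unknown token or an overlong string.
 */
int demo_parse_options(const char *options, unsigned long *mount_opt)
{
	char buf[128];
	char *tok;

	assert(options && mount_opt);
	if (strlen(options) >= sizeof(buf))
		return -1;
	strcpy(buf, options);	/* strtok() modifies its input */

	for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		if (!strcmp(tok, "hot_track"))
			demo_set_opt(*mount_opt, HOT_TRACK);
		else if (!strcmp(tok, "nouser"))
			demo_set_opt(*mount_opt, NOUSER);
		else
			return -1;
	}
	return 0;
}
```

Because each option is one bit, the order of the tokens does not
matter: "hot_track,nouser" and "nouser,hot_track" give the same word.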

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9821b67..094bec6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK		(1 << 23)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 83d6f9f..00be9e3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -41,6 +41,7 @@
 #include <linux/slab.h>
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -303,7 +304,7 @@ enum {
 	Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
-	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
+	Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
 	Opt_check_integrity, Opt_check_integrity_including_extent_data,
 	Opt_check_integrity_print_mask, Opt_fatal_errors,
 	Opt_err,
@@ -342,6 +343,7 @@ static match_table_t tokens = {
 	{Opt_no_space_cache, "nospace_cache"},
 	{Opt_recovery, "recovery"},
 	{Opt_skip_balance, "skip_balance"},
+	{Opt_hot_track, "hot_track"},
 	{Opt_check_integrity, "check_int"},
 	{Opt_check_integrity_including_extent_data, "check_int_data"},
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
@@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_skip_balance:
 			btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
 			break;
+		case Opt_hot_track:
+			btrfs_set_opt(info->mount_opt, HOT_TRACK);
+			break;
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 		case Opt_check_integrity_including_extent_data:
 			printk(KERN_INFO "btrfs: enabling check integrity"
-- 
1.7.6.5



* [RFC v3 02/13] vfs: introduce private radix tree structures
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
  2012-10-10 10:07 ` [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 15:34   ` David Sterba
  2012-10-10 10:07 ` [RFC v3 03/13] vfs: Initialize and free main data structures zwu.kernel
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Define one root structure, hot_info. It is hooked up in
super_block and will hold the radix tree root, the hash list
root, and other related information.
  Add the hot_inode_tree struct to keep track of frequently
accessed files; it is keyed by {inode, offset}. The trees
contain hot_inode_items representing those files and ranges.
  Having these trees means that the VFS can quickly determine
the temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
  Define two items, hot_inode_item and hot_range_item. The
former represents one tracked file and keeps track of its
access frequency and the tree of ranges in that file, while
the latter represents a file range of one inode.
  Each of the two structures contains a hot_freq_data struct
with its access frequency metrics (number of {reads, writes},
last {read, write} time, frequency of {reads, writes}).
  Also, each hot_inode_item contains one hot_range_tree struct,
which is keyed by {inode, offset, length} and is used to keep
track of all the ranges in that file.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/Makefile                  |    2 +-
 fs/hot_tracking.c            |  138 ++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   26 ++++++++
 include/linux/hot_tracking.h |   74 ++++++++++++++++++++++
 4 files changed, 239 insertions(+), 1 deletions(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h
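
The commit message above describes a common part (hot_comm_item)
embedded in both item types. A minimal userspace sketch of that
layout follows. All demo_* names are hypothetical, the locks and radix
trees are left out, and a plain int stands in for the kref. The
initial average of (u64)-1 follows the patch; reading it as "no access
seen yet" is an interpretation, not something the patch states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Recover the enclosing struct from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Subset of hot_freq_data. */
struct demo_freq_data {
	uint32_t nr_reads;
	uint32_t nr_writes;
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
};

/* Common part shared by inode items and range items. */
struct demo_comm_item {
	struct demo_freq_data freq;
	int refs;			/* stand-in for struct kref */
};

struct demo_inode_item {
	struct demo_comm_item comm;
	uint64_t ino;
};

struct demo_range_item {
	struct demo_comm_item comm;
	struct demo_inode_item *inode;	/* owning inode item */
	uint32_t start;			/* index of this range */
};

static void demo_comm_init(struct demo_comm_item *c)
{
	c->freq.nr_reads = 0;
	c->freq.nr_writes = 0;
	c->freq.avg_delta_reads = (uint64_t)-1;
	c->freq.avg_delta_writes = (uint64_t)-1;
	c->refs = 1;
}

void demo_inode_init(struct demo_inode_item *he, uint64_t ino)
{
	demo_comm_init(&he->comm);
	he->ino = ino;
}

void demo_range_init(struct demo_range_item *hr,
		     struct demo_inode_item *he, uint32_t start)
{
	assert(he != NULL);
	demo_comm_init(&hr->comm);
	hr->inode = he;
	hr->start = start;
}

/* Map a common part back to the range item that contains it. */
struct demo_range_item *demo_range_from_comm(struct demo_comm_item *c)
{
	return demo_container_of(c, struct demo_range_item, comm);
}
```

Embedding the common part lets one set of helpers operate on
frequency data and reference counts for both item types.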

diff --git a/fs/Makefile b/fs/Makefile
index 1d7af79..f966dea 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o statfs.o
+		stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 0000000..634ec03
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,138 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/types.h>
+#include <linux/limits.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep;
+static struct kmem_cache *hot_range_item_cachep;
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+static void hot_inode_tree_init(struct hot_info *root)
+{
+	INIT_RADIX_TREE(&root->hot_inode_tree, GFP_ATOMIC);
+	spin_lock_init(&root->lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_range_tree_init(struct hot_inode_item *he)
+{
+	INIT_RADIX_TREE(&he->hot_range_tree, GFP_ATOMIC);
+	spin_lock_init(&he->lock);
+}
+
+/*
+ * Initialize a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_range_item()
+ */
+static void hot_range_item_init(struct hot_range_item *hr, u32 start,
+				struct hot_inode_item *he)
+{
+	hr->start = start;
+	hr->len = RANGE_SIZE;
+	hr->hot_inode = he;
+	kref_init(&hr->hot_range.refs);
+	spin_lock_init(&hr->hot_range.lock);
+	hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
+	hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
+	hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
+}
+
+/*
+ * Initialize a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using hot_free_inode_item()
+ */
+static void hot_inode_item_init(struct hot_inode_item *he, u64 ino,
+				struct radix_tree_root *hot_inode_tree)
+{
+	he->i_ino = ino;
+	he->hot_inode_tree = hot_inode_tree;
+	kref_init(&he->hot_inode.refs);
+	spin_lock_init(&he->hot_inode.lock);
+	he->hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
+	he->hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
+	he->hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
+	hot_range_tree_init(he);
+}
+
+/*
+ * Initialize kmem cache for hot_inode_item and hot_range_item.
+ */
+static int __init hot_cache_init(void)
+{
+	hot_inode_item_cachep = kmem_cache_create("hot_inode_item",
+			sizeof(struct hot_inode_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			NULL);
+	if (!hot_inode_item_cachep)
+		goto inode_err;
+
+	hot_range_item_cachep = kmem_cache_create("hot_range_item",
+			sizeof(struct hot_range_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			NULL);
+	if (!hot_range_item_cachep)
+		goto range_err;
+
+	return 0;
+
+range_err:
+	kmem_cache_destroy(hot_inode_item_cachep);
+inode_err:
+	return -ENOMEM;
+}
+
+static inline void hot_cache_exit(void)
+{
+	if (hot_range_item_cachep)
+		kmem_cache_destroy(hot_range_item_cachep);
+
+	if (hot_inode_item_cachep)
+		kmem_cache_destroy(hot_inode_item_cachep);
+}
+
+/*
+ * Initialize the data structures for hot data tracking.
+ */
+void hot_track_init(struct super_block *sb)
+{
+	int err;
+
+	err = hot_cache_init();
+	if (err) {
+		printk(KERN_ERR "%s: hot_track_cache_init error: %d\n",
+				__func__, err);
+		return;
+	}
+}
+
+void hot_track_exit(struct super_block *sb)
+{
+	hot_cache_exit();
+}
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
new file mode 100644
index 0000000..4e8aa77
--- /dev/null
+++ b/fs/hot_tracking.h
@@ -0,0 +1,26 @@
+/*
+ * fs/hot_tracking.h
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_TRACKING__
+#define __HOT_TRACKING__
+
+#include <linux/radix-tree.h>
+#include <linux/workqueue.h>
+#include <linux/hot_tracking.h>
+
+/* values for hot_freq_data flags */
+#define FREQ_DATA_TYPE_INODE (1 << 0)
+#define FREQ_DATA_TYPE_RANGE (1 << 1)
+
+void hot_track_init(struct super_block *sb);
+void hot_track_exit(struct super_block *sb);
+
+#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
new file mode 100644
index 0000000..78adb0d
--- /dev/null
+++ b/include/linux/hot_tracking.h
@@ -0,0 +1,74 @@
+/*
+ *  include/linux/hot_tracking.h
+ *
+ * This file has definitions for VFS hot data tracking
+ * structures etc.
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_HOTTRACK_H
+#define _LINUX_HOTTRACK_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/fs.h>
+
+/*
+ * A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */
+struct hot_freq_data {
+	struct timespec last_read_time;
+	struct timespec last_write_time;
+	u32 nr_reads;
+	u32 nr_writes;
+	u64 avg_delta_reads;
+	u64 avg_delta_writes;
+	u32 flags;
+	u32 last_temperature;
+};
+
+/* The common info for both following structures */
+struct hot_comm_item {
+	struct hot_freq_data hot_freq_data;  /* frequency data */
+	spinlock_t lock; /* protects object data */
+	struct kref refs;  /* prevents kfree */
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+	struct hot_comm_item hot_inode; /* node in hot_inode_tree */
+	struct radix_tree_root hot_range_tree; /* tree of ranges */
+	spinlock_t lock; /* protect range tree */
+	struct radix_tree_root *hot_inode_tree;
+	u64 i_ino; /* inode number from inode */
+};
+
+/*
+ * An item representing a range inside of
+ * an inode whose frequency is being tracked
+ */
+struct hot_range_item {
+	struct hot_comm_item hot_range;
+	struct hot_inode_item *hot_inode; /* associated hot_inode_item */
+	u32 start; /* item index in hot_range_tree */
+	u32 len; /* length in bytes */
+};
+
+struct hot_info {
+	struct radix_tree_root hot_inode_tree;
+	spinlock_t lock; /*protect inode tree */
+};
+
+extern void hot_track_init(struct super_block *sb);
+extern void hot_track_exit(struct super_block *sb);
+
+#endif  /* _LINUX_HOTTRACK_H */
-- 
1.7.6.5



* [RFC v3 03/13] vfs: Initialize and free main data structures
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
  2012-10-10 10:07 ` [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' zwu.kernel
  2012-10-10 10:07 ` [RFC v3 02/13] vfs: introduce private radix tree structures zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 10:07 ` [RFC v3 04/13] vfs: add function for collecting raw access info zwu.kernel
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add an initialization function that creates the key data
structures when hot tracking is enabled, and clean them up
when hot tracking is disabled.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/super.c             |    8 +++
 fs/hot_tracking.c            |  118 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h           |    3 +
 include/linux/hot_tracking.h |    4 ++
 4 files changed, 133 insertions(+), 0 deletions(-)
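
The teardown in this patch relies on reference counts: each put drops
one reference and the release callback frees the item once the count
reaches zero. A minimal userspace sketch of that pattern follows. The
demo_* names are hypothetical, and unlike the kernel's kref this
counter is not atomic, so the sketch is single-threaded only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for struct kref; NOT atomic. */
struct demo_ref {
	int count;
};

/* Stand-in for an item with an embedded reference count. */
struct demo_item {
	struct demo_ref refs;
	int *freed_flag;	/* hypothetical: lets a caller see the release */
};

void demo_ref_get(struct demo_ref *r)
{
	assert(r->count > 0);
	r->count++;
}

/*
 * Drop one reference and call release() when the count hits zero.
 * Returns 1 if the object was released, 0 otherwise.
 */
int demo_ref_put(struct demo_ref *r, void (*release)(struct demo_ref *))
{
	assert(r->count > 0);
	if (--r->count == 0) {
		release(r);
		return 1;
	}
	return 0;
}

/* Release callback: recover the enclosing item, then free it. */
void demo_item_release(struct demo_ref *r)
{
	struct demo_item *it = (struct demo_item *)
		((char *)r - offsetof(struct demo_item, refs));

	if (it->freed_flag)
		*it->freed_flag = 1;
	free(it);
}

/* Allocate an item holding one reference, or return NULL on failure. */
struct demo_item *demo_item_new(int *freed_flag)
{
	struct demo_item *it = malloc(sizeof(*it));

	if (!it)
		return NULL;
	it->refs.count = 1;
	it->freed_flag = freed_flag;
	return it;
}
```

The caller never frees an item directly; it only drops its own
reference, and whoever drops the last one triggers the release.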

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 00be9e3..da4438f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -294,6 +294,10 @@ static void btrfs_put_super(struct super_block *sb)
 	 * last process that kept it busy.  Or segfault in the aforementioned
 	 * process...  Whom would you report that to?
 	 */
+
+	/* Hot data tracking */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+		hot_track_exit(sb);
 }
 
 enum {
@@ -828,6 +832,10 @@ static int btrfs_fill_super(struct super_block *sb,
 		goto fail_close;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+		hot_track_init(sb);
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 634ec03..5fd993e 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -21,6 +21,8 @@
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
+struct hot_info *global_hot_tracking_info;
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep;
 static struct kmem_cache *hot_range_item_cachep;
@@ -81,6 +83,97 @@ static void hot_inode_item_init(struct hot_inode_item *he, u64 ino,
 	hot_range_tree_init(he);
 }
 
+static void hot_range_item_free(struct kref *kref)
+{
+	struct hot_comm_item *comm_item = container_of(kref,
+		struct hot_comm_item, refs);
+	struct hot_range_item *hr = container_of(comm_item,
+		struct hot_range_item, hot_range);
+
+	radix_tree_delete(&hr->hot_inode->hot_range_tree, hr->start);
+	kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+/*
+ * Drops the reference out on hot_range_item by one
+ * and free the structure
+ * if the reference count hits zero
+ */
+static void hot_range_item_put(struct hot_range_item *hr)
+{
+	kref_put(&hr->hot_range.refs, hot_range_item_free);
+}
+
+/* Frees the entire hot_range_tree. */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+	struct hot_range_item *hr_nodes[8];
+	u32 start = 0;
+	int i, n;
+
+	while (1) {
+		spin_lock(&he->lock);
+		n = radix_tree_gang_lookup(&he->hot_range_tree,
+					(void **)hr_nodes, start,
+					ARRAY_SIZE(hr_nodes));
+		if (!n) {
+			spin_unlock(&he->lock);
+			break;
+		}
+
+		start = hr_nodes[n - 1]->start + 1;
+		for (i = 0; i < n; i++)
+			hot_range_item_put(hr_nodes[i]);
+		spin_unlock(&he->lock);
+	}
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+	struct hot_comm_item *comm_item = container_of(kref,
+			struct hot_comm_item, refs);
+	struct hot_inode_item *he = container_of(comm_item,
+			struct hot_inode_item, hot_inode);
+
+	hot_range_tree_free(he);
+	radix_tree_delete(he->hot_inode_tree, he->i_ino);
+	kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+/*
+ * Drops the reference out on hot_inode_item by one
+ * and free the structure
+ * if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+	kref_put(&he->hot_inode.refs, hot_inode_item_free);
+}
+
+/* Frees the entire hot_inode_tree. */
+static void hot_inode_tree_exit(struct hot_info *root)
+{
+	struct hot_inode_item *hi_nodes[8];
+	u64 ino = 0;
+	int i, n;
+
+	while (1) {
+		spin_lock(&root->lock);
+		n = radix_tree_gang_lookup(&root->hot_inode_tree,
+					   (void **)hi_nodes, ino,
+					   ARRAY_SIZE(hi_nodes));
+		if (!n) {
+			spin_unlock(&root->lock);
+			break;
+		}
+
+		ino = hi_nodes[n - 1]->i_ino + 1;
+		for (i = 0; i < n; i++)
+			hot_inode_item_put(hi_nodes[i]);
+		spin_unlock(&root->lock);
+	}
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -122,6 +215,7 @@ static inline void hot_cache_exit(void)
  */
 void hot_track_init(struct super_block *sb)
 {
+	struct hot_info *root;
 	int err;
 
 	err = hot_cache_init();
@@ -130,9 +224,33 @@ void hot_track_init(struct super_block *sb)
 				__func__, err);
 		return;
 	}
+
+	root = kmalloc(sizeof(struct hot_info), GFP_NOFS);
+	if (!root) {
+		printk(KERN_ERR "%s: failed to malloc memory for "
+				"global_hot_tracking_info: %d\n",
+				__func__, err);
+		goto failed_root;
+	}
+
+	global_hot_tracking_info = root;
+	sb->hot_flags |= MS_HOT_TRACKING;
+	hot_inode_tree_init(root);
+
+	printk(KERN_INFO "vfs: turning on hot data tracking\n");
+
+	return;
+
+failed_root:
+	hot_cache_exit();
 }
 
 void hot_track_exit(struct super_block *sb)
 {
+	struct hot_info *root = global_hot_tracking_info;
+
+	hot_inode_tree_exit(root);
+	sb->hot_flags &= ~MS_HOT_TRACKING;
 	hot_cache_exit();
+	kfree(root);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c617ed0..3b1a389 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1582,6 +1582,9 @@ struct super_block {
 
 	/* Being remounted read-only */
 	int s_readonly_remount;
+
+	/* Hot data tracking*/
+	unsigned long hot_flags;
 };
 
 /* superblock cache pruning functions */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 78adb0d..13aa54b 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -20,6 +20,8 @@
 #include <linux/kref.h>
 #include <linux/fs.h>
 
+#define MS_HOT_TRACKING	(1<<0)
+
 /*
  * A frequency data struct holds values that are used to
  * determine temperature of files and file ranges. These structs
@@ -68,6 +70,8 @@ struct hot_info {
 	spinlock_t lock; /*protect inode tree */
 };
 
+extern struct hot_info *global_hot_tracking_info;
+
 extern void hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
 
-- 
1.7.6.5



* [RFC v3 04/13] vfs: add function for collecting raw access info
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (2 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 03/13] vfs: Initialize and free main data structures zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 10:07 ` [RFC v3 05/13] vfs: add two map arrays zwu.kernel
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add some helper functions to update the access frequencies
of one file or its ranges.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            |  190 ++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   12 +++
 include/linux/hot_tracking.h |    4 +
 3 files changed, 206 insertions(+), 0 deletions(-)
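
Two pieces of arithmetic in this patch are easy to check in
isolation: the running average in hot_average_update() and the
rounding of a byte range to RANGE_SIZE units in hot_update_freqs().
The userspace sketch below copies both expressions; the demo_* names
are hypothetical and the delta is passed in nanoseconds directly
instead of being derived from two timespecs.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FREQ_POWER	4	/* same value as FREQ_POWER */
#define DEMO_RANGE_BITS	20	/* same value as RANGE_BITS */
#define DEMO_RANGE_SIZE	(1 << DEMO_RANGE_BITS)

/* Same expression as hot_average_update(), with the delta given in ns. */
uint64_t demo_average_update(uint64_t old_avg, uint64_t delta_ns)
{
	uint64_t new_delta = delta_ns >> DEMO_FREQ_POWER;
	uint64_t new_avg;

	new_avg = (old_avg << DEMO_FREQ_POWER) - old_avg + new_delta;
	return new_avg >> DEMO_FREQ_POWER;
}

/* Index of the first range touched by an access starting at 'start'. */
uint32_t demo_range_first(uint64_t start)
{
	return (uint32_t)(start >> DEMO_RANGE_BITS);
}

/* One past the index of the last range touched by [start, start + len). */
uint32_t demo_range_end(uint64_t start, uint64_t len)
{
	assert(len > 0);
	return (uint32_t)((start + len + DEMO_RANGE_SIZE - 1)
			  >> DEMO_RANGE_BITS);
}
```

With old_avg = 1600 and a new delta of 1600 ns, the sketch returns
1506. As written, the new delta is shifted right by FREQ_POWER twice,
so the result is roughly old_avg * 15/16 + delta / 256; whether that
double shift is intended is not stated in the thread.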

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 5fd993e..86c87c7 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -174,6 +174,196 @@ static void hot_inode_tree_exit(struct hot_info *root)
 	}
 }
 
+struct hot_inode_item
+*hot_inode_item_find(struct hot_info *root, u64 ino)
+{
+	struct hot_inode_item *he;
+	int ret;
+
+again:
+	spin_lock(&root->lock);
+	he = radix_tree_lookup(&root->hot_inode_tree, ino);
+	if (he) {
+		kref_get(&he->hot_inode.refs);
+		spin_unlock(&root->lock);
+		return he;
+	}
+	spin_unlock(&root->lock);
+
+	he = kmem_cache_zalloc(hot_inode_item_cachep,
+				GFP_KERNEL | GFP_NOFS);
+	if (!he)
+		return ERR_PTR(-ENOMEM);
+
+	hot_inode_item_init(he, ino, &root->hot_inode_tree);
+
+	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
+	if (ret) {
+		kmem_cache_free(hot_inode_item_cachep, he);
+		return ERR_PTR(ret);
+	}
+
+	spin_lock(&root->lock);
+	ret = radix_tree_insert(&root->hot_inode_tree, ino, he);
+	if (ret == -EEXIST) {
+		kmem_cache_free(hot_inode_item_cachep, he);
+		spin_unlock(&root->lock);
+		radix_tree_preload_end();
+		goto again;
+	}
+	spin_unlock(&root->lock);
+	radix_tree_preload_end();
+
+	kref_get(&he->hot_inode.refs);
+	return he;
+}
+
+static struct hot_range_item
+*hot_range_item_find(struct hot_inode_item *he,
+			u32 start)
+{
+	struct hot_range_item *hr;
+	int ret;
+
+again:
+	spin_lock(&he->lock);
+	hr = radix_tree_lookup(&he->hot_range_tree, start);
+	if (hr) {
+		kref_get(&hr->hot_range.refs);
+		spin_unlock(&he->lock);
+		return hr;
+	}
+	spin_unlock(&he->lock);
+
+	hr = kmem_cache_zalloc(hot_range_item_cachep,
+				GFP_KERNEL | GFP_NOFS);
+	if (!hr)
+		return ERR_PTR(-ENOMEM);
+
+	hot_range_item_init(hr, start, he);
+
+	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
+	if (ret) {
+		kmem_cache_free(hot_range_item_cachep, hr);
+		return ERR_PTR(ret);
+	}
+
+	spin_lock(&he->lock);
+	ret = radix_tree_insert(&he->hot_range_tree, start, hr);
+	if (ret == -EEXIST) {
+		kmem_cache_free(hot_range_item_cachep, hr);
+		spin_unlock(&he->lock);
+		radix_tree_preload_end();
+		goto again;
+	}
+	spin_unlock(&he->lock);
+	radix_tree_preload_end();
+
+	kref_get(&hr->hot_range.refs);
+	return hr;
+}
+
+/*
+ * This function does the actual work of updating the frequency numbers,
+ * whatever they turn out to be. FREQ_POWER determines how many atime
+ * deltas we keep track of (as a power of 2). So, setting it to anything above
+ * 16ish is probably overkill. Also, the higher the power, the more bits get
+ * right shifted out of the timestamp, reducing precision, so take note of that
+ * as well.
+ *
+ * The caller should have already locked freq_data's parent's spinlock.
+ *
+ * FREQ_POWER, defined immediately below, determines how heavily to weight
+ * the current frequency numbers against the newest access. For example, a value
+ * of 4 means that the new access information will be weighted 1/16th (ie 2^-4)
+ * as heavily as the existing frequency info. In essence, this is a kludged-
+ * together version of a weighted average, since we can't afford to keep all of
+ * the information that it would take to get a _real_ weighted average.
+ */
+static u64 hot_average_update(struct timespec old_atime,
+		struct timespec cur_time, u64 old_avg)
+{
+	struct timespec delta_ts;
+	u64 new_avg;
+	u64 new_delta;
+
+	delta_ts = timespec_sub(cur_time, old_atime);
+	new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+	new_avg = (old_avg << FREQ_POWER) - old_avg + new_delta;
+	new_avg = new_avg >> FREQ_POWER;
+
+	return new_avg;
+}
+
+static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
+{
+	struct timespec cur_time = current_kernel_time();
+
+	if (write) {
+		freq_data->nr_writes += 1;
+		freq_data->avg_delta_writes = hot_average_update(
+				freq_data->last_write_time,
+				cur_time,
+				freq_data->avg_delta_writes);
+		freq_data->last_write_time = cur_time;
+	} else {
+		freq_data->nr_reads += 1;
+		freq_data->avg_delta_reads = hot_average_update(
+				freq_data->last_read_time,
+				cur_time,
+				freq_data->avg_delta_reads);
+		freq_data->last_read_time = cur_time;
+	}
+}
+
+/*
+ * Main function to update access frequency from read/writepage(s) hooks
+ */
+inline void hot_update_freqs(struct hot_info *root,
+			struct inode *inode, u64 start,
+			u64 len, int rw)
+{
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	u32 cur, end;
+
+	if (!TRACK_THIS_INODE(inode) || (len == 0))
+		return;
+
+	he = hot_inode_item_find(root, inode->i_ino);
+	if (IS_ERR(he)) {
+		WARN_ON(1);
+		return;
+	}
+
+	spin_lock(&he->hot_inode.lock);
+	hot_freq_data_update(&he->hot_inode.hot_freq_data, rw);
+	spin_unlock(&he->hot_inode.lock);
+
+	/*
+	 * Align ranges on RANGE_SIZE boundary
+	 * to prevent proliferation of range structs
+	 */
+	end = (start + len + RANGE_SIZE - 1) >> RANGE_BITS;
+	for (cur = (start >> RANGE_BITS); cur < end; cur++) {
+		hr = hot_range_item_find(he, cur);
+		if (IS_ERR(hr)) {
+			WARN_ON(1);
+			hot_inode_item_put(he);
+			return;
+		}
+
+		spin_lock(&hr->hot_range.lock);
+		hot_freq_data_update(&hr->hot_range.hot_freq_data, rw);
+		spin_unlock(&hr->hot_range.lock);
+
+		hot_range_item_put(hr);
+	}
+
+	hot_inode_item_put(he);
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 4e8aa77..37f69ee 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -19,6 +19,18 @@
 /* values for hot_freq_data flags */
 #define FREQ_DATA_TYPE_INODE (1 << 0)
 #define FREQ_DATA_TYPE_RANGE (1 << 1)
+/* size of sub-file ranges */
+#define RANGE_BITS 20
+#define RANGE_SIZE (1 << RANGE_BITS)
+
+#define FREQ_POWER 4
+
+struct hot_inode_item
+*hot_inode_item_find(struct hot_info *root, u64 ino);
+void hot_inode_item_put(struct hot_inode_item *he);
+inline void hot_update_freqs(struct hot_info *root,
+                        struct inode *inode, u64 start,
+                        u64 len, int rw);
 
 void hot_track_init(struct super_block *sb);
 void hot_track_exit(struct super_block *sb);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 13aa54b..1e0aed5 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -75,4 +75,8 @@ extern struct hot_info *global_hot_tracking_info;
 extern void hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
 
+extern inline void hot_update_freqs(struct hot_info *root,
+                        struct inode *inode, u64 start,
+                        u64 len, int rw);
+
 #endif  /* _LINUX_HOTTRACK_H */
-- 
1.7.6.5



* [RFC v3 05/13] vfs: add two map arrays
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (3 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 04/13] vfs: add function for collecting raw access info zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 10:07 ` [RFC v3 06/13] vfs: add hooks to enable hot data tracking zwu.kernel
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

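  Add two map arrays, one for inodes and one for ranges. Each array
holds HEAT_MAP_SIZE list heads indexed by temperature, so that the
data temperature of a file or of its ranges can be looked up
efficiently.
  Each array element records the temperature that its list represents,
and each tracked item carries an n_list node so that it can be linked
into the list that matches its temperature.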

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
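The bucket layout described in the commit message can be sketched in
user space as follows. This is a minimal sketch, not the kernel code:
struct node is a hand-rolled stand-in for struct list_head, and
struct bucket, struct item, heat_map_move() and heat_map_count() are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define HEAT_MAP_BITS 8
#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)

/* Minimal circular doubly linked list, standing in for struct list_head. */
struct node {
	struct node *prev, *next;
};

static void node_init(struct node *n)
{
	n->prev = n->next = n;
}

static int node_unlinked(const struct node *n)
{
	return n->next == n;
}

static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* One bucket per temperature value, modelled on struct hot_map_head. */
struct bucket {
	struct node node_list;
	unsigned int temperature;
};

/* Hypothetical tracked item: only the fields this sketch needs. */
struct item {
	struct node n_list;
	unsigned int last_temperature;
};

static void heat_map_init(struct bucket *map)
{
	for (unsigned int i = 0; i < HEAT_MAP_SIZE; i++) {
		node_init(&map[i].node_list);
		map[i].temperature = i;
	}
}

/* Unlink the item from its current bucket, if any, and file it under temp. */
static void heat_map_move(struct bucket *map, struct item *it,
			  unsigned int temp)
{
	assert(temp < HEAT_MAP_SIZE);
	if (!node_unlinked(&it->n_list))
		node_del_init(&it->n_list);
	node_add_tail(&it->n_list, &map[temp].node_list);
	it->last_temperature = temp;
}

/* Count the items currently filed under one temperature. */
static size_t heat_map_count(const struct bucket *map, unsigned int temp)
{
	size_t n = 0;
	const struct node *p;

	for (p = map[temp].node_list.next; p != &map[temp].node_list;
	     p = p->next)
		n++;
	return n;
}
```

Moving an item costs O(1) because each item carries its own list node,
and listing everything at one temperature is a walk of a single bucket.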
 fs/hot_tracking.c            |   50 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h |   16 +++++++++++++
 2 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 86c87c7..08c42c5 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -60,6 +60,7 @@ static void hot_range_item_init(struct hot_range_item *hr, u32 start,
 	hr->hot_inode = he;
 	kref_init(&hr->hot_range.refs);
 	spin_lock_init(&hr->hot_range.lock);
+	INIT_LIST_HEAD(&hr->hot_range.n_list);
 	hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
 	hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
 	hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
@@ -77,6 +78,7 @@ static void hot_inode_item_init(struct hot_inode_item *he, u64 ino,
 	he->hot_inode_tree = hot_inode_tree;
 	kref_init(&he->hot_inode.refs);
 	spin_lock_init(&he->hot_inode.lock);
+	INIT_LIST_HEAD(&he->hot_inode.n_list);
 	he->hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
 	he->hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
 	he->hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
@@ -90,6 +92,11 @@ static void hot_range_item_free(struct kref *kref)
 	struct hot_range_item *hr = container_of(comm_item,
 		struct hot_range_item, hot_range);
 
+	spin_lock(&hr->hot_range.lock);
+	if (!list_empty(&hr->hot_range.n_list))
+		list_del_init(&hr->hot_range.n_list);
+	spin_unlock(&hr->hot_range.lock);
+
 	radix_tree_delete(&hr->hot_inode->hot_range_tree, hr->start);
 	kmem_cache_free(hot_range_item_cachep, hr);
 }
@@ -135,6 +142,11 @@ static void hot_inode_item_free(struct kref *kref)
 	struct hot_inode_item *he = container_of(comm_item,
 			struct hot_inode_item, hot_inode);
 
+	spin_lock(&he->hot_inode.lock);
+	if (!list_empty(&he->hot_inode.n_list))
+		list_del_init(&he->hot_inode.n_list);
+	spin_unlock(&he->hot_inode.lock);
+
 	hot_range_tree_free(he);
 	radix_tree_delete(he->hot_inode_tree, he->i_ino);
 	kmem_cache_free(hot_inode_item_cachep, he);
@@ -365,6 +377,42 @@ inline void hot_update_freqs(struct hot_info *root,
 }
 
 /*
+ * Initialize inode and range map arrays.
+ */
+static void hot_map_array_init(struct hot_info *root)
+{
+	int i;
+	for (i = 0; i < HEAT_MAP_SIZE; i++) {
+		INIT_LIST_HEAD(&root->heat_inode_map[i].node_list);
+		INIT_LIST_HEAD(&root->heat_range_map[i].node_list);
+		root->heat_inode_map[i].temperature = i;
+		root->heat_range_map[i].temperature = i;
+	}
+}
+
+static void hot_map_list_free(struct list_head *node_list)
+{
+	struct list_head *pos, *next;
+	struct hot_comm_item *node;
+
+	list_for_each_safe(pos, next, node_list) {
+		node = list_entry(pos, struct hot_comm_item, n_list);
+		list_del_init(&node->n_list);
+	}
+
+}
+
+/* Free inode and range map arrays */
+static void hot_map_array_exit(struct hot_info *root)
+{
+	int i;
+	for (i = 0; i < HEAT_MAP_SIZE; i++) {
+		hot_map_list_free(&root->heat_inode_map[i].node_list);
+		hot_map_list_free(&root->heat_range_map[i].node_list);
+	}
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 static int __init hot_cache_init(void)
@@ -426,6 +474,7 @@ void hot_track_init(struct super_block *sb)
 	global_hot_tracking_info = root;
 	sb->hot_flags |= MS_HOT_TRACKING;
 	hot_inode_tree_init(root);
+	hot_map_array_init(root);
 
 	printk(KERN_INFO "vfs: turning on hot data tracking\n");
 
@@ -439,6 +488,7 @@ void hot_track_exit(struct super_block *sb)
 {
 	struct hot_info *root = global_hot_tracking_info;
 
+	hot_map_array_exit(root);
 	hot_inode_tree_exit(root);
 	sb->hot_flags &= ~MS_HOT_TRACKING;
 	hot_cache_exit();
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 1e0aed5..7114179 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -22,6 +22,9 @@
 
 #define MS_HOT_TRACKING	(1<<0)
 
+#define HEAT_MAP_BITS 8
+#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)
+
 /*
  * A frequency data struct holds values that are used to
  * determine temperature of files and file ranges. These structs
@@ -38,11 +41,18 @@ struct hot_freq_data {
 	u32 last_temperature;
 };
 
+/* List heads in hot map array */
+struct hot_map_head {
+	struct list_head node_list;
+	u32 temperature;
+};
+
 /* The common info for both following structures */
 struct hot_comm_item {
 	struct hot_freq_data hot_freq_data;  /* frequency data */
 	spinlock_t lock; /* protects object data */
 	struct kref refs;  /* prevents kfree */
+	struct list_head n_list; /* list node index */
 };
 
 /* An item representing an inode and its access frequency */
@@ -68,6 +78,12 @@ struct hot_range_item {
 struct hot_info {
 	struct radix_tree_root hot_inode_tree;
 	spinlock_t lock; /*protect inode tree */
+
+	/* map of inode temperature */
+	struct hot_map_head heat_inode_map[HEAT_MAP_SIZE];
+
+	/* map of range temperature */
+	struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
 };
 
 extern struct hot_info *global_hot_tracking_info;
-- 
1.7.6.5



* [RFC v3 06/13] vfs: add hooks to enable hot data tracking
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (4 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 05/13] vfs: add two map arrays zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 10:07 ` [RFC v3 07/13] vfs: add function for updating map arrays zwu.kernel
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add hooks in the direct I/O, page cache read, readahead and writeback
paths that call hot_update_freqs() to record each access. Also add the
TRACK_THIS_INODE() guard macro.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
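The hooks pass hot_update_freqs() a byte offset and a byte length that
are derived from page indexes and page counts. The user-space sketch
below assumes 4 KiB pages and a 32-bit page index type (as pgoff_t
would be on a 32-bit kernel) and shows why the index should be widened
to 64 bits before it is shifted. Every name prefixed with sketch_ or
SKETCH_, as well as page_to_offset(), page_to_offset_narrow() and
written_bytes(), is hypothetical.

```c
#include <stdint.h>

/* Assumed 4 KiB pages; PAGE_CACHE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE (1ULL << SKETCH_PAGE_SHIFT)

/* Stand-in for pgoff_t on a 32-bit kernel. */
typedef uint32_t sketch_pgoff_t;

/* Byte offset of a page: widen to 64 bits first, then shift. */
static uint64_t page_to_offset(sketch_pgoff_t index)
{
	return (uint64_t)index << SKETCH_PAGE_SHIFT;
}

/* Same shift done in the narrow type: the high bits are lost first. */
static uint64_t page_to_offset_narrow(sketch_pgoff_t index)
{
	sketch_pgoff_t narrow = index << SKETCH_PAGE_SHIFT;

	return narrow;
}

/* Bytes written by one writeback pass, from nr_to_write before and after. */
static uint64_t written_bytes(long nr_before, long nr_after)
{
	if (nr_after >= nr_before)
		return 0;	/* nothing was written */
	return (uint64_t)(nr_before - nr_after) * SKETCH_PAGE_SIZE;
}
```

In this sketch a page index of 0x200000 maps to byte offset
0x200000000 with the widened shift, while the narrow shift yields 0.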
 fs/direct-io.c      |    8 ++++++++
 fs/hot_tracking.h   |    5 +++++
 mm/filemap.c        |    7 +++++++
 mm/page-writeback.c |   13 +++++++++++++
 mm/readahead.c      |    7 +++++++
 5 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..8960024 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1297,6 +1298,13 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	prefetch(bdev->bd_queue);
 	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+	/* Hot data tracking */
+	hot_update_freqs(global_hot_tracking_info,
+			iocb->ki_filp->f_mapping->host,
+			(u64)offset,
+			(u64)iov_length(iov, nr_segs),
+			rw & WRITE);
+
 	return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 				     nr_segs, get_block, end_io,
 				     submit_io, flags);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 37f69ee..42e0273 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -16,6 +16,11 @@
 #include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
+/* Hot data tracking -- guard macros */
+#define TRACK_THIS_INODE(inode) \
+	(((inode)->i_sb->hot_flags & MS_HOT_TRACKING) && \
+	!((inode)->i_flags & S_NOHOTDATATRACK))
+
 /* values for hot_freq_data flags */
 #define FREQ_DATA_TYPE_INODE (1 << 0)
 #define FREQ_DATA_TYPE_RANGE (1 << 1)
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..6b63b77 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 /*
@@ -1224,6 +1225,12 @@ readpage:
 		 * PG_error will be set again if readpage fails.
 		 */
 		ClearPageError(page);
+
+		/* Hot data tracking */
+		hot_update_freqs(global_hot_tracking_info, inode,
+				(u64)page->index << PAGE_CACHE_SHIFT,
+				PAGE_CACHE_SIZE, 0);
+
 		/* Start the actual read. The read will unlock the page. */
 		error = mapping->a_ops->readpage(filp, page);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5ad5ce2..cf5a1c8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
 #include <linux/timer.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -1895,13 +1896,25 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
+	u64 start = 0;
+	u64 count = 0;
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	start = (u64)mapping->writeback_index << PAGE_CACHE_SHIFT;
+	count = (u64)wbc->nr_to_write;
+
 	if (mapping->a_ops->writepages)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
+
+	/* Hot data tracking */
+	hot_update_freqs(global_hot_tracking_info,
+			mapping->host, (u64)start,
+			(count - (u64)wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
+
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..b62f1bb 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/hot_tracking.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -138,6 +139,12 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 out:
 	blk_finish_plug(&plug);
 
+	/* Hot data tracking */
+	hot_update_freqs(global_hot_tracking_info,
+			mapping->host, (u64)(list_entry(pages->prev,
+				struct page, lru)->index) << PAGE_CACHE_SHIFT,
+			(u64)nr_pages * PAGE_CACHE_SIZE, 0);
+
 	return ret;
 }
 
-- 
1.7.6.5



* [RFC v3 07/13] vfs: add function for updating map arrays
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (5 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 06/13] vfs: add hooks to enable hot data tracking zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 10:07 ` [RFC v3 08/13] vfs: add aging function for old map info zwu.kernel
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
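As a worked example of the temperature calculation added by this patch,
here is a user-space sketch that reuses the *_POWER constants from
fs/hot_tracking.h. It is not the kernel function: it takes ages in
nanoseconds instead of timestamps, it sums in 64 bits, and
count_heat(), age_heat(), delta_heat() and sketch_temperature() are
hypothetical names.

```c
#include <stdint.h>

#define HEAT_MAP_BITS 8
#define NRR_MULTIPLIER_POWER 20
#define NRR_COEFF_POWER 0
#define NRW_MULTIPLIER_POWER 20
#define NRW_COEFF_POWER 0
#define LTR_DIVIDER_POWER 30
#define LTR_COEFF_POWER 1
#define LTW_DIVIDER_POWER 30
#define LTW_COEFF_POWER 1
#define AVR_DIVIDER_POWER 40
#define AVR_COEFF_POWER 0
#define AVW_DIVIDER_POWER 40
#define AVW_COEFF_POWER 0

/* Access count: scale up, truncate to 32 bits, then apply the weight. */
static uint32_t count_heat(uint32_t count, unsigned int mult,
			   unsigned int coeff)
{
	uint32_t heat = (uint32_t)((uint64_t)count << mult);

	return heat >> (3 - coeff);
}

/* Time since the last access: a small age gives a large heat. */
static uint32_t age_heat(uint64_t age_ns, unsigned int divider,
			 unsigned int coeff)
{
	uint64_t cold = age_ns >> divider;
	uint64_t heat = cold >= (1ULL << 32) ? 0 : (1ULL << 32) - cold;

	return (uint32_t)(heat >> (3 - coeff));
}

/* Average gap between accesses: a small gap gives a large heat. */
static uint32_t delta_heat(uint64_t avg_delta_ns, unsigned int divider,
			   unsigned int coeff)
{
	uint64_t heat = (UINT64_MAX - avg_delta_ns) >> divider;

	if (heat >= (1ULL << 32))
		heat = UINT32_MAX;
	return (uint32_t)(heat >> (3 - coeff));
}

/* Combine the six criteria and scale the sum down to HEAT_MAP_BITS bits. */
static int sketch_temperature(uint32_t nr_reads, uint32_t nr_writes,
			      uint64_t read_age_ns, uint64_t write_age_ns,
			      uint64_t avg_delta_reads,
			      uint64_t avg_delta_writes)
{
	uint64_t sum = 0;

	sum += count_heat(nr_reads, NRR_MULTIPLIER_POWER, NRR_COEFF_POWER);
	sum += count_heat(nr_writes, NRW_MULTIPLIER_POWER, NRW_COEFF_POWER);
	sum += age_heat(read_age_ns, LTR_DIVIDER_POWER, LTR_COEFF_POWER);
	sum += age_heat(write_age_ns, LTW_DIVIDER_POWER, LTW_COEFF_POWER);
	sum += delta_heat(avg_delta_reads, AVR_DIVIDER_POWER, AVR_COEFF_POWER);
	sum += delta_heat(avg_delta_writes, AVW_DIVIDER_POWER, AVW_COEFF_POWER);

	return (int)(sum >> (32 - HEAT_MAP_BITS));
}
```

With these constants the sketch gives 0 for an item with no accesses,
very large ages and average deltas of (u64)-1; 64 for an item that was
read once just now; and 160 for an item with 2048 reads and 2048 writes,
zero ages and zero average deltas. Note that in the sketch the access
count term wraps to 0 once the count reaches 4096, because the shifted
value no longer fits in 32 bits.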
 fs/hot_tracking.c |  153 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h |   60 +++++++++++++++++++++
 2 files changed, 213 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 08c42c5..717faa7 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -376,6 +376,159 @@ inline void hot_update_freqs(struct hot_info *root,
 	hot_inode_item_put(he);
 }
 
+static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
+{
+	if (dir)
+		return counter << bits;
+	else
+		return counter >> bits;
+}
+
+/*
+ * hot_temperature_calculate() is responsible for distilling the six heat
+ * criteria (which are described in detail in hot_tracking.h) down into a
+ * single temperature value for the data, which is an integer between 0
+ * and HEAT_MAX_VALUE.
+ *
+ * To accomplish this, the raw values from the hot_freq_data structure
+ * are shifted various ways in order to make the temperature calculation more
+ * or less sensitive to each value.
+ *
+ * Once this calibration has happened, we do some additional normalization and
+ * make sure that everything fits nicely in a u32. From there, we take a very
+ * rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+ * values act as weights for the average.
+ *
+ * Finally, we use the HEAT_MAP_BITS value, which determines the size of the
+ * heat map arrays, to normalize the temperature to the proper granularity.
+ */
+int hot_temperature_calculate(struct hot_freq_data *freq_data)
+{
+	u64 result = 0;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+
+	u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads,
+					NRR_MULTIPLIER_POWER, true);
+	u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes,
+					NRW_MULTIPLIER_POWER, true);
+
+	u64 ltr_heat =
+	hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
+			LTR_DIVIDER_POWER, false);
+	u64 ltw_heat =
+	hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
+			LTW_DIVIDER_POWER, false);
+
+	u64 avr_heat =
+	hot_raw_shift((((u64) -1) - freq_data->avg_delta_reads),
+			AVR_DIVIDER_POWER, false);
+	u64 avw_heat =
+	hot_raw_shift((((u64) -1) - freq_data->avg_delta_writes),
+			AVW_DIVIDER_POWER, false);
+
+	/* ltr_heat is now guaranteed to be u32 safe */
+	if (ltr_heat >= hot_raw_shift((u64) 1, 32, true))
+		ltr_heat = 0;
+	else
+		ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat;
+
+	/* ltw_heat is now guaranteed to be u32 safe */
+	if (ltw_heat >= hot_raw_shift((u64) 1, 32, true))
+		ltw_heat = 0;
+	else
+		ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat;
+
+	/* avr_heat is now guaranteed to be u32 safe */
+	if (avr_heat >= hot_raw_shift((u64) 1, 32, true))
+		avr_heat = (u32) -1;
+
+	/* avw_heat is now guaranteed to be u32 safe */
+	if (avw_heat >= hot_raw_shift((u64) 1, 32, true))
+		avw_heat = (u32) -1;
+
+	nrr_heat = (u32)hot_raw_shift((u64)nrr_heat,
+		(3 - NRR_COEFF_POWER), false);
+	nrw_heat = (u32)hot_raw_shift((u64)nrw_heat,
+		(3 - NRW_COEFF_POWER), false);
+	ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+	ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+	avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+	avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+	result = nrr_heat + nrw_heat + (u32) ltr_heat +
+		(u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+	return result >> (32 - HEAT_MAP_BITS);
+}
+
+/*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature
+ */
+static void hot_map_array_update(struct hot_freq_data *freq_data,
+				struct hot_info *root)
+{
+	struct hot_map_head *buckets, *cur_bucket;
+	struct hot_comm_item *comm_item;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	u32 temperature = 0;
+
+	comm_item = container_of(freq_data,
+			struct hot_comm_item, hot_freq_data);
+
+	if (freq_data->flags & FREQ_DATA_TYPE_INODE) {
+		he = container_of(comm_item,
+			struct hot_inode_item, hot_inode);
+		buckets = root->heat_inode_map;
+
+		if (he == NULL)
+			return;
+
+		spin_lock(&he->hot_inode.lock);
+		temperature = hot_temperature_calculate(freq_data);
+		spin_unlock(&he->hot_inode.lock);
+
+		spin_lock(&he->hot_inode.lock);
+		if (list_empty(&he->hot_inode.n_list)
+			|| (freq_data->last_temperature != temperature)) {
+			if (!list_empty(&he->hot_inode.n_list))
+				list_del_init(&he->hot_inode.n_list);
+
+			cur_bucket = buckets + temperature;
+			list_add_tail(&he->hot_inode.n_list, &cur_bucket->node_list);
+			freq_data->last_temperature = temperature;
+		}
+		spin_unlock(&he->hot_inode.lock);
+	} else if (freq_data->flags & FREQ_DATA_TYPE_RANGE) {
+		hr = container_of(comm_item,
+			struct hot_range_item, hot_range);
+		buckets = root->heat_range_map;
+
+		if (hr == NULL)
+			return;
+
+		spin_lock(&hr->hot_range.lock);
+		temperature = hot_temperature_calculate(freq_data);
+		spin_unlock(&hr->hot_range.lock);
+
+		spin_lock(&hr->hot_range.lock);
+		if (list_empty(&hr->hot_range.n_list)
+			|| (freq_data->last_temperature != temperature)) {
+			if (!list_empty(&hr->hot_range.n_list))
+				list_del_init(&hr->hot_range.n_list);
+
+			cur_bucket = buckets + temperature;
+			list_add_tail(&hr->hot_range.n_list, &cur_bucket->node_list);
+			freq_data->last_temperature = temperature;
+		}
+		spin_unlock(&hr->hot_range.lock);
+	}
+}
+
 /*
  * Initialize inode and range map arrays.
  */
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 42e0273..5a9517b 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -30,6 +30,64 @@
 
 #define FREQ_POWER 4
 
+/*
+ * The following comments explain what exactly comprises a unit of heat.
+ *
+ * Each of six values of heat are calculated and combined in order to form an
+ * overall temperature for the data:
+ *
+ * NRR - number of reads since mount
+ * NRW - number of writes since mount
+ * LTR - time elapsed since last read (ns)
+ * LTW - time elapsed since last write (ns)
+ * AVR - average delta between recent reads (ns)
+ * AVW - average delta between recent writes (ns)
+ *
+ * These values are divided (right-shifted) according to the *_DIVIDER_POWER
+ * values defined below to bring the numbers into a reasonable range. You can
+ * modify these values to fit your needs. However, each heat unit is a u32 and
+ * thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+ * carefully or else they could max out or be stuck at zero quite easily.
+ *
+ * (E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+ * delta would bring the temperature above zero, ever.)
+ *
+ * Finally, each value is added to the overall temperature between 0 and 8
+ * times, depending on its *_COEFF_POWER value. Note that the coefficients are
+ * also actually implemented with shifts, so take care to treat these values
+ * as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+ */
+
+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ *
+ * E.g., data with an average delta between 0 and 2^X ns
+ * will have a cold value of 0, which means a heat value
+ * equal to HEAT_MAX_VALUE.
+ */
+#define AVR_DIVIDER_POWER 40
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40
+#define AVW_COEFF_POWER 0
+
+struct hot_update_work {
+	struct work_struct work;
+	struct hot_info *hot_info;
+};
+
 struct hot_inode_item
 *hot_inode_item_find(struct hot_info *root, u64 ino);
 void hot_inode_item_put(struct hot_inode_item *he);
@@ -37,6 +95,8 @@ inline void hot_update_freqs(struct hot_info *root,
                         struct inode *inode, u64 start,
                         u64 len, int rw);
 
+int hot_temperature_calculate(struct hot_freq_data *freq_data);
+
 void hot_track_init(struct super_block *sb);
 void hot_track_exit(struct super_block *sb);
 
-- 
1.7.6.5



* [RFC v3 08/13] vfs: add aging function for old map info
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (6 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 07/13] vfs: add function for updating map arrays zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 10:07 ` [RFC v3 09/13] vfs: add one wq to update map info periodically zwu.kernel
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
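The aging test added here treats an item as aged only when both its
last read and its last write are more than TIME_TO_KICK seconds old.
A minimal user-space sketch of that rule follows; it takes plain
nanosecond values instead of struct timespec, and sketch_is_aging() is
a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIME_TO_KICK 400		/* seconds, as in the patch */
#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* True only if both the last read and the last write are too old. */
static bool sketch_is_aging(uint64_t now_ns, uint64_t last_read_ns,
			    uint64_t last_write_ns)
{
	uint64_t kick_ns = TIME_TO_KICK * SKETCH_NSEC_PER_SEC;

	return (now_ns - last_read_ns) > kick_ns &&
	       (now_ns - last_write_ns) > kick_ns;
}
```

The comparison is strict, so an item whose most recent access is exactly
TIME_TO_KICK seconds old is still kept.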
 fs/hot_tracking.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h |    6 +++++
 2 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 717faa7..a8dc599 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -376,6 +376,24 @@ inline void hot_update_freqs(struct hot_info *root,
 	hot_inode_item_put(he);
 }
 
+static bool hot_freq_data_is_aging(struct hot_freq_data *freq_data)
+{
+	bool ret = false;
+	struct timespec ckt = current_kernel_time();
+
+	u64 cur_time = timespec_to_ns(&ckt);
+	u64 last_read_ns =
+		(cur_time - timespec_to_ns(&freq_data->last_read_time));
+	u64 last_write_ns =
+		(cur_time - timespec_to_ns(&freq_data->last_write_time));
+	u64 kick_ns = TIME_TO_KICK * (u64)1000000000;
+
+	if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
+		ret = true;
+
+	return ret;
+}
+
 static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
 {
 	if (dir)
@@ -529,6 +547,45 @@ static void hot_map_array_update(struct hot_freq_data *freq_data,
 	}
 }
 
+/* Update temperatures for each range item for aging purposes */
+static void hot_range_update(struct hot_inode_item *he,
+					struct hot_info *root)
+{
+	struct hot_range_item *hr_nodes[8];
+	u32 start = 0;
+	bool range_is_aging;
+	int i, n;
+
+	while (1) {
+		spin_lock(&he->lock);
+		n = radix_tree_gang_lookup(&he->hot_range_tree,
+				(void **)hr_nodes, start,
+				ARRAY_SIZE(hr_nodes));
+		if (!n) {
+			spin_unlock(&he->lock);
+			break;
+		}
+
+		start = hr_nodes[n - 1]->start + 1;
+		for (i = 0; i < n; i++) {
+			kref_get(&hr_nodes[i]->hot_range.refs);
+			hot_map_array_update(
+				&hr_nodes[i]->hot_range.hot_freq_data, root);
+
+			spin_lock(&hr_nodes[i]->hot_range.lock);
+			range_is_aging = hot_freq_data_is_aging(
+					&hr_nodes[i]->hot_range.hot_freq_data);
+			spin_unlock(&hr_nodes[i]->hot_range.lock);
+
+			hot_range_item_put(hr_nodes[i]);
+			if (range_is_aging) {
+				hot_range_item_put(hr_nodes[i]);
+			}
+		}
+		spin_unlock(&he->lock);
+	}
+}
+
 /*
  * Initialize inode and range map arrays.
  */
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 5a9517b..d19e64a 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -31,6 +31,12 @@
 #define FREQ_POWER 4
 
 /*
+ * Time (in seconds) without any read or write after which
+ * an item is considered aged and is no longer tracked.
+ */
+#define TIME_TO_KICK 400
+
+/*
  * The following comments explain what exactly comprises a unit of heat.
  *
  * Each of six values of heat are calculated and combined in order to form an
-- 
1.7.6.5



* [RFC v3 09/13] vfs: add one wq to update map info periodically
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (7 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 08/13] vfs: add aging function for old map info zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-16  0:27   ` Dave Chinner
  2012-10-10 10:07 ` [RFC v3 10/13] vfs: register one memory shrinker zwu.kernel
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add a per-superblock workqueue and a work_struct
 to run periodic work to update map info on each superblock.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
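The update work walks the inode tree in batches of eight and resumes
each lookup just past the last key it saw. The user-space sketch below
shows only that batching pattern, not the workqueue or the locking.
It assumes a sorted array in place of the radix tree; gang_lookup() is
a hypothetical stand-in for radix_tree_gang_lookup(), and walk_all() is
a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 8

/* Copy up to max keys that are >= first from a sorted array into out. */
static size_t gang_lookup(const uint64_t *keys, size_t nkeys,
			  uint64_t first, uint64_t *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < nkeys && n < max; i++)
		if (keys[i] >= first)
			out[n++] = keys[i];
	return n;
}

/*
 * Visit every key in batches of BATCH. Adding the key to *sum stands in
 * for the per-item temperature update. Returns the number of keys seen.
 */
static size_t walk_all(const uint64_t *keys, size_t nkeys, uint64_t *sum)
{
	uint64_t batch[BATCH];
	uint64_t next = 0;
	size_t total = 0;
	size_t n;

	while ((n = gang_lookup(keys, nkeys, next, batch, BATCH)) != 0) {
		for (size_t i = 0; i < n; i++)
			*sum += batch[i];
		total += n;
		if (batch[n - 1] == UINT64_MAX)
			break;	/* next would wrap around to 0 */
		next = batch[n - 1] + 1;
	}
	return total;
}
```

Resuming from the last key plus one means that no cursor has to be kept
alive between batches, so a lock can be dropped and retaken between
lookups without losing the position.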
 fs/hot_tracking.c            |   94 ++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |    3 +
 include/linux/hot_tracking.h |    2 +
 3 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index a8dc599..f333c47 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,8 @@
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
@@ -623,6 +625,88 @@ static void hot_map_array_exit(struct hot_info *root)
 }
 
 /*
+ * Update temperatures for each hot inode item and
+ * hot range item for aging purposes
+ */
+static void hot_temperature_update_work(struct work_struct *work)
+{
+	struct hot_update_work *hot_work =
+			container_of(work, struct hot_update_work, work);
+	struct hot_info *root = hot_work->hot_info;
+	struct hot_inode_item *hi_nodes[8];
+	unsigned long delay = HZ * HEAT_UPDATE_DELAY;
+	u64 ino = 0;
+	int i, n;
+
+	do {
+		while (1) {
+			spin_lock(&root->lock);
+			n = radix_tree_gang_lookup(&root->hot_inode_tree,
+					   (void **)hi_nodes, ino,
+					   ARRAY_SIZE(hi_nodes));
+			if (!n) {
+				spin_unlock(&root->lock);
+				break;
+			}
+
+			ino = hi_nodes[n - 1]->i_ino + 1;
+			for (i = 0; i < n; i++) {
+				kref_get(&hi_nodes[i]->hot_inode.refs);
+				hot_map_array_update(
+					&hi_nodes[i]->hot_inode.hot_freq_data, root);
+				hot_range_update(hi_nodes[i], root);
+				hot_inode_item_put(hi_nodes[i]);
+			}
+			spin_unlock(&root->lock);
+		}
+
+		if (unlikely(freezing(current))) {
+			__refrigerator(true);
+		} else {
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (!kthread_should_stop()) {
+				schedule_timeout(delay);
+			}
+			__set_current_state(TASK_RUNNING);
+		}
+	} while (!kthread_should_stop());
+}
+
+static int hot_wq_init(struct hot_info *root)
+{
+	struct hot_update_work *hot_work;
+	int ret = 0;
+
+	root->update_wq = alloc_workqueue(
+		"hot_temperature_update", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+	if (!root->update_wq) {
+		printk(KERN_ERR "%s: failed to create "
+			"temperature update workqueue\n",
+			__func__);
+		return 1;
+	}
+
+	hot_work = kmalloc(sizeof(*hot_work), GFP_NOFS);
+	if (hot_work) {
+		hot_work->hot_info = root;
+		INIT_WORK(&hot_work->work, hot_temperature_update_work);
+		queue_work(root->update_wq, &hot_work->work);
+	} else {
+		printk(KERN_ERR "%s: failed to create update work\n",
+				__func__);
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static void hot_wq_exit(struct workqueue_struct *wq)
+{
+	flush_workqueue(wq);
+	destroy_workqueue(wq);
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 static int __init hot_cache_init(void)
@@ -686,10 +770,19 @@ void hot_track_init(struct super_block *sb)
 	hot_inode_tree_init(root);
 	hot_map_array_init(root);
 
+	err = hot_wq_init(root);
+	if (err)
+		goto failed_wq;
+
 	printk(KERN_INFO "vfs: turning on hot data tracking\n");
 
 	return;
 
+failed_wq:
+	hot_map_array_exit(root);
+	hot_inode_tree_exit(root);
+	sb->hot_flags &= ~MS_HOT_TRACKING;
+	kfree(root);
 failed_root:
 	hot_cache_exit();
 }
@@ -698,6 +791,7 @@ void hot_track_exit(struct super_block *sb)
 {
 	struct hot_info *root = global_hot_tracking_info;
 
+	hot_wq_exit(root->update_wq);
 	hot_map_array_exit(root);
 	hot_inode_tree_exit(root);
 	sb->hot_flags &= ~MS_HOT_TRACKING;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index d19e64a..7a79a6d 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -36,6 +36,9 @@
  */
 #define TIME_TO_KICK 400
 
+/* set how often to update temperatures (seconds) */
+#define HEAT_UPDATE_DELAY 400
+
 /*
  * The following comments explain what exactly comprises a unit of heat.
  *
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 7114179..b37e0f8 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -84,6 +84,8 @@ struct hot_info {
 
 	/* map of range temperature */
 	struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
+
+	struct workqueue_struct *update_wq;
 };
 
 extern struct hot_info *global_hot_tracking_info;
-- 
1.7.6.5



* [RFC v3 10/13] vfs: register one memory shrinker
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (8 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 09/13] vfs: add one wq to update map info periodically zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 10:07 ` [RFC v3 11/13] vfs: add 3 new ioctl interfaces zwu.kernel
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Register a shrinker to control the amount of
memory that is used in tracking hot regions - if we are throwing
inodes out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking code is
using, even if it means losing useful information (i.e. the shrinker
accelerates the aging process).

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
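The pruning order used by the shrinker callback can be sketched in user
space as follows: buckets are visited from the coldest temperature
upwards, range items are pruned before inode items, and the unused part
of the budget is returned. The sketch replaces the lists with per-bucket
counters and assumes that every visited item is actually freed, which
the kernel code does not guarantee since it only drops a reference.
prune_coldest() and prune_all() are hypothetical names.

```c
#include <stddef.h>

#define HEAT_MAP_BITS 8
#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)

/*
 * Evict up to nr items, coldest bucket first. counts[i] is the number
 * of items at temperature i. Returns the unused budget.
 */
static unsigned long prune_coldest(unsigned long *counts, unsigned long nr)
{
	for (size_t i = 0; i < HEAT_MAP_SIZE && nr > 0; i++) {
		unsigned long take = counts[i] < nr ? counts[i] : nr;

		counts[i] -= take;
		nr -= take;
	}
	return nr;
}

/* Ranges are pruned first; inodes only get whatever budget is left. */
static unsigned long prune_all(unsigned long *range_counts,
			       unsigned long *inode_counts,
			       unsigned long nr)
{
	nr = prune_coldest(range_counts, nr);
	if (nr > 0)
		nr = prune_coldest(inode_counts, nr);
	return nr;
}
```

Pruning ranges first keeps the per-inode totals around for as long as
possible, at the cost of losing the finer-grained sub-file data earlier.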
 fs/hot_tracking.c            |   59 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h |    1 +
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index f333c47..fcde55e 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -742,6 +742,59 @@ static inline void hot_cache_exit(void)
 		kmem_cache_destroy(hot_inode_item_cachep);
 }
 
+static int hot_track_comm_prune(struct hot_map_head *map_head,
+				bool type, unsigned long nr)
+{
+	struct list_head *pos, *next;
+	struct hot_comm_item *node;
+	int i;
+
+	for (i = 0; i < HEAT_MAP_SIZE; i++) {
+		list_for_each_safe(pos, next, &(map_head + i)->node_list) {
+			if (nr == 0)
+				break;
+			nr--;
+			node = list_entry(pos, struct hot_comm_item, n_list);
+			if (type) {
+				struct hot_inode_item *hot_inode =
+					container_of(node,
+					struct hot_inode_item, hot_inode);
+				hot_inode_item_put(hot_inode);
+			} else {
+				struct hot_range_item *hot_range =
+					container_of(node,
+					struct hot_range_item, hot_range);
+				hot_range_item_put(hot_range);
+			}
+		}
+
+		if (nr == 0)
+			break;
+	}
+
+	return nr;
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+			struct shrink_control *sc)
+{
+	struct hot_info *root =
+		container_of(shrink, struct hot_info, hot_shrink);
+	int ret;
+
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		return (sc->nr_to_scan == 0) ? 0 : -1;
+
+	ret = hot_track_comm_prune(root->heat_range_map,
+				false, sc->nr_to_scan);
+	if (ret > 0)
+		ret = hot_track_comm_prune(root->heat_inode_map,
+					true, ret);
+
+	return ret;
+}
+
 /*
  * Initialize the data structures for hot data tracking.
  */
@@ -774,6 +827,11 @@ void hot_track_init(struct super_block *sb)
 	if (err)
 		goto failed_wq;
 
+	/* Register a shrinker callback */
+	root->hot_shrink.shrink = hot_track_prune;
+	root->hot_shrink.seeks = DEFAULT_SEEKS;
+	register_shrinker(&root->hot_shrink);
+
 	printk(KERN_INFO "vfs: turning on hot data tracking\n");
 
 	return;
@@ -791,6 +849,7 @@ void hot_track_exit(struct super_block *sb)
 {
 	struct hot_info *root = global_hot_tracking_info;
 
+	unregister_shrinker(&root->hot_shrink);
 	hot_wq_exit(root->update_wq);
 	hot_map_array_exit(root);
 	hot_inode_tree_exit(root);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index b37e0f8..6f31090 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -86,6 +86,7 @@ struct hot_info {
 	struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
 
 	struct workqueue_struct *update_wq;
+	struct shrinker hot_shrink;
 };
 
 extern struct hot_info *global_hot_tracking_info;
-- 
1.7.6.5



* [RFC v3 11/13] vfs: add 3 new ioctl interfaces
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (9 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 10/13] vfs: register one memory shrinker zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-15  7:48   ` Dave Chinner
  2012-10-16  3:17   ` Dave Chinner
  2012-10-10 10:07 ` [RFC v3 12/13] vfs: add debugfs support zwu.kernel
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally, retrieve
the temperature from the hot data hash list instead of recalculating it.

  FS_IOC_GET_HEAT_OPTS: return an integer representing the current
state of hot data tracking and migration:

0 = do nothing
1 = track frequency of access

  FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
migration, as described above.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/compat_ioctl.c            |    9 +++
 fs/ioctl.c                   |  122 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h           |    1 +
 include/linux/hot_tracking.h |   22 ++++++++
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index f505402..820f4cc 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1398,6 +1399,11 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/* Hot data tracking */
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+COMPATIBLE_IOCTL(FS_IOC_SET_HEAT_OPTS)
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_OPTS)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
    but we don't want warnings on other file systems. So declare
    them as compatible here. */
@@ -1577,6 +1583,9 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
+	case FS_IOC_GET_HEAT_INFO:
+	case FS_IOC_SET_HEAT_OPTS:
+	case FS_IOC_GET_HEAT_OPTS:
 		if (S_ISREG(f.file->f_path.dentry->d_inode->i_mode))
 			break;
 		/*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 3bdad6d..35127ed 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include "hot_tracking.h"
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,118 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *file_inode;
+	struct file *file_filp;
+	struct hot_info *root = global_hot_tracking_info;
+	struct hot_heat_info *heat_info;
+	struct hot_inode_item *he;
+	int ret = 0;
+
+	heat_info = kmalloc(sizeof(struct hot_heat_info),
+				GFP_KERNEL | GFP_NOFS);
+
+	if (copy_from_user((void *) heat_info,
+			argp,
+			sizeof(struct hot_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	file_filp = filp_open(heat_info->filename, O_RDONLY, 0);
+	file_inode = file_filp->f_dentry->d_inode;
+	filp_close(file_filp, NULL);
+
+	he = hot_inode_item_find(root, file_inode->i_ino);
+	if (!he) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	spin_lock(&he->hot_inode.lock);
+	heat_info->avg_delta_reads =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_reads;
+	heat_info->avg_delta_writes =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_writes;
+	heat_info->last_read_time =
+		(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time);
+	heat_info->last_write_time =
+		(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time);
+	heat_info->num_reads =
+		(__u32) he->hot_inode.hot_freq_data.nr_reads;
+	heat_info->num_writes =
+		(__u32) he->hot_inode.hot_freq_data.nr_writes;
+
+	if (heat_info->live > 0) {
+		/*
+		 * got a request for live temperature,
+		 * call hot_temperature_calculate() to recalculate
+		 */
+		heat_info->temperature =
+			hot_temperature_calculate(&he->hot_inode.hot_freq_data);
+	} else {
+		/* not live temperature, get it from the hashlist */
+		heat_info->temperature = he->hot_inode.hot_freq_data.last_temperature;
+	}
+	spin_unlock(&he->hot_inode.lock);
+
+	hot_inode_item_put(he);
+
+	if (copy_to_user(argp, (void *) heat_info,
+			sizeof(struct hot_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+err:
+	kfree(heat_info);
+	return ret;
+}
+
+static int ioctl_heat_opts(struct file *file, void __user *argp, int set)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	unsigned long arg;
+	int ret = 0;
+
+	if (!set) {
+		arg = TRACK_THIS_INODE(inode) ? 1 : 0;
+
+		if (copy_to_user(argp, (void *) &arg, sizeof(unsigned long)) != 0)
+			ret = -EFAULT;
+	} else {
+		if (copy_from_user((void *) &arg, argp, sizeof(unsigned long)) != 0) {
+			ret = -EFAULT;
+		} else {
+			switch (arg) {
+			case 0: /* track nothing */
+				/* set S_NOHOTDATATRACK */
+				inode->i_flags |= S_NOHOTDATATRACK;
+				break;
+			case 1: /* do tracking */
+				/* clear S_NOHOTDATATRACK */
+				inode->i_flags &= ~S_NOHOTDATATRACK;
+				break;
+			default:
+				ret = -EINVAL;
+			}
+		}
+	}
+
+	return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +704,15 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FS_IOC_GET_HEAT_INFO:
+		return ioctl_heat_info(filp, argp);
+
+	case FS_IOC_SET_HEAT_OPTS:
+		return ioctl_heat_opts(filp, argp, 1);
+
+	case FS_IOC_GET_HEAT_OPTS:
+		return ioctl_heat_opts(filp, argp, 0);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b1a389..c2e2d0f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -256,6 +256,7 @@ struct inodes_stat_t {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
+#define S_NOHOTDATATRACK (1 << 13)	/* Do not track hot data for this inode */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 6f31090..e3ca136 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -41,6 +41,18 @@ struct hot_freq_data {
 	u32 last_temperature;
 };
 
+struct hot_heat_info {
+	__u64 avg_delta_reads;
+	__u64 avg_delta_writes;
+	__u64 last_read_time;
+	__u64 last_write_time;
+	__u32 num_reads;
+	__u32 num_writes;
+	__u32 temperature;
+	__u8 live;
+	char filename[PATH_MAX];
+};
+
 /* List heads in hot map array */
 struct hot_map_head {
 	struct list_head node_list;
@@ -89,6 +101,16 @@ struct hot_info {
 	struct shrinker hot_shrink;
 };
 
+/*
+ * Hot data tracking ioctls:
+ *
+ * HOT_INFO - retrieve info on frequency of access
+ */
+#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, \
+                                struct hot_heat_info)
+#define FS_IOC_SET_HEAT_OPTS _IOW('f', 18, unsigned long)
+#define FS_IOC_GET_HEAT_OPTS _IOR('f', 19, unsigned long)
+
 extern struct hot_info *global_hot_tracking_info;
 
 extern void hot_track_init(struct super_block *sb);
-- 
1.7.6.5



* [RFC v3 12/13] vfs: add debugfs support
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (10 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 11/13] vfs: add 3 new ioctl interfaces zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-10 16:53   ` David Sterba
                     ` (3 more replies)
  2012-10-10 10:07 ` [RFC v3 13/13] vfs: add documentation zwu.kernel
                   ` (4 subsequent siblings)
  16 siblings, 4 replies; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
volume that contains two files. The first, `inode_data', contains the
heat information for inodes that have been brought into the hot data map
structures. The second, `range_data', contains similar information for
subfile ranges.
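
  As an illustration of how a tool might consume `inode_data', here is
a small userspace sketch, not part of the patch. It assumes each record
follows the format string used by __hot_debugfs_print_inode_freq_data()
in this patch; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed inode_data record (hypothetical userspace type). */
struct inode_heat {
	unsigned long ino;
	unsigned int reads;
	unsigned int writes;
	unsigned long long avg_read;
	unsigned long long avg_write;
	unsigned int temp;
};

/*
 * Parse one inode_data record. Whitespace in the format matches any
 * run of whitespace, so a record wrapped across lines still parses.
 * Returns 0 on success, -1 if the text is not an inode_data record.
 */
int parse_inode_heat(const char *line, struct inode_heat *out)
{
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line,
		   "inode #%lu, reads %u, writes %u, "
		   "avg read time %llu, avg write time %llu, temp %u",
		   &out->ino, &out->reads, &out->writes,
		   &out->avg_read, &out->avg_write, &out->temp);
	return n == 6 ? 0 : -1;
}
```

  A range_data record has extra "range start" and "range len" fields
and is rejected by this parser; a similar format string with the two
extra fields would be needed for it.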

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c |  462 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h |   43 +++++
 2 files changed, 505 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index fcde55e..60e93e6 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -20,6 +20,8 @@
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
+#include <linux/debugfs.h>
+#include <linux/vmalloc.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
@@ -29,6 +31,13 @@ struct hot_info *global_hot_tracking_info;
 static struct kmem_cache *hot_inode_item_cachep;
 static struct kmem_cache *hot_range_item_cachep;
 
+/* list to keep track of each mounted volume's debugfs_vol_data */
+static struct list_head hot_debugfs_vol_data_list;
+/* lock for debugfs_vol_data_list */
+static spinlock_t hot_debugfs_data_list_lock;
+/* pointer to top level debugfs dentry */
+static struct dentry *hot_debugfs_root_dentry;
+
 /*
  * Initialize the inode tree. Should be called for each new inode
  * access or other user of the hot_inode interface.
@@ -706,6 +715,451 @@ static void hot_wq_exit(struct workqueue_struct *wq)
 	destroy_workqueue(wq);
 }
 
+static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	uint new_log_alloc_size;
+	char *new_log;
+	static char err_msg[] = "No more memory!\n";
+
+	if (len >= data->log_alloc_size - debugfs_log->len) {
+		/*
+		 * Not enough room in the log buffer for the new message.
+		 * Allocate a bigger buffer.
+		 */
+		new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE;
+		new_log = vmalloc(new_log_alloc_size);
+
+		if (new_log) {
+			memcpy(new_log, debugfs_log->str, debugfs_log->len);
+			memset(new_log + debugfs_log->len, 0,
+				new_log_alloc_size - debugfs_log->len);
+			vfree(debugfs_log->str);
+			debugfs_log->str = new_log;
+			data->log_alloc_size = new_log_alloc_size;
+		} else {
+			WARN_ON(1);
+			if (data->log_alloc_size - debugfs_log->len) {
+				strlcpy(debugfs_log->str +
+				debugfs_log->len,
+				err_msg,
+				data->log_alloc_size - debugfs_log->len);
+				debugfs_log->len +=
+				min((typeof(debugfs_log->len))
+				sizeof(err_msg),
+				((typeof(debugfs_log->len))
+				data->log_alloc_size - debugfs_log->len));
+			}
+			return 0;
+		}
+	}
+
+	memcpy(debugfs_log->str + debugfs_log->len, msg, len);
+	debugfs_log->len += (unsigned long) len;
+
+	return len;
+}
+
+/* Returns the number of bytes written to the log. */
+static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	va_list args;
+	int len;
+	static char trunc_msg[] =
+			"The next message has been truncated.\n";
+
+	if (debugfs_log->str == NULL)
+		return -1;
+
+	spin_lock(&data->log_lock);
+
+	va_start(args, fmt);
+	len = vsnprintf(data->log_work_buff,
+			sizeof(data->log_work_buff), fmt, args);
+	va_end(args);
+
+	if (len >= sizeof(data->log_work_buff)) {
+		hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg));
+	}
+
+	len = hot_debugfs_copy(data, data->log_work_buff, len);
+	spin_unlock(&data->log_lock);
+
+	return len;
+}
+
+/* initialize a log corresponding to a fs volume */
+static int hot_debugfs_log_init(struct debugfs_vol_data *data)
+{
+	int err = 0;
+	struct lstring *debugfs_log = data->debugfs_log;
+
+	spin_lock(&data->log_lock);
+	debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE);
+	if (debugfs_log->str) {
+		memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE);
+		data->log_alloc_size = INIT_LOG_ALLOC_SIZE;
+	} else {
+		err = -ENOMEM;
+	}
+	spin_unlock(&data->log_lock);
+
+	return err;
+}
+
+/* free a log corresponding to a fs volume */
+static void hot_debugfs_log_exit(struct debugfs_vol_data *data)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+
+	spin_lock(&data->log_lock);
+	vfree(debugfs_log->str);
+	debugfs_log->str = NULL;
+	debugfs_log->len = 0;
+	spin_unlock(&data->log_lock);
+}
+
+/* debugfs open file override from fops table */
+static int __hot_debugfs_open(struct inode *inode, struct file *file)
+{
+	if (inode->i_private)
+		file->private_data = inode->i_private;
+
+	return 0;
+}
+
+static void __hot_debugfs_print_range_freq_data(
+			struct hot_inode_item *he,
+			struct hot_range_item *hr,
+			struct debugfs_vol_data *data,
+			struct hot_info *root)
+{
+	struct hot_freq_data *freq_data;
+
+	freq_data = &hr->hot_range.hot_freq_data;
+
+	/* Always lock hot_inode_item first */
+	spin_lock(&he->hot_inode.lock);
+	spin_lock(&hr->hot_range.lock);
+	hot_debugfs_log(data, "inode #%lu, range start " \
+			"%llu (range len %llu) reads %u, writes %u, "
+			"avg read time %llu, avg write time %llu, temp %u\n",
+			he->i_ino,
+			hr->start,
+			hr->len,
+			freq_data->nr_reads,
+			freq_data->nr_writes,
+			freq_data->avg_delta_reads,
+			freq_data->avg_delta_writes,
+			freq_data->last_temperature);
+	spin_unlock(&hr->hot_range.lock);
+	spin_unlock(&he->hot_inode.lock);
+}
+
+/*
+ * take the inode, find ranges associated with inode
+ * and print each range data struct
+ */
+static void __hot_debugfs_walk_range_tree(struct hot_inode_item *he,
+				struct debugfs_vol_data *data,
+				struct hot_info *root)
+{
+	struct hot_range_item *hr_nodes[8];
+	u32 start = 0;
+	int i, n;
+
+	/* Walk the hot_range_tree for inode */
+	while (1) {
+		spin_lock(&he->lock);
+		n = radix_tree_gang_lookup(&he->hot_range_tree,
+					   (void **)hr_nodes, start,
+					   ARRAY_SIZE(hr_nodes));
+		if (!n) {
+			spin_unlock(&he->lock);
+			break;
+		}
+
+		start = hr_nodes[n - 1]->start + 1;
+		for (i = 0; i < n; i++) {
+			kref_get(&hr_nodes[i]->hot_range.refs);
+			__hot_debugfs_print_range_freq_data(he,
+						hr_nodes[i], data, root);
+			hot_range_item_put(hr_nodes[i]);
+		}
+		spin_unlock(&he->lock);
+	}
+}
+
+/* Print frequency data for each freq data to log */
+static void __hot_debugfs_print_inode_freq_data(
+				struct hot_inode_item *he,
+				struct debugfs_vol_data *data,
+				struct hot_info *root)
+{
+	struct hot_freq_data *freq_data = &he->hot_inode.hot_freq_data;
+
+	spin_lock(&he->hot_inode.lock);
+	hot_debugfs_log(data, "inode #%lu, reads %u, writes %u, " \
+		"avg read time %llu, avg write time %llu, temp %u\n",
+		he->i_ino,
+		freq_data->nr_reads,
+		freq_data->nr_writes,
+		freq_data->avg_delta_reads,
+		freq_data->avg_delta_writes,
+		freq_data->last_temperature);
+	spin_unlock(&he->hot_inode.lock);
+}
+
+/* debugfs common read file override from fops table */
+static ssize_t __hot_debugfs_comm_read(struct file *file, char __user *user,
+					size_t count, loff_t *ppos,
+					hot_debugfs_walk_t private_walk)
+{
+	int err = 0;
+	struct hot_info *root;
+	struct debugfs_vol_data *data;
+	struct lstring *debugfs_log;
+	struct hot_inode_item *hi_nodes[8];
+	u64 ino = 0;
+	int i, n;
+
+	data = (struct debugfs_vol_data *) file->private_data;
+	root = global_hot_tracking_info;
+
+	if (!data->debugfs_log) {
+		/* initialize debugfs log corresponding to this volume */
+		debugfs_log = kmalloc(sizeof(struct lstring),
+					GFP_KERNEL | GFP_NOFS);
+		debugfs_log->str = NULL;
+		debugfs_log->len = 0;
+		data->debugfs_log = debugfs_log;
+		hot_debugfs_log_init(data);
+	}
+
+	if ((unsigned long) *ppos > 0) {
+		/* caller is continuing a previous read, don't walk tree */
+		if ((unsigned long) *ppos >= data->debugfs_log->len)
+			goto clean_up;
+
+		goto print_to_user;
+	}
+
+	/* walk the inode tree */
+	while (1) {
+		spin_lock(&root->lock);
+		n = radix_tree_gang_lookup(&root->hot_inode_tree,
+					   (void **)hi_nodes, ino,
+					   ARRAY_SIZE(hi_nodes));
+		if (!n) {
+			spin_unlock(&root->lock);
+			break;
+		}
+
+		ino = hi_nodes[n - 1]->i_ino + 1;
+		for (i = 0; i < n; i++) {
+			kref_get(&hi_nodes[i]->hot_inode.refs);
+			/* walk ranges, print data to debugfs log */
+			private_walk(hi_nodes[i], data, root);
+			hot_inode_item_put(hi_nodes[i]);
+		}
+		spin_unlock(&root->lock);
+	}
+
+print_to_user:
+	if (data->debugfs_log->len) {
+		err = simple_read_from_buffer(user, count, ppos,
+					data->debugfs_log->str,
+					data->debugfs_log->len);
+	}
+
+	return err;
+
+clean_up:
+	/* reader has finished the file, clean up */
+	hot_debugfs_log_exit(data);
+	kfree(data->debugfs_log);
+	data->debugfs_log = NULL;
+
+	return 0;
+}
+
+/* debugfs read file override from fops table */
+static ssize_t __hot_debugfs_range_read(struct file *file, char __user *user,
+					size_t count, loff_t *ppos)
+{
+	return __hot_debugfs_comm_read(file, user, count, ppos,
+				__hot_debugfs_walk_range_tree);
+}
+
+/* debugfs read file override from fops table */
+static ssize_t __hot_debugfs_inode_read(struct file *file, char __user *user,
+					size_t count, loff_t *ppos)
+{
+	return __hot_debugfs_comm_read(file, user, count, ppos,
+				__hot_debugfs_print_inode_freq_data);
+
+}
+
+/* fops to override for printing range data */
+static const struct file_operations hot_debugfs_range_fops = {
+	.read = __hot_debugfs_range_read,
+	.open = __hot_debugfs_open,
+};
+
+/* fops to override for printing inode data */
+static const struct file_operations hot_debugfs_inode_fops = {
+	.read = __hot_debugfs_inode_read,
+	.open = __hot_debugfs_open,
+};
+
+/*
+ * on each volume mount, initialize the debugfs dentries and associated
+ * structures (debugfs_vol_data and debugfs_log)
+ */
+static int hot_debugfs_volume_init(struct super_block *sb)
+{
+	struct dentry *debugfs_volume_entry = NULL;
+	struct dentry *debugfs_range_entry = NULL;
+	struct dentry *debugfs_inode_entry = NULL;
+	struct debugfs_vol_data *range_data = NULL;
+	struct debugfs_vol_data *inode_data = NULL;
+
+	if (!hot_debugfs_root_dentry)
+		goto debugfs_error;
+
+	/* create debugfs folder for this volume by mounted dev name */
+	debugfs_volume_entry = debugfs_create_dir(sb->s_id, hot_debugfs_root_dentry);
+
+	if (!debugfs_volume_entry)
+		goto debugfs_error;
+
+	/* malloc and initialize debugfs_vol_data for range_data */
+	range_data = kmalloc(sizeof(struct debugfs_vol_data),
+				GFP_KERNEL | GFP_NOFS);
+	memset(range_data, 0, sizeof(struct debugfs_vol_data));
+	range_data->debugfs_log = NULL;
+	range_data->sb = sb;
+	spin_lock_init(&range_data->log_lock);
+	range_data->log_alloc_size = 0;
+
+	/* malloc and initialize debugfs_vol_data for inode_data */
+	inode_data = kmalloc(sizeof(struct debugfs_vol_data),
+				GFP_KERNEL | GFP_NOFS);
+	memset(inode_data, 0, sizeof(struct debugfs_vol_data));
+	inode_data->debugfs_log = NULL;
+	inode_data->sb = sb;
+	spin_lock_init(&inode_data->log_lock);
+	inode_data->log_alloc_size = 0;
+
+	/*
+	 * add debugfs_vol_data for inode data and range data for
+	 * volume to list
+	 */
+	range_data->de = debugfs_volume_entry;
+	inode_data->de = debugfs_volume_entry;
+	spin_lock(&hot_debugfs_data_list_lock);
+	list_add(&range_data->node, &hot_debugfs_vol_data_list);
+	list_add(&inode_data->node, &hot_debugfs_vol_data_list);
+	spin_unlock(&hot_debugfs_data_list_lock);
+
+	/* create debugfs range_data file */
+	debugfs_range_entry = debugfs_create_file("range_data",
+				S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO,
+				debugfs_volume_entry,
+				(void *) range_data,
+				&hot_debugfs_range_fops);
+	if (!debugfs_range_entry)
+		goto debugfs_error;
+
+	/* create debugfs inode_data file */
+	debugfs_inode_entry = debugfs_create_file("inode_data",
+				S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO,
+				debugfs_volume_entry,
+				(void *) inode_data,
+				&hot_debugfs_inode_fops);
+
+	if (!debugfs_inode_entry)
+		goto debugfs_error;
+
+	return 0;
+
+debugfs_error:
+	kfree(range_data);
+	kfree(inode_data);
+
+	return -EIO;
+}
+
+/*
+ * find volume mounted (match by superblock) and remove
+ * debugfs dentry
+ */
+static void hot_debugfs_volume_exit(struct super_block *sb)
+{
+	struct list_head *head;
+	struct list_head *pos, *tmp;
+	struct debugfs_vol_data *data;
+
+	spin_lock(&hot_debugfs_data_list_lock);
+	head = &hot_debugfs_vol_data_list;
+	/* must clean up memory associated with superblock */
+	list_for_each_safe(pos, tmp, head)
+	{
+		data = list_entry(pos, struct debugfs_vol_data, node);
+		if (data->sb == sb) {
+			list_del(pos);
+			debugfs_remove_recursive(data->de);
+			kfree(data);
+			data = NULL;
+		}
+	}
+	spin_unlock(&hot_debugfs_data_list_lock);
+}
+
+/* initialize debugfs */
+static int hot_debugfs_init(struct super_block *sb)
+{
+	hot_debugfs_root_dentry = debugfs_create_dir(DEBUGFS_ROOT_NAME, NULL);
+	/*init list of debugfs data list */
+	INIT_LIST_HEAD(&hot_debugfs_vol_data_list);
+	/*init lock to list of debugfs data list */
+	spin_lock_init(&hot_debugfs_data_list_lock);
+	if (!hot_debugfs_root_dentry)
+		goto debugfs_error;
+
+	hot_debugfs_volume_init(sb);
+
+	return 0;
+
+debugfs_error:
+	return -EIO;
+}
+
+/* clean up memory and remove dentries for debugfs */
+static void hot_debugfs_exit(struct super_block *sb)
+{
+	/* first iterate through debugfs_vol_data_list and free memory */
+	struct list_head *head;
+	struct list_head *pos;
+	struct list_head *cur;
+	struct debugfs_vol_data *data;
+
+	hot_debugfs_volume_exit(sb);
+
+	spin_lock(&hot_debugfs_data_list_lock);
+	head = &hot_debugfs_vol_data_list;
+	list_for_each_safe(pos, cur, head) {
+		data = list_entry(pos, struct debugfs_vol_data, node);
+		if (data && pos != head)
+			kfree(data);
+	}
+	spin_unlock(&hot_debugfs_data_list_lock);
+
+	/* remove all debugfs entries recursively from the root */
+	debugfs_remove_recursive(hot_debugfs_root_dentry);
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -832,6 +1286,13 @@ void hot_track_init(struct super_block *sb)
 	root->hot_shrink.seeks = DEFAULT_SEEKS;
 	register_shrinker(&root->hot_shrink);
 
+	err = hot_debugfs_init(sb);
+	if (err) {
+		printk(KERN_ERR "%s: hot_debugfs_init error: %d\n",
+				__func__, err);
+		return;
+	}
+
 	printk(KERN_INFO "vfs: turning on hot data tracking\n");
 
 	return;
@@ -855,5 +1316,6 @@ void hot_track_exit(struct super_block *sb)
 	hot_inode_tree_exit(root);
 	sb->hot_flags &= ~MS_HOT_TRACKING;
 	hot_cache_exit();
+	hot_debugfs_exit(sb);
 	kfree(root);
 }
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 7a79a6d..76d7469 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -92,6 +92,49 @@
 #define AVW_DIVIDER_POWER 40
 #define AVW_COEFF_POWER 0
 
+/* size of log to vmalloc */
+#define INIT_LOG_ALLOC_SIZE (PAGE_SIZE * 10)
+#define LOG_PAGE_SIZE (PAGE_SIZE * 10)
+
+/*
+ * number of chars of device name to chop off
+ * for making debugfs folder e.g. /dev/sda -> sda
+ */
+#define DEV_NAME_CHOP 5
+
+/*
+ * Name for VFS data in debugfs directory
+ * e.g. /sys/kernel/debug/hot_track
+ */
+#define DEBUGFS_ROOT_NAME "hot_track"
+
+/* log to output to userspace in debugfs files */
+struct lstring {
+	char *str;
+	unsigned long len;
+};
+
+/*
+ * debugfs_vol_data is a struct of items
+ * that is passed to the debugfs
+ */
+struct debugfs_vol_data {
+	/* protected by hot_debugfs_data_list_lock */
+	struct list_head node;
+	struct lstring *debugfs_log;
+	struct super_block *sb;
+	struct dentry *de;
+	/* protects debugfs_log */
+	spinlock_t log_lock;
+	char log_work_buff[1024];
+	uint log_alloc_size;
+};
+
+typedef void (*hot_debugfs_walk_t)(
+			struct hot_inode_item *hot_inode,
+			struct debugfs_vol_data *data,
+			struct hot_info *root);
+
 struct hot_update_work {
 	struct work_struct work;
 	struct hot_info *hot_info;
-- 
1.7.6.5



* [RFC v3 13/13] vfs: add documentation
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (11 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 12/13] vfs: add debugfs support zwu.kernel
@ 2012-10-10 10:07 ` zwu.kernel
  2012-10-15  0:35   ` Zheng Liu
  2012-10-15  0:39 ` [RFC v3 00/13] vfs: hot data tracking Zheng Liu
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: zwu.kernel @ 2012-10-10 10:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, linux-btrfs, linux-kernel, linuxram, viro, david,
	dave, tytso, cmm, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 Documentation/filesystems/00-INDEX         |    2 +
 Documentation/filesystems/hot_tracking.txt |  165 ++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8c624a1..b68bdff 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -118,3 +118,5 @@ xfs.txt
 	- info and mount options for the XFS filesystem.
 xip.txt
 	- info on execute-in-place for file mappings.
+hot_tracking.txt
+	- info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..34dc232
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,165 @@
+Hot Data Tracking
+
+September, 2012		Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. Git Development Tree
+5. Usage Example
+
+
+1. Introduction
+
+  The feature adds experimental support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+"temperature" value that reflects what data is "hot," and using that
+temperature to move data to SSDs.
+
+  The long-term goal of the feature is to allow some filesystems,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+  Of course, users are warned not to run this code outside of development
+environments. These patches are EXPERIMENTAL, and as such they might eat
+your data and/or memory. That said, the code should be relatively safe
+when the hot_track mount option is disabled.
+
+2. Motivation
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+<https://btrfs.wiki.kernel.org/index.php/Project_ideas>.
+The work is divided into two steps: the VFS provides hot data tracking,
+while each specific filesystem provides hot data relocation.
+As the first step toward this goal, it is hoped that the patchset
+for hot data tracking will eventually mature and be merged into the VFS.
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+
+3. The Design
+
+The design includes the following parts:
+
+    * Hooks in existing vfs functions to track data access frequency
+
+    * New radix trees for tracking access frequency of inodes and sub-file
+ranges (fs/hot_tracking.c)
+    The relationship between super_block and the inode tree is as below:
+super_block->s_hotinfo.hot_inode_tree
+    In include/linux/fs.h, one struct hot_info s_hotinfo is added to
+super_block struct. Each FS instance can find hot tracking info
+s_hotinfo via its super_block. This hot_info stores the hot
+tracking info such as hot_inode_tree, inode and range hash list, etc.
+
+    * A hash list for indexing data by its temperature (fs/hot_tracking.c)
+
+    * A debugfs interface for dumping the radix trees (fs/hot_tracking.c)
+
+    * A workqueue for periodically updating inode heat info
+
+    * A mount option for enabling temperature tracking (-o hot_track,
+disabled by default)
+    * An ioctl to retrieve the frequency information collected for a certain
+file
+    * Ioctls to enable/disable frequency tracking per inode.
+
+Their relationships are as follows:
+
+    * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+    * hot_inode_item contains access frequency data for that inode
+
+    * hot_inode_item holds a heat hash node to index the access
+frequency data for that inode
+
+    * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+    * hot_range_item contains access frequency data for that range
+
+    * hot_range_item holds a heat hash node to index the access
+frequency data for that range
+
+    * hot_info.heat_inode_map indexes per-inode heat hash nodes
+
+    * hot_info.heat_range_map indexes per-range heat hash nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+heat_inode_map           hot_inode_tree
+    |                         |
+    |                         V
+    |           +-------hot_comm_item--------+
+    |           |       frequency data       |
++---+           |        list_head           |
+|               V            ^ |             V
+| ...<--hot_comm_item-->...  | |  ...<--hot_comm_item-->...
+|       frequency data       | |        frequency data
++-------->list_head----------+ +--------->list_head--->.....
+       hot_range_tree                  hot_range_tree
+                                             |
+             heat_range_map                  V
+                   |           +-------hot_comm_item--------+
+                   |           |       frequency data       |
+               +---+           |        list_head           |
+               |               V            ^ |             V
+               | ...<--hot_comm_item-->...  | |  ...<--hot_comm_item-->...
+               |       frequency data       | |        frequency data
+               +-------->list_head----------+ +--------->list_head--->.....
+
+4. Git Development Tree
+
+  The feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+  https://github.com/wuzhy/kernel.git hot_tracking
+  git://github.com/wuzhy/kernel.git hot_tracking
+
+
+5. Usage Example
+
+To use hot tracking, you should mount like this:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] vfs: turning on hot data tracking
+
+Mount debugfs first:
+
+$ mount -t debugfs none /sys/kernel/debug
+$ ls -l /sys/kernel/debug/hot_track/
+total 0
+drwxr-xr-x 2 root root 0 Aug  8 04:40 sdb
+$ ls -l /sys/kernel/debug/hot_track/sdb
+total 0
+-rw-r--r-- 1 root root 0 Aug  8 04:40 inode_data
+-rw-r--r-- 1 root root 0 Aug  8 04:40 range_data
+
+View information about hot tracking from debugfs:
+
+$ echo "hot tracking test" > /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_data
+inode #279, reads 0, writes 1, avg read time 18446744073709551615,
+avg write time 5251566408153596, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/range_data
+inode #279, range start 0 (range len 1048576) reads 0, writes 1,
+avg read time 18446744073709551615, avg write time 1128690176623144209, temp 64
+
+$ echo "hot data tracking test" >> /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_data
+inode #279, reads 0, writes 2, avg read time 18446744073709551615,
+avg write time 4923343766042451, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/range_data
+inode #279, range start 0 (range len 1048576) reads 0, writes 2,
+avg read time 18446744073709551615, avg write time 1058147040842596150, temp 64
+
-- 
1.7.6.5


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
       [not found]   ` <5075632c.03cc440a.1b33.7805SMTPIN_ADDED@mx.google.com>
@ 2012-10-10 12:21       ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-10 12:21 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner <lczerner@redhat.com> wrote:
> On Wed, 10 Oct 2012, zwu.kernel@gmail.com wrote:
>
>> Date: Wed, 10 Oct 2012 18:07:23 +0800
>> From: zwu.kernel@gmail.com
>> To: linux-fsdevel@vger.kernel.org
>> Cc: linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org,
>>     linux-kernel@vger.kernel.org, linuxram@linux.vnet.ibm.com,
>>     viro@zeniv.linux.org.uk, david@fromorbit.com, dave@jikos.cz,
>>     tytso@mit.edu, cmm@us.ibm.com, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
>>
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>> Introduce one new mount option '-o hot_track',
>> and add its parsing support.
>>   Its usage looks like:
>>    mount -o hot_track
>>    mount -o nouser,hot_track
>>    mount -o nouser,hot_track,loop
>>    mount -o hot_track,nouser
>
> This patch should probably be at the end of the series.
Can you let me know your reason? I think that it is not necessary to
be at the end of the series.

>
> -Lukas
>
>>
>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> ---
>>  fs/btrfs/ctree.h |    1 +
>>  fs/btrfs/super.c |    7 ++++++-
>>  2 files changed, 7 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 9821b67..094bec6 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
>>  #define BTRFS_MOUNT_CHECK_INTEGRITY  (1 << 20)
>>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
>>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR     (1 << 22)
>> +#define BTRFS_MOUNT_HOT_TRACK                (1 << 23)
>>
>>  #define btrfs_clear_opt(o, opt)              ((o) &= ~BTRFS_MOUNT_##opt)
>>  #define btrfs_set_opt(o, opt)                ((o) |= BTRFS_MOUNT_##opt)
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 83d6f9f..00be9e3 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -41,6 +41,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/cleancache.h>
>>  #include <linux/ratelimit.h>
>> +#include <linux/hot_tracking.h>
>>  #include "compat.h"
>>  #include "delayed-inode.h"
>>  #include "ctree.h"
>> @@ -303,7 +304,7 @@ enum {
>>       Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>>       Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
>>       Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
>> -     Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
>> +     Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
>>       Opt_check_integrity, Opt_check_integrity_including_extent_data,
>>       Opt_check_integrity_print_mask, Opt_fatal_errors,
>>       Opt_err,
>> @@ -342,6 +343,7 @@ static match_table_t tokens = {
>>       {Opt_no_space_cache, "nospace_cache"},
>>       {Opt_recovery, "recovery"},
>>       {Opt_skip_balance, "skip_balance"},
>> +     {Opt_hot_track, "hot_track"},
>>       {Opt_check_integrity, "check_int"},
>>       {Opt_check_integrity_including_extent_data, "check_int_data"},
>>       {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
>> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>>               case Opt_skip_balance:
>>                       btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
>>                       break;
>> +             case Opt_hot_track:
>> +                     btrfs_set_opt(info->mount_opt, HOT_TRACK);
>> +                     break;
>>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>>               case Opt_check_integrity_including_extent_data:
>>                       printk(KERN_INFO "btrfs: enabling check integrity"
>>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-10 12:21       ` Zhi Yong Wu
  (?)
@ 2012-10-10 13:11       ` Lukáš Czerner
  2012-10-10 13:16         ` Zhi Yong Wu
  -1 siblings, 1 reply; 55+ messages in thread
From: Lukáš Czerner @ 2012-10-10 13:11 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Lukáš Czerner, linux-fsdevel, linux-ext4, linux-btrfs,
	linux-kernel, linuxram, viro, david, dave, tytso, cmm,
	Zhi Yong Wu

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4775 bytes --]

On Wed, 10 Oct 2012, Zhi Yong Wu wrote:

> Date: Wed, 10 Oct 2012 20:21:48 +0800
> From: Zhi Yong Wu <zwu.kernel@gmail.com>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
>     linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org,
>     linuxram@linux.vnet.ibm.com, viro@zeniv.linux.org.uk, david@fromorbit.com,
>     dave@jikos.cz, tytso@mit.edu, cmm@us.ibm.com,
>     Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> Subject: Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
> 
> On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner <lczerner@redhat.com> wrote:
> > On Wed, 10 Oct 2012, zwu.kernel@gmail.com wrote:
> >
> >> Date: Wed, 10 Oct 2012 18:07:23 +0800
> >> From: zwu.kernel@gmail.com
> >> To: linux-fsdevel@vger.kernel.org
> >> Cc: linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org,
> >>     linux-kernel@vger.kernel.org, linuxram@linux.vnet.ibm.com,
> >>     viro@zeniv.linux.org.uk, david@fromorbit.com, dave@jikos.cz,
> >>     tytso@mit.edu, cmm@us.ibm.com, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> >> Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
> >>
> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> >>
> >> Introduce one new mount option '-o hot_track',
> >> and add its parsing support.
> >>   Its usage looks like:
> >>    mount -o hot_track
> >>    mount -o nouser,hot_track
> >>    mount -o nouser,hot_track,loop
> >>    mount -o hot_track,nouser
> >
> > This patch should probably be at the end of the series.
> Can you let me know your reason? I think that it is not necessary to
> be at the end of the series.

Simply because you're adding the mount option which does not do
anything yet. Moreover you change the implementation of the hot track
as you go. You should enable this once it is ready to use, not the other
way around. So, please move this to the end of the patch set when
the feature is supposed to be ready to use.

Thanks!
-Lukas

> 
> >
> > -Lukas
> >
> >>
> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> >> ---
> >>  fs/btrfs/ctree.h |    1 +
> >>  fs/btrfs/super.c |    7 ++++++-
> >>  2 files changed, 7 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >> index 9821b67..094bec6 100644
> >> --- a/fs/btrfs/ctree.h
> >> +++ b/fs/btrfs/ctree.h
> >> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
> >>  #define BTRFS_MOUNT_CHECK_INTEGRITY  (1 << 20)
> >>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
> >>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR     (1 << 22)
> >> +#define BTRFS_MOUNT_HOT_TRACK                (1 << 23)
> >>
> >>  #define btrfs_clear_opt(o, opt)              ((o) &= ~BTRFS_MOUNT_##opt)
> >>  #define btrfs_set_opt(o, opt)                ((o) |= BTRFS_MOUNT_##opt)
> >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >> index 83d6f9f..00be9e3 100644
> >> --- a/fs/btrfs/super.c
> >> +++ b/fs/btrfs/super.c
> >> @@ -41,6 +41,7 @@
> >>  #include <linux/slab.h>
> >>  #include <linux/cleancache.h>
> >>  #include <linux/ratelimit.h>
> >> +#include <linux/hot_tracking.h>
> >>  #include "compat.h"
> >>  #include "delayed-inode.h"
> >>  #include "ctree.h"
> >> @@ -303,7 +304,7 @@ enum {
> >>       Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
> >>       Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
> >>       Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
> >> -     Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
> >> +     Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
> >>       Opt_check_integrity, Opt_check_integrity_including_extent_data,
> >>       Opt_check_integrity_print_mask, Opt_fatal_errors,
> >>       Opt_err,
> >> @@ -342,6 +343,7 @@ static match_table_t tokens = {
> >>       {Opt_no_space_cache, "nospace_cache"},
> >>       {Opt_recovery, "recovery"},
> >>       {Opt_skip_balance, "skip_balance"},
> >> +     {Opt_hot_track, "hot_track"},
> >>       {Opt_check_integrity, "check_int"},
> >>       {Opt_check_integrity_including_extent_data, "check_int_data"},
> >>       {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
> >> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
> >>               case Opt_skip_balance:
> >>                       btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
> >>                       break;
> >> +             case Opt_hot_track:
> >> +                     btrfs_set_opt(info->mount_opt, HOT_TRACK);
> >> +                     break;
> >>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
> >>               case Opt_check_integrity_including_extent_data:
> >>                       printk(KERN_INFO "btrfs: enabling check integrity"
> >>
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-10 13:11       ` Lukáš Czerner
@ 2012-10-10 13:16         ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-10 13:16 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 9:11 PM, Lukáš Czerner <lczerner@redhat.com> wrote:
> On Wed, 10 Oct 2012, Zhi Yong Wu wrote:
>
>> Date: Wed, 10 Oct 2012 20:21:48 +0800
>> From: Zhi Yong Wu <zwu.kernel@gmail.com>
>> To: Lukáš Czerner <lczerner@redhat.com>
>> Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
>>     linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org,
>>     linuxram@linux.vnet.ibm.com, viro@zeniv.linux.org.uk, david@fromorbit.com,
>>     dave@jikos.cz, tytso@mit.edu, cmm@us.ibm.com,
>>     Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> Subject: Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
>>
>> On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner <lczerner@redhat.com> wrote:
>> > On Wed, 10 Oct 2012, zwu.kernel@gmail.com wrote:
>> >
>> >> Date: Wed, 10 Oct 2012 18:07:23 +0800
>> >> From: zwu.kernel@gmail.com
>> >> To: linux-fsdevel@vger.kernel.org
>> >> Cc: linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org,
>> >>     linux-kernel@vger.kernel.org, linuxram@linux.vnet.ibm.com,
>> >>     viro@zeniv.linux.org.uk, david@fromorbit.com, dave@jikos.cz,
>> >>     tytso@mit.edu, cmm@us.ibm.com, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> >> Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
>> >>
>> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> >>
>> >> Introduce one new mount option '-o hot_track',
>> >> and add its parsing support.
>> >>   Its usage looks like:
>> >>    mount -o hot_track
>> >>    mount -o nouser,hot_track
>> >>    mount -o nouser,hot_track,loop
>> >>    mount -o hot_track,nouser
>> >
>> > This patch should probably be at the end of the series.
>> Can you let me know your reason? I think that it is not necessary to
>> be at the end of the series.
>
> Simply because you're adding the mount option which does not do
> anything yet. Moreover you change the implementation of the hot track
> as you go. You should enable this once it is ready to use, not the other
> way around. So, please move this to the end of the patch set when
> the feature is supposed to be ready to use.
OK, done, thanks. Comments on the other patches would also be appreciated.

>
> Thanks!
> -Lukas
>
>>
>> >
>> > -Lukas
>> >
>> >>
>> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> >> ---
>> >>  fs/btrfs/ctree.h |    1 +
>> >>  fs/btrfs/super.c |    7 ++++++-
>> >>  2 files changed, 7 insertions(+), 1 deletions(-)
>> >>
>> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> >> index 9821b67..094bec6 100644
>> >> --- a/fs/btrfs/ctree.h
>> >> +++ b/fs/btrfs/ctree.h
>> >> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
>> >>  #define BTRFS_MOUNT_CHECK_INTEGRITY  (1 << 20)
>> >>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
>> >>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR     (1 << 22)
>> >> +#define BTRFS_MOUNT_HOT_TRACK                (1 << 23)
>> >>
>> >>  #define btrfs_clear_opt(o, opt)              ((o) &= ~BTRFS_MOUNT_##opt)
>> >>  #define btrfs_set_opt(o, opt)                ((o) |= BTRFS_MOUNT_##opt)
>> >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> >> index 83d6f9f..00be9e3 100644
>> >> --- a/fs/btrfs/super.c
>> >> +++ b/fs/btrfs/super.c
>> >> @@ -41,6 +41,7 @@
>> >>  #include <linux/slab.h>
>> >>  #include <linux/cleancache.h>
>> >>  #include <linux/ratelimit.h>
>> >> +#include <linux/hot_tracking.h>
>> >>  #include "compat.h"
>> >>  #include "delayed-inode.h"
>> >>  #include "ctree.h"
>> >> @@ -303,7 +304,7 @@ enum {
>> >>       Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>> >>       Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
>> >>       Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
>> >> -     Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
>> >> +     Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
>> >>       Opt_check_integrity, Opt_check_integrity_including_extent_data,
>> >>       Opt_check_integrity_print_mask, Opt_fatal_errors,
>> >>       Opt_err,
>> >> @@ -342,6 +343,7 @@ static match_table_t tokens = {
>> >>       {Opt_no_space_cache, "nospace_cache"},
>> >>       {Opt_recovery, "recovery"},
>> >>       {Opt_skip_balance, "skip_balance"},
>> >> +     {Opt_hot_track, "hot_track"},
>> >>       {Opt_check_integrity, "check_int"},
>> >>       {Opt_check_integrity_including_extent_data, "check_int_data"},
>> >>       {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
>> >> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>> >>               case Opt_skip_balance:
>> >>                       btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
>> >>                       break;
>> >> +             case Opt_hot_track:
>> >> +                     btrfs_set_opt(info->mount_opt, HOT_TRACK);
>> >> +                     break;
>> >>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>> >>               case Opt_check_integrity_including_extent_data:
>> >>                       printk(KERN_INFO "btrfs: enabling check integrity"
>> >>
>>
>>
>>
>>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 02/13] vfs: introduce private radix tree structures
  2012-10-10 10:07 ` [RFC v3 02/13] vfs: introduce private radix tree structures zwu.kernel
@ 2012-10-10 15:34   ` David Sterba
  2012-10-11 13:35     ` Zhi Yong Wu
  0 siblings, 1 reply; 55+ messages in thread
From: David Sterba @ 2012-10-10 15:34 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:24PM +0800, zwu.kernel@gmail.com wrote:
> +void hot_track_init(struct super_block *sb)
> +{
...
> +}

> +void hot_track_exit(struct super_block *sb)
> +{
> +	hot_cache_exit();
> +}

These need to be exported if btrfs is built as a module; otherwise it
does not link:

  LDS     arch/x86/boot/compressed/vmlinux.lds
  AS      arch/x86/boot/compressed/head_64.o
  CC      arch/x86/boot/compressed/misc.o
  CC      arch/x86/boot/compressed/string.o
  CC      arch/x86/boot/compressed/cmdline.o
  CC      arch/x86/boot/compressed/early_serial_console.o
  OBJCOPY arch/x86/boot/compressed/vmlinux.bin
ERROR: "hot_track_init" [fs/btrfs/btrfs.ko] undefined!
ERROR: "hot_track_exit" [fs/btrfs/btrfs.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
make: *** Waiting for unfinished jobs....
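
For illustration, a minimal sketch of the export that this build failure
asks for. It assumes the GPL-only variant is wanted; the thread does not
say whether EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() was chosen. The macro
is stubbed here so the fragment compiles outside the kernel tree, and the
function bodies are elided as in the quote above.

```c
/*
 * Sketch only. In the kernel tree EXPORT_SYMBOL_GPL() comes from
 * <linux/export.h>; the stub below just expands to a harmless
 * declaration so this fragment builds in userspace.
 */
#ifndef EXPORT_SYMBOL_GPL
#define EXPORT_SYMBOL_GPL(sym) extern int export_stub_##sym
#endif

struct super_block;	/* opaque here; defined in <linux/fs.h> in the kernel */

void hot_track_init(struct super_block *sb)
{
	(void)sb;	/* body elided, as in the quoted patch */
}
EXPORT_SYMBOL_GPL(hot_track_init);	/* makes the symbol visible to btrfs.ko */

void hot_track_exit(struct super_block *sb)
{
	(void)sb;	/* the quoted patch calls hot_cache_exit() here */
}
EXPORT_SYMBOL_GPL(hot_track_exit);
```

In the kernel the stub would be dropped and <linux/export.h> included
instead; modpost then resolves both symbols when btrfs is built as a
module.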


david

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-10 10:07 ` [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' zwu.kernel
       [not found]   ` <5075632c.03cc440a.1b33.7805SMTPIN_ADDED@mx.google.com>
@ 2012-10-10 16:28   ` David Sterba
  2012-10-11 13:41     ` Zhi Yong Wu
  2012-10-11 14:35     ` Zhi Yong Wu
  1 sibling, 2 replies; 55+ messages in thread
From: David Sterba @ 2012-10-10 16:28 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

Hi,

On Wed, Oct 10, 2012 at 06:07:23PM +0800, zwu.kernel@gmail.com wrote:
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
>  #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
> +#define BTRFS_MOUNT_HOT_TRACK		(1 << 23)

Please don't forget to add new options to btrfs_show_options(),
otherwise we can't tell what filesystems have hot tracking enabled.
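
A userspace sketch of what reporting the flag could look like. The bit
value and the set/clear macros are copied from the quoted ctree.h hunk;
demo_test_opt() and demo_show_options() are hypothetical stand-ins (the
kernel's btrfs_show_options() writes to a seq_file, and its test macro
takes a root pointer rather than a bare mask).

```c
#include <string.h>

/* Bit value copied from the quoted ctree.h hunk. */
#define BTRFS_MOUNT_HOT_TRACK		(1 << 23)

/* Same shape as the macros quoted from ctree.h. */
#define btrfs_clear_opt(o, opt)	((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt)	((o) |= BTRFS_MOUNT_##opt)

/* Hypothetical helper: tests a bare mask, for this sketch only. */
#define demo_test_opt(o, opt)	((o) & BTRFS_MOUNT_##opt)

/*
 * Hypothetical stand-in for the relevant part of btrfs_show_options():
 * append ",hot_track" to buf when the flag is set in mount_opt.
 */
void demo_show_options(unsigned long mount_opt, char *buf, size_t len)
{
	if (len == 0)
		return;
	buf[0] = '\0';
	if (demo_test_opt(mount_opt, HOT_TRACK))
		strncat(buf, ",hot_track", len - strlen(buf) - 1);
}
```

With the flag set, the output buffer holds ",hot_track", which is the
form /proc/mounts would need in order to show which filesystems have
tracking enabled.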

> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -303,7 +304,7 @@ enum {
>  	Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>  	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
>  	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
> -	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
> +	Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,

Please add the new option to the end.

>  	Opt_check_integrity, Opt_check_integrity_including_extent_data,
>  	Opt_check_integrity_print_mask, Opt_fatal_errors,
>  	Opt_err,
> @@ -342,6 +343,7 @@ static match_table_t tokens = {
>  	{Opt_no_space_cache, "nospace_cache"},
>  	{Opt_recovery, "recovery"},
>  	{Opt_skip_balance, "skip_balance"},
> +	{Opt_hot_track, "hot_track"},

(also this one)

>  	{Opt_check_integrity, "check_int"},
>  	{Opt_check_integrity_including_extent_data, "check_int_data"},
>  	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>  		case Opt_skip_balance:
>  			btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
>  			break;
> +		case Opt_hot_track:

It's a common pattern in the surrounding code that a message is printed
when enabling options, but the vfs prints its own, so I'm not sure if
it's needed here as well. Just thinking, leave it as it is now.

> +			btrfs_set_opt(info->mount_opt, HOT_TRACK);
> +			break;
>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>  		case Opt_check_integrity_including_extent_data:
>  			printk(KERN_INFO "btrfs: enabling check integrity"

david

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 12/13] vfs: add debugfs support
  2012-10-10 10:07 ` [RFC v3 12/13] vfs: add debugfs support zwu.kernel
@ 2012-10-10 16:53   ` David Sterba
  2012-10-10 21:05   ` David Sterba
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: David Sterba @ 2012-10-10 16:53 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.kernel@gmail.com wrote:
> +static int hot_debugfs_log_init(struct debugfs_vol_data *data)
> +{
> +	int err = 0;
> +	struct lstring *debugfs_log = data->debugfs_log;
> +
> +	spin_lock(&data->log_lock);
> +	debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE);

vmalloc() may sleep (it trips __might_sleep()), so do the allocation
outside of the lock and assign the value inside. Also, you may use
vzalloc() and drop the following memset().

dmesg:

vfs: turning on hot data tracking
BUG: sleeping function called from invalid context at mm/slab.c:3220
in_atomic(): 1, irqs_disabled(): 0, pid: 3103, name: mc
1 lock held by mc/3103:
 #0:  (&(&inode_data->log_lock)->rlock){+.+.+.}, at: [<ffffffff8118c656>] __hot_debugfs_comm_read+0x216/0x280
Pid: 3103, comm: mc Tainted: G        W    3.6.0hottrack-default+ #208
Call Trace:
 [<ffffffff8108068c>] __might_sleep+0xfc/0x130
 [<ffffffff8114f7c1>] kmem_cache_alloc_trace+0xe1/0x270
 [<ffffffff81142005>] __get_vm_area_node+0x95/0x1a0
 [<ffffffff8108630f>] ? local_clock+0x6f/0x80
 [<ffffffff8118c660>] ? __hot_debugfs_comm_read+0x220/0x280
 [<ffffffff8114283d>] __vmalloc_node_range+0x6d/0x200
 [<ffffffff8118c660>] ? __hot_debugfs_comm_read+0x220/0x280
 [<ffffffff8118b810>] ? hot_debugfs_log+0xe0/0xe0
 [<ffffffff81142a05>] __vmalloc_node+0x35/0x40
 [<ffffffff8118c660>] ? __hot_debugfs_comm_read+0x220/0x280
 [<ffffffff81142bcc>] vmalloc+0x2c/0x30
 [<ffffffff8118c660>] __hot_debugfs_comm_read+0x220/0x280
 [<ffffffff8191fd58>] ? __do_page_fault+0x238/0x590
 [<ffffffff810ab825>] ? trace_hardirqs_on_caller+0x155/0x1d0
 [<ffffffff8138d01e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8118c6d5>] __hot_debugfs_inode_read+0x15/0x20
 [<ffffffff8115a19b>] vfs_read+0xcb/0x190
 [<ffffffff8115a2c2>] sys_read+0x62/0xb0
 [<ffffffff8138d01e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff81924979>] system_call_fastpath+0x16/0x1b

> +	if (debugfs_log->str) {
> +		memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE);
> +		data->log_alloc_size = INIT_LOG_ALLOC_SIZE;
> +	} else {
> +		err = -ENOMEM;
> +	}
> +	spin_unlock(&data->log_lock);
> +
> +	return err;
> +}
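
The ordering suggested above can be sketched in userspace as follows.
This is an analogy, not the patch's code: a pthread mutex stands in for
the spinlock, calloc() for vzalloc(), and the struct, function name and
INIT_LOG_ALLOC_SIZE value are assumptions made for the example.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define INIT_LOG_ALLOC_SIZE 4096	/* assumed; the patch's value is not quoted */

/* Hypothetical stand-in for the log state guarded by log_lock. */
struct demo_vol_log {
	pthread_mutex_t lock;	/* plays the role of the spinlock */
	char *str;
	size_t alloc_size;
};

int demo_log_init(struct demo_vol_log *dlog)
{
	/*
	 * The allocation, which may block, happens before the lock is
	 * taken. calloc() returns zeroed memory, so no memset is needed.
	 */
	char *buf = calloc(1, INIT_LOG_ALLOC_SIZE);

	if (!buf)
		return -ENOMEM;

	/* Only the pointer and size assignments run under the lock. */
	pthread_mutex_lock(&dlog->lock);
	dlog->str = buf;
	dlog->alloc_size = INIT_LOG_ALLOC_SIZE;
	pthread_mutex_unlock(&dlog->lock);

	return 0;
}
```

The critical section shrinks to two assignments, so nothing that can
sleep runs in atomic context.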

I'm now playing with it, and haven't gone through the code yet,
david

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 12/13] vfs: add debugfs support
  2012-10-10 10:07 ` [RFC v3 12/13] vfs: add debugfs support zwu.kernel
  2012-10-10 16:53   ` David Sterba
@ 2012-10-10 21:05   ` David Sterba
  2012-10-15  7:55   ` Dave Chinner
  2012-10-15  8:04   ` Dave Chinner
  3 siblings, 0 replies; 55+ messages in thread
From: David Sterba @ 2012-10-10 21:05 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.kernel@gmail.com wrote:
> +static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len)
> +{
> +	struct lstring *debugfs_log = data->debugfs_log;
> +	uint new_log_alloc_size;
> +	char *new_log;
> +	static char err_msg[] = "No more memory!\n";
> +
> +	if (len >= data->log_alloc_size - debugfs_log->len) {
> +		/*
> +		 * Not enough room in the log buffer for the new message.
> +		 * Allocate a bigger buffer.
> +		 */
> +		new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE;
> +		new_log = vmalloc(new_log_alloc_size);

This is also called with a spinlock held, from hot_debugfs_log(), and it
is a frequent call. I found my testbox inaccessible after an hour of
md5sums on a partition when I tried to print the contents of the debugfs
files.

Serial console log filled with

[ 4886.141736] BUG: scheduling while atomic: mc/3176/0x00000004
[ 4886.148443] INFO: lockdep is turned off.
[ 4886.153424] Modules linked in: aoe dm_crypt loop btrfs
[ 4886.159705] Pid: 3176, comm: mc Tainted: G        W    3.6.0hottrack-default+ #209
[ 4886.168346] Call Trace:
[ 4886.171842]  [<ffffffff8107e528>] __schedule_bug+0x68/0x90
[ 4886.178427]  [<ffffffff81919b6c>] __schedule+0x73c/0x810
[ 4886.184809]  [<ffffffff81919cf9>] schedule+0x29/0x70
[ 4886.190838]  [<ffffffff8191729c>] schedule_timeout+0x17c/0x2f0
[ 4886.197732]  [<ffffffff8105c260>] ? del_timer+0x100/0x100
[ 4886.204198]  [<ffffffff8191b59b>] ? _raw_spin_unlock+0x2b/0x50
[ 4886.211099]  [<ffffffff8191742e>] schedule_timeout_uninterruptible+0x1e/0x20
[ 4886.219211]  [<ffffffff811150a9>] __alloc_pages_nodemask+0x839/0x9f0
[ 4886.226624]  [<ffffffff811428c3>] __vmalloc_node_range+0xf3/0x200
[ 4886.233788]  [<ffffffff8118b659>] ? hot_debugfs_copy+0x59/0x130
[ 4886.240774]  [<ffffffff81142a05>] __vmalloc_node+0x35/0x40
[ 4886.247335]  [<ffffffff8118b659>] ? hot_debugfs_copy+0x59/0x130
[ 4886.254331]  [<ffffffff81142bcc>] vmalloc+0x2c/0x30
[ 4886.260299]  [<ffffffff8118b659>] hot_debugfs_copy+0x59/0x130
[ 4886.267130]  [<ffffffff8118b7c6>] hot_debugfs_log+0x96/0xe0
[ 4886.273772]  [<ffffffff8118b86f>] __hot_debugfs_print_inode_freq_data+0x5f/0x80
[ 4886.282149]  [<ffffffff8118c58c>] __hot_debugfs_comm_read+0x14c/0x280
[ 4886.289653]  [<ffffffff8118b810>] ? hot_debugfs_log+0xe0/0xe0
[ 4886.296461]  [<ffffffff8118c6d5>] __hot_debugfs_inode_read+0x15/0x20
[ 4886.303853]  [<ffffffff8115a19b>] vfs_read+0xcb/0x190
[ 4886.309934]  [<ffffffff8115a2c2>] sys_read+0x62/0xb0
[ 4886.315927]  [<ffffffff8138d01e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4886.323399]  [<ffffffff81924979>] system_call_fastpath+0x16/0x1b
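
One way to avoid allocating under the lock when the buffer has to grow
is to drop the lock, allocate, retake the lock and re-check. The sketch
below shows that pattern in userspace under stated assumptions: a
pthread mutex replaces the spinlock, malloc() replaces vmalloc(), and
the struct, function name and LOG_PAGE_SIZE value are hypothetical. It
is not the fix the author adopted.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define LOG_PAGE_SIZE 4096	/* assumed growth step */

/* Hypothetical stand-in for the growable debugfs log. */
struct demo_grow_log {
	pthread_mutex_t lock;
	char *str;
	size_t len;		/* bytes used, excluding the trailing NUL */
	size_t alloc_size;	/* bytes allocated */
};

int demo_log_append(struct demo_grow_log *dlog, const char *msg, size_t len)
{
	char *spare = NULL;	/* buffer allocated while unlocked */
	size_t spare_size = 0;

	for (;;) {
		size_t need;

		pthread_mutex_lock(&dlog->lock);
		need = dlog->len + len + 1;	/* +1 for the trailing NUL */

		if (need > dlog->alloc_size && spare_size >= need) {
			/* Swap in the bigger buffer; free the old one later. */
			char *old = dlog->str;

			if (dlog->len)
				memcpy(spare, old, dlog->len);
			dlog->str = spare;
			dlog->alloc_size = spare_size;
			spare = old;
			spare_size = 0;
		}
		if (need <= dlog->alloc_size) {
			memcpy(dlog->str + dlog->len, msg, len);
			dlog->len += len;
			dlog->str[dlog->len] = '\0';
			pthread_mutex_unlock(&dlog->lock);
			free(spare);	/* old or unused buffer, freed unlocked */
			return 0;
		}
		pthread_mutex_unlock(&dlog->lock);

		/* Not enough room: allocate with the lock dropped, then retry. */
		free(spare);
		spare_size = need + LOG_PAGE_SIZE;
		spare = malloc(spare_size);
		if (!spare)
			return -ENOMEM;
	}
}
```

The re-check after retaking the lock matters because another writer may
have appended or grown the buffer while the lock was dropped.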


david

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 02/13] vfs: introduce private radix tree structures
  2012-10-10 15:34   ` David Sterba
@ 2012-10-11 13:35     ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-11 13:35 UTC (permalink / raw)
  To: dave, zwu.kernel, linux-fsdevel, linux-ext4, linux-btrfs,
	linux-kernel, linuxram, viro, david, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 11:34 PM, David Sterba <dave@jikos.cz> wrote:
> On Wed, Oct 10, 2012 at 06:07:24PM +0800, zwu.kernel@gmail.com wrote:
>> +void hot_track_init(struct super_block *sb)
>> +{
> ...
>> +}
>
>> +void hot_track_exit(struct super_block *sb)
>> +{
>> +     hot_cache_exit();
>> +}
>
> These need to be exported if btrfs is built as a module; otherwise it
> does not link:
>
>   LDS     arch/x86/boot/compressed/vmlinux.lds
>   AS      arch/x86/boot/compressed/head_64.o
>   CC      arch/x86/boot/compressed/misc.o
>   CC      arch/x86/boot/compressed/string.o
>   CC      arch/x86/boot/compressed/cmdline.o
>   CC      arch/x86/boot/compressed/early_serial_console.o
>   OBJCOPY arch/x86/boot/compressed/vmlinux.bin
> ERROR: "hot_track_init" [fs/btrfs/btrfs.ko] undefined!
> ERROR: "hot_track_exit" [fs/btrfs/btrfs.ko] undefined!
> make[1]: *** [__modpost] Error 1
> make: *** [modules] Error 2
> make: *** Waiting for unfinished jobs....
Sorry for the late response. Great, thanks.

>
>
> david



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-10 16:28   ` David Sterba
@ 2012-10-11 13:41     ` Zhi Yong Wu
  2012-10-11 14:35     ` Zhi Yong Wu
  1 sibling, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-11 13:41 UTC (permalink / raw)
  To: dave, zwu.kernel, linux-fsdevel, linux-ext4, linux-btrfs,
	linux-kernel, linuxram, viro, david, tytso, cmm, Zhi Yong Wu

On Thu, Oct 11, 2012 at 12:28 AM, David Sterba <dave@jikos.cz> wrote:
> Hi,
>
> On Wed, Oct 10, 2012 at 06:07:23PM +0800, zwu.kernel@gmail.com wrote:
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
>>  #define BTRFS_MOUNT_CHECK_INTEGRITY  (1 << 20)
>>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
>>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR     (1 << 22)
>> +#define BTRFS_MOUNT_HOT_TRACK                (1 << 23)
>
> Please don't forget to add new options to btrfs_show_options(),
> otherwise we can't tell what filesystems have hot tracking enabled.
Great catch, thanks.
>
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -303,7 +304,7 @@ enum {
>>       Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>>       Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
>>       Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
>> -     Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
>> +     Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
>
> Please add the new option to the end.
OK.
>
>>       Opt_check_integrity, Opt_check_integrity_including_extent_data,
>>       Opt_check_integrity_print_mask, Opt_fatal_errors,
>>       Opt_err,
>> @@ -342,6 +343,7 @@ static match_table_t tokens = {
>>       {Opt_no_space_cache, "nospace_cache"},
>>       {Opt_recovery, "recovery"},
>>       {Opt_skip_balance, "skip_balance"},
>> +     {Opt_hot_track, "hot_track"},
>
> (also this one)
ditto.
>
>>       {Opt_check_integrity, "check_int"},
>>       {Opt_check_integrity_including_extent_data, "check_int_data"},
>>       {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
>> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>>               case Opt_skip_balance:
>>                       btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
>>                       break;
>> +             case Opt_hot_track:
>
> It's a common pattern in the surrounding code that a message is printed
> when enabling options, but the vfs prints its own, so I'm not sure if
> it's needed here as well. Just thinking, leave it as it is now.
OK
>
>> +                     btrfs_set_opt(info->mount_opt, HOT_TRACK);
>> +                     break;
>>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>>               case Opt_check_integrity_including_extent_data:
>>                       printk(KERN_INFO "btrfs: enabling check integrity"
>
> david



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-10 16:28   ` David Sterba
  2012-10-11 13:41     ` Zhi Yong Wu
@ 2012-10-11 14:35     ` Zhi Yong Wu
  2012-10-11 14:41       ` David Sterba
  1 sibling, 1 reply; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-11 14:35 UTC (permalink / raw)
  To: dave, zwu.kernel, linux-fsdevel, linux-ext4, linux-btrfs,
	linux-kernel, linuxram, viro, david, tytso, cmm, Zhi Yong Wu

On Thu, Oct 11, 2012 at 12:28 AM, David Sterba <dave@jikos.cz> wrote:
> Hi,
>
> On Wed, Oct 10, 2012 at 06:07:23PM +0800, zwu.kernel@gmail.com wrote:
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
>>  #define BTRFS_MOUNT_CHECK_INTEGRITY  (1 << 20)
>>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
>>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR     (1 << 22)
>> +#define BTRFS_MOUNT_HOT_TRACK                (1 << 23)
>
> Please don't forget to add new options to btrfs_show_options(),
> otherwise we can't tell what filesystems have hot tracking enabled.
>
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -303,7 +304,7 @@ enum {
>>       Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>>       Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
>>       Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
>> -     Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
>> +     Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
>
> Please add the new option to the end.
To be honest, it can't be added at the very end. If you check Opt_err's
pattern value, you will find it is NULL, which makes match_one() return
1. So if we add Opt_hot_track after it at the end of this array,
match_token() will never reach it. I prefer to add it right after
Opt_fatal_errors. What do you think?
>
>>       Opt_check_integrity, Opt_check_integrity_including_extent_data,
>>       Opt_check_integrity_print_mask, Opt_fatal_errors,
>>       Opt_err,
>> @@ -342,6 +343,7 @@ static match_table_t tokens = {
>>       {Opt_no_space_cache, "nospace_cache"},
>>       {Opt_recovery, "recovery"},
>>       {Opt_skip_balance, "skip_balance"},
>> +     {Opt_hot_track, "hot_track"},
>
> (also this one)
>
>>       {Opt_check_integrity, "check_int"},
>>       {Opt_check_integrity_including_extent_data, "check_int_data"},
>>       {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
>> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>>               case Opt_skip_balance:
>>                       btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
>>                       break;
>> +             case Opt_hot_track:
>
> It's a common pattern in the surrounding code that a message is printed
> when enabling options, but the vfs prints its own, so I'm not sure if
> it's needed here as well. Just thinking, leave it as it is now.
>
>> +                     btrfs_set_opt(info->mount_opt, HOT_TRACK);
>> +                     break;
>>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>>               case Opt_check_integrity_including_extent_data:
>>                       printk(KERN_INFO "btrfs: enabling check integrity"
>
> david



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-11 14:35     ` Zhi Yong Wu
@ 2012-10-11 14:41       ` David Sterba
  2012-10-11 14:46         ` Zhi Yong Wu
  0 siblings, 1 reply; 55+ messages in thread
From: David Sterba @ 2012-10-11 14:41 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: dave, linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel,
	linuxram, viro, david, tytso, cmm, Zhi Yong Wu

On Thu, Oct 11, 2012 at 10:35:28PM +0800, Zhi Yong Wu wrote:
> >> --- a/fs/btrfs/super.c
> >> +++ b/fs/btrfs/super.c
> >> @@ -303,7 +304,7 @@ enum {
> >>       Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
> >>       Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
> >>       Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
> >> -     Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
> >> +     Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
> >
> > Please add the new option to the end.
> To be honest, it can't be added at the very end. If you check Opt_err's
> pattern value, you will find it is NULL, which makes match_one() return
> 1. So if we add Opt_hot_track after it at the end of this array,
> match_token() will never reach it. I prefer to add it right after
> Opt_fatal_errors. What do you think?

Ah, sorry, I was not clear about what 'end' means here. Opt_err is an
end-of-the-list token, so yes, please add hot_track between
Opt_fatal_errors and Opt_err.
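
To illustrate why the position relative to Opt_err matters, here is a
small userspace model of the table walk. It is only a sketch and not the
kernel's lib/parser.c: toy_match_one(), toy_match_token(), the TOY_ enum
values and the plain strcmp() comparison are assumptions made for the
example (the real match_one() also handles format patterns). A NULL
pattern matches any string, so the scan stops at that entry and anything
placed after it is never reached.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tokens, loosely mirroring a few of the btrfs ones. */
enum { TOY_OPT_FATAL_ERRORS, TOY_OPT_HOT_TRACK, TOY_OPT_ERR };

struct toy_match_token {
	int token;
	const char *pattern;	/* NULL marks the end-of-list entry */
};

/* A NULL pattern matches every string; otherwise compare literally. */
static int toy_match_one(const char *s, const char *pattern)
{
	if (pattern == NULL)
		return 1;
	return strcmp(s, pattern) == 0;
}

/* Walk the table until something matches; the NULL entry always does. */
static int toy_match_token(const char *s, const struct toy_match_token *table)
{
	const struct toy_match_token *p;

	for (p = table; !toy_match_one(s, p->pattern); p++)
		;
	return p->token;
}

/* hot_track sits before the end-of-list entry, so it can be matched. */
static const struct toy_match_token toy_good[] = {
	{ TOY_OPT_FATAL_ERRORS, "fatal_errors" },
	{ TOY_OPT_HOT_TRACK, "hot_track" },
	{ TOY_OPT_ERR, NULL },
};

/* hot_track sits after the end-of-list entry, so the scan never gets there. */
static const struct toy_match_token toy_bad[] = {
	{ TOY_OPT_FATAL_ERRORS, "fatal_errors" },
	{ TOY_OPT_ERR, NULL },
	{ TOY_OPT_HOT_TRACK, "hot_track" },
};
```

With toy_good, "hot_track" resolves to TOY_OPT_HOT_TRACK; with toy_bad
the same string falls through to TOY_OPT_ERR.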

david

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
  2012-10-11 14:41       ` David Sterba
@ 2012-10-11 14:46         ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-11 14:46 UTC (permalink / raw)
  To: dave, Zhi Yong Wu, linux-fsdevel, linux-ext4, linux-btrfs,
	linux-kernel, linuxram, viro, david, tytso, cmm, Zhi Yong Wu

On Thu, Oct 11, 2012 at 10:41 PM, David Sterba <dave@jikos.cz> wrote:
> On Thu, Oct 11, 2012 at 10:35:28PM +0800, Zhi Yong Wu wrote:
>> >> --- a/fs/btrfs/super.c
>> >> +++ b/fs/btrfs/super.c
>> >> @@ -303,7 +304,7 @@ enum {
>> >>       Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
>> >>       Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
>> >>       Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
>> >> -     Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
>> >> +     Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
>> >
>> > Please add the new option to the end.
>> To be honest, it can't be added at the very end. If you check Opt_err's
>> pattern value, you will find it is NULL, which makes match_one() return
>> 1. So if we add Opt_hot_track after it at the end of this array,
>> match_token() will never reach it. I prefer to add it right after
>> Opt_fatal_errors. What do you think?
>
> Ah, sorry, I was not clear about what 'end' means here. Opt_err is an
> end-of-the-list token, so yes, please add hot_track between
> Opt_fatal_errors and Opt_err.
Done, thanks. I would appreciate it if you could comment on the other
patches in this series. :)

>
> david



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 13/13] vfs: add documentation
  2012-10-10 10:07 ` [RFC v3 13/13] vfs: add documentation zwu.kernel
@ 2012-10-15  0:35   ` Zheng Liu
  2012-10-15  7:04     ` Zhi Yong Wu
  0 siblings, 1 reply; 55+ messages in thread
From: Zheng Liu @ 2012-10-15  0:35 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

Hi Zhi Yong,

[cut...]
> +3. The Design
> +
> +These include the following parts:
> +
> +    * Hooks in existing vfs functions to track data access frequency
> +
> +    * New rbtrees for tracking access frequency of inodes and sub-file
             ^^^^^^^ s/rbtrees/radix-trees
> +ranges (hot_rb.c)
           ^^^^^^^^ It seems all the code is now in one file.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (12 preceding siblings ...)
  2012-10-10 10:07 ` [RFC v3 13/13] vfs: add documentation zwu.kernel
@ 2012-10-15  0:39 ` Zheng Liu
  2012-10-15  7:05   ` Zhi Yong Wu
  2012-10-15 20:42 ` Dave Chinner
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Zheng Liu @ 2012-10-15  0:39 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, david, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
> NOTE:
> 
>   The patchset is currently post out mainly to make sure
> it is going in the correct direction and hope to get some
> helpful comments from other guys.
>   For more information, please check hot_tracking.txt in Documentation

Hi Zhi Yong,

If I want to use this patch set in ext4, can I apply it directly, or do
I need to call some functions as btrfs does?  Thanks.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 13/13] vfs: add documentation
  2012-10-15  0:35   ` Zheng Liu
@ 2012-10-15  7:04     ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-15  7:04 UTC (permalink / raw)
  To: zwu.kernel, linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel,
	linuxram, viro, david, dave, tytso, cmm, Zhi Yong Wu

On Mon, Oct 15, 2012 at 8:35 AM, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> Hi Zhi Yong,
>
> [cut...]
>> +3. The Design
>> +
>> +These include the following parts:
>> +
>> +    * Hooks in existing vfs functions to track data access frequency
>> +
>> +    * New rbtrees for tracking access frequency of inodes and sub-file
>              ^^^^^^^ s/rbtrees/radix-trees
>> +ranges (hot_rb.c)
>            ^^^^^^^^ It seems all the code is now in one file.
Hi Zheng,

Good catch, I will update them, thanks.
>
> Regards,
> Zheng



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-15  0:39 ` [RFC v3 00/13] vfs: hot data tracking Zheng Liu
@ 2012-10-15  7:05   ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-15  7:05 UTC (permalink / raw)
  To: zwu.kernel, linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel,
	linuxram, viro, david, dave, tytso, cmm, Zhi Yong Wu

On Mon, Oct 15, 2012 at 8:39 AM, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>> NOTE:
>>
>>   The patchset is currently post out mainly to make sure
>> it is going in the correct direction and hope to get some
>> helpful comments from other guys.
>>   For more information, please check hot_tracking.txt in Documentation
>
> Hi Zhi Yong,
>
> If I want to use this patch set in ext4, can I apply it directly, or do
> I need to call some functions as btrfs does?  Thanks.
Hi Zheng,

It is the latter. If you have any questions, please feel free to contact me.

>
> Regards,
> Zheng



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
  2012-10-10 10:07 ` [RFC v3 11/13] vfs: add 3 new ioctl interfaces zwu.kernel
@ 2012-10-15  7:48   ` Dave Chinner
  2012-10-15  7:57     ` Zhi Yong Wu
  2012-10-16  3:17   ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2012-10-15  7:48 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
>   FS_IOC_GET_HEAT_INFO: return a struct containing the various
> metrics collected in btrfs_freq_data structs, and also return a
> calculated data temperature based on those metrics. Optionally, retrieve
> the temperature from the hot data hash list instead of recalculating it.
> 
>   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
> state of hot data tracking and migration:
> 
> 0 = do nothing
> 1 = track frequency of access
> 
>   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
> migration, as described above.
.....
> +struct hot_heat_info {
> +	__u64 avg_delta_reads;
> +	__u64 avg_delta_writes;
> +	__u64 last_read_time;
> +	__u64 last_write_time;
> +	__u32 num_reads;
> +	__u32 num_writes;
> +	__u32 temperature;
> +	__u8 live;
> +	char filename[PATH_MAX];

Don't put the filename in the ioctl and open the file in the kernel.
Have userspace open the file directly and issue the ioctl on the fd
that is returned.
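
A rough userspace sketch of that calling convention, under stated
assumptions: the struct below is the posted one with the filename member
dropped, the request number is a placeholder rather than the value from
the patch, and get_heat_info() is a made-up helper name. On a kernel
without this series the ioctl simply fails.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Same fields as the posted struct, minus the filename member. */
struct hot_heat_info {
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
	uint64_t last_read_time;
	uint64_t last_write_time;
	uint32_t num_reads;
	uint32_t num_writes;
	uint32_t temperature;
	uint8_t live;
};

/* ASSUMPTION: placeholder request number, not the one from the patch. */
#define FS_IOC_GET_HEAT_INFO _IOR('f', 200, struct hot_heat_info)

/* Returns 0 on success, -1 if either the open or the ioctl fails. */
static int get_heat_info(const char *path, struct hot_heat_info *info)
{
	int fd, ret;

	/* Userspace opens the file itself, so normal permission checks apply. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The kernel identifies the inode from the fd, not from a path string. */
	ret = ioctl(fd, FS_IOC_GET_HEAT_INFO, info);
	close(fd);
	return ret < 0 ? -1 : 0;
}
```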

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 12/13] vfs: add debugfs support
  2012-10-10 10:07 ` [RFC v3 12/13] vfs: add debugfs support zwu.kernel
  2012-10-10 16:53   ` David Sterba
  2012-10-10 21:05   ` David Sterba
@ 2012-10-15  7:55   ` Dave Chinner
  2012-10-15  8:15     ` Zhi Yong Wu
  2012-10-15  8:04   ` Dave Chinner
  3 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2012-10-15  7:55 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
>   Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
> volume that contains two files. The first, `inode_data', contains the
> heat information for inodes that have been brought into the hot data map
> structures. The second, `range_data', contains similar information for
> subfile ranges.
....
> +	/* create debugfs range_data file */
> +	debugfs_range_entry = debugfs_create_file("range_data",
> +				S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO,
> +				debugfs_volume_entry,
> +				(void *) range_data,
> +				&hot_debugfs_range_fops);

These should not be world readable. 0600 is probably the correct
permissions for them as we do not want random users to be able to
infer what files users are accessing from this information.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
  2012-10-15  7:48   ` Dave Chinner
@ 2012-10-15  7:57     ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-15  7:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Mon, Oct 15, 2012 at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   FS_IOC_GET_HEAT_INFO: return a struct containing the various
>> metrics collected in btrfs_freq_data structs, and also return a
>> calculated data temperature based on those metrics. Optionally, retrieve
>> the temperature from the hot data hash list instead of recalculating it.
>>
>>   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
>> state of hot data tracking and migration:
>>
>> 0 = do nothing
>> 1 = track frequency of access
>>
>>   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
>> migration, as described above.
> .....
>> +struct hot_heat_info {
>> +     __u64 avg_delta_reads;
>> +     __u64 avg_delta_writes;
>> +     __u64 last_read_time;
>> +     __u64 last_write_time;
>> +     __u32 num_reads;
>> +     __u32 num_writes;
>> +     __u32 temperature;
>> +     __u8 live;
>> +     char filename[PATH_MAX];
>
> Don't put the filename in the ioctl and open the file in the kernel.
> Have userspace open the file directly and issue the ioctl on the fd
> that is returned.
OK, thanks. By the way, do you think that it is necessary to provide
another new ioctl interface to set the temperature value?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 12/13] vfs: add debugfs support
  2012-10-10 10:07 ` [RFC v3 12/13] vfs: add debugfs support zwu.kernel
                     ` (2 preceding siblings ...)
  2012-10-15  7:55   ` Dave Chinner
@ 2012-10-15  8:04   ` Dave Chinner
  2012-10-15  8:47     ` Zhi Yong Wu
  3 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2012-10-15  8:04 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
>   Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
> volume that contains two files. The first, `inode_data', contains the
> heat information for inodes that have been brought into the hot data map
> structures. The second, `range_data', contains similar information for
> subfile ranges.
> 
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
>  fs/hot_tracking.c |  462 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/hot_tracking.h |   43 +++++
>  2 files changed, 505 insertions(+), 0 deletions(-)
.....
> +static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len)
> +{
> +	struct lstring *debugfs_log = data->debugfs_log;
> +	uint new_log_alloc_size;
> +	char *new_log;
> +	static char err_msg[] = "No more memory!\n";
> +
> +	if (len >= data->log_alloc_size - debugfs_log->len) {
......
> +	}
> +
> +	memcpy(debugfs_log->str + debugfs_log->len, data->log_work_buff, len);
> +	debugfs_log->len += (unsigned long) len;
> +
> +	return len;
> +}
> +
> +/* Returns the number of bytes written to the log. */
> +static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
> +{
> +	struct lstring *debugfs_log = data->debugfs_log;
> +	va_list args;
> +	int len;
> +	static char trunc_msg[] =
> +			"The next message has been truncated.\n";
> +
> +	if (debugfs_log->str == NULL)
> +		return -1;
> +
> +	spin_lock(&data->log_lock);
> +
> +	va_start(args, fmt);
> +	len = vsnprintf(data->log_work_buff,
> +			sizeof(data->log_work_buff), fmt, args);
> +	va_end(args);
> +
> +	if (len >= sizeof(data->log_work_buff)) {
> +		hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg));
> +	}
> +
> +	len = hot_debugfs_copy(data, data->log_work_buff, len);
> +	spin_unlock(&data->log_lock);
> +
> +	return len;
> +}

Aren't you just recreating seq_printf() here? i.e. can't you replace
all this complexity with generic seq_file/seq_operations constructs?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 12/13] vfs: add debugfs support
  2012-10-15  7:55   ` Dave Chinner
@ 2012-10-15  8:15     ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-15  8:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Mon, Oct 15, 2012 at 3:55 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
>> volume that contains two files. The first, `inode_data', contains the
>> heat information for inodes that have been brought into the hot data map
>> structures. The second, `range_data', contains similar information for
>> subfile ranges.
> ....
>> +     /* create debugfs range_data file */
>> +     debugfs_range_entry = debugfs_create_file("range_data",
>> +                             S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO,
>> +                             debugfs_volume_entry,
>> +                             (void *) range_data,
>> +                             &hot_debugfs_range_fops);
>
> These should not be world readable. 0600 is probably the correct
> permissions for them as we do not want random users to be able to
> infer what files users are accessing from this information.
Good catch, its mode should be S_IFREG | S_IRUSR | S_IWUSR

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 12/13] vfs: add debugfs support
  2012-10-15  8:04   ` Dave Chinner
@ 2012-10-15  8:47     ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-15  8:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Mon, Oct 15, 2012 at 4:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
>> volume that contains two files. The first, `inode_data', contains the
>> heat information for inodes that have been brought into the hot data map
>> structures. The second, `range_data', contains similar information for
>> subfile ranges.
>>
>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> ---
>>  fs/hot_tracking.c |  462 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/hot_tracking.h |   43 +++++
>>  2 files changed, 505 insertions(+), 0 deletions(-)
> .....
>> +static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len)
>> +{
>> +     struct lstring *debugfs_log = data->debugfs_log;
>> +     uint new_log_alloc_size;
>> +     char *new_log;
>> +     static char err_msg[] = "No more memory!\n";
>> +
>> +     if (len >= data->log_alloc_size - debugfs_log->len) {
> ......
>> +     }
>> +
>> +     memcpy(debugfs_log->str + debugfs_log->len, data->log_work_buff, len);
>> +     debugfs_log->len += (unsigned long) len;
>> +
>> +     return len;
>> +}
>> +
>> +/* Returns the number of bytes written to the log. */
>> +static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
>> +{
>> +     struct lstring *debugfs_log = data->debugfs_log;
>> +     va_list args;
>> +     int len;
>> +     static char trunc_msg[] =
>> +                     "The next message has been truncated.\n";
>> +
>> +     if (debugfs_log->str == NULL)
>> +             return -1;
>> +
>> +     spin_lock(&data->log_lock);
>> +
>> +     va_start(args, fmt);
>> +     len = vsnprintf(data->log_work_buff,
>> +                     sizeof(data->log_work_buff), fmt, args);
>> +     va_end(args);
>> +
>> +     if (len >= sizeof(data->log_work_buff)) {
>> +             hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg));
>> +     }
>> +
>> +     len = hot_debugfs_copy(data, data->log_work_buff, len);
>> +     spin_unlock(&data->log_lock);
>> +
>> +     return len;
>> +}
>
> Aren't you just recreating seq_printf() here? i.e. can't you replace
> all this complexity with generic seq_file/seq_operations constructs?
That seems like a good suggestion; let me try it. Thanks.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (13 preceding siblings ...)
  2012-10-15  0:39 ` [RFC v3 00/13] vfs: hot data tracking Zheng Liu
@ 2012-10-15 20:42 ` Dave Chinner
  2012-10-17  8:57   ` Zhi Yong Wu
  2012-10-19  8:29   ` Zhi Yong Wu
  2012-10-16  0:04 ` [PATCH] xfs: add hot tracking support Dave Chinner
  2012-10-16  0:11 ` [RFC v3 00/13] vfs: hot data tracking Dave Chinner
  16 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2012-10-15 20:42 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
> NOTE:
> 
>   The patchset is currently post out mainly to make sure
> it is going in the correct direction and hope to get some
> helpful comments from other guys.
>   For more information, please check hot_tracking.txt in Documentation
> 
> TODO List:

1) Fix OOM issues - the hot inode tracking caches grow very large
and don't get trimmed under memory pressure. From slabtop, after
creating roughly 24 million single byte files(*) on a machine with
8GB RAM:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
23859510 23859476  99%    0.12K 795317       30   3181268K hot_range_item
23859441 23859439  99%    0.16K 1037367       23   4149468K hot_inode_item
572530 572530 100%    0.55K  81790        7    327160K radix_tree_node
241706 241406  99%    0.22K  14218       17     56872K xfs_ili
241206 241204  99%    1.06K  80402        3    321608K xfs_inode

The inode tracking is trying to track all 24 million inodes even
though they have been written only once, and there are only 240,000
inodes in the cache at this point in time. That was the last update
that slabtop got, so it is indicative of the impending OOM situation
that occurred.
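
As a back-of-the-envelope check on those numbers, the two hot tracking
slab caches can be added up against the machine's memory. The sketch
below copies the CACHE SIZE column from the slabtop output above and
assumes "8GB" means 8 GiB; the macro and function names are made up for
the example.

```c
/* Slab cache sizes in KiB, copied from the slabtop output above. */
#define HOT_RANGE_ITEM_KIB 3181268UL
#define HOT_INODE_ITEM_KIB 4149468UL

/* ASSUMPTION: 8GB of RAM taken as 8 GiB, expressed in KiB. */
#define TOTAL_RAM_KIB (8UL * 1024 * 1024)

/* Combined size of the two hot tracking caches, in KiB. */
static unsigned long hot_cache_kib(void)
{
	return HOT_RANGE_ITEM_KIB + HOT_INODE_ITEM_KIB;
}

/* Share of RAM held by those caches, as a whole-number percentage. */
static unsigned long hot_cache_percent(void)
{
	return hot_cache_kib() * 100 / TOTAL_RAM_KIB;
}
```

That is 7330736K, or roughly 87% of RAM, held by tracking structures
alone.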

> Changelog from v2:
>  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
>  2.) Added memory shrinker [Dave Chinner]

I haven't looked at the shrinker, but clearly it is not working,
otherwise the above OOM situation would not be occurring.

Cheers,

Dave.

(*) Tested on an empty 17TB XFS filesystem with:

$ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=4563402735, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d \
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
/mnt/scratch/6  -d  /mnt/scratch/7
.....
     0     21600000            1      16679.3         12552262
     0     22400000            1      15412.4         12588587
     0     23200000            1      16367.6         14199322
     0     24000000            1      15680.4         15741205
<hangs here w/ OOM>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] xfs: add hot tracking support.
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (14 preceding siblings ...)
  2012-10-15 20:42 ` Dave Chinner
@ 2012-10-16  0:04 ` Dave Chinner
  2012-11-07  8:38   ` Zhi Yong Wu
  2012-10-16  0:11 ` [RFC v3 00/13] vfs: hot data tracking Dave Chinner
  16 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2012-10-16  0:04 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu


From: Dave Chinner <dchinner@redhat.com>

Connect up the VFS hot tracking support so XFS filesystems can make
use of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |    9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a631ca3..d5e7277 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -215,6 +215,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC		(1ULL << 0)	/* for nfs - all metadata ops
 						   must be synchronous except
 						   for space allocations */
+#define XFS_MOUNT_HOTTRACK	(1ULL << 1)	/* hot inode tracking */
 #define XFS_MOUNT_WAS_CLEAN	(1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 56c2537..17786ff 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -61,6 +61,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/parser.h>
+#include <linux/hot_tracking.h>
 
 static const struct super_operations xfs_super_operations;
 static kmem_zone_t *xfs_ioend_zone;
@@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_NODELAYLOG  "nodelaylog"	/* Delayed logging disabled */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
+#define MNTOPT_HOTTRACK	"hot_track"	/* hot inode tracking */
 
 /*
  * Table driven mount option parser.
@@ -371,6 +373,8 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+			mp->m_flags |= XFS_MOUNT_HOTTRACK;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -1040,6 +1044,9 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+		hot_track_exit(sb);
+
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
 
@@ -1470,6 +1477,8 @@ xfs_fs_fill_super(
 		error = ENOMEM;
 		goto out_unmount;
 	}
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+		hot_track_init(sb);
 
 	return 0;
 

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
                   ` (15 preceding siblings ...)
  2012-10-16  0:04 ` [PATCH] xfs: add hot tracking support Dave Chinner
@ 2012-10-16  0:11 ` Dave Chinner
  16 siblings, 0 replies; 55+ messages in thread
From: Dave Chinner @ 2012-10-16  0:11 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
> NOTE:
> 
>   The patchset is currently post out mainly to make sure
> it is going in the correct direction and hope to get some
> helpful comments from other guys.
>   For more infomation, please check hot_tracking.txt in Documentation
> 

# mount -o hot_track /dev/vdc /mnt/scratch
# umount /mnt/scratch

hangs here on XFS:

#  echo w > /proc/sysrq-trigger 
[   44.229252] SysRq : Show Blocked State
[   44.230044]   task                        PC stack   pid father
[   44.231187] umount          D ffff88021fd52dc0  3632  4107   4106 0x00000000
[   44.231946]  ffff880212a91bd8 0000000000000086 ffff8802153da300 ffff880212a91fd8
[   44.231946]  ffff880212a91fd8 ffff880212a91fd8 ffff880216cf6340 ffff8802153da300
[   44.231946]  00000000ffffffff 7fffffffffffffff ffff880212a91d78 ffff880212a91d80
[   44.231946] Call Trace:
[   44.231946]  [<ffffffff81b6c1b9>] schedule+0x29/0x70
[   44.231946]  [<ffffffff81b6a079>] schedule_timeout+0x159/0x220
[   44.231946]  [<ffffffff8171a2f4>] ? do_raw_spin_lock+0x54/0x120
[   44.231946]  [<ffffffff8171a45d>] ? do_raw_spin_unlock+0x5d/0xb0
[   44.231946]  [<ffffffff81b6bfee>] wait_for_common+0xee/0x190
[   44.231946]  [<ffffffff810b6a10>] ? try_to_wake_up+0x2f0/0x2f0
[   44.231946]  [<ffffffff81b6c18d>] wait_for_completion+0x1d/0x20
[   44.231946]  [<ffffffff8109efec>] flush_workqueue+0x14c/0x3f0
[   44.231946]  [<ffffffff811aad89>] hot_track_exit+0x39/0x180
[   44.231946]  [<ffffffff81454e83>] xfs_fs_put_super+0x23/0x70
[   44.231946]  [<ffffffff8117a991>] generic_shutdown_super+0x61/0xf0
[   44.231946]  [<ffffffff8117aa50>] kill_block_super+0x30/0x80
[   44.231946]  [<ffffffff8117ae45>] deactivate_locked_super+0x45/0x70
[   44.231946]  [<ffffffff8117ba0e>] deactivate_super+0x4e/0x70
[   44.231946]  [<ffffffff81197541>] mntput_no_expire+0x101/0x160
[   44.231946]  [<ffffffff811985b6>] sys_umount+0x76/0x3a0
[   44.231946]  [<ffffffff81b755a9>] system_call_fastpath+0x16/0x1b

because this is stuck:

[  200.064574] kworker/u:2     S ffff88021fc12dc0  5208   669      2 0x00000000
[  200.064574]  ffff88021532fc60 0000000000000046 ffff88021532c7c0 ffff88021532ffd8
[  200.064574]  ffff88021532ffd8 ffff88021532ffd8 ffffffff81fc3420 ffff88021532c7c0
[  200.064574]  ffff88021532fc50 ffff88021532fc98 ffffffff8221e700 ffffffff8221e700
[  200.064574] Call Trace:
[  200.064574]  [<ffffffff81b6c1b9>] schedule+0x29/0x70
[  200.064574]  [<ffffffff81b6a03b>] schedule_timeout+0x11b/0x220
[  200.064574]  [<ffffffff810908d0>] ? usleep_range+0x50/0x50
[  200.064574]  [<ffffffff811aa78a>] hot_temperature_update_work+0x16a/0x1c0
[  200.064574]  [<ffffffff8171a2f4>] ? do_raw_spin_lock+0x54/0x120
[  200.064574]  [<ffffffff81b6d3ce>] ? _raw_spin_unlock_irq+0xe/0x30
[  200.064574]  [<ffffffff810b0b3c>] ? finish_task_switch+0x5c/0x100
[  200.064574]  [<ffffffff8109d959>] process_one_work+0x139/0x500
[  200.064574]  [<ffffffff811aa620>] ? hot_range_update+0x1f0/0x1f0
[  200.064574]  [<ffffffff8109e63e>] worker_thread+0x15e/0x460
[  200.064574]  [<ffffffff8109e4e0>] ? manage_workers+0x2f0/0x2f0
[  200.064574]  [<ffffffff810a4b33>] kthread+0x93/0xa0
[  200.064574]  [<ffffffff81b765c4>] kernel_thread_helper+0x4/0x10
[  200.064574]  [<ffffffff810a4aa0>] ? __init_kthread_worker+0x40/0x40
[  200.064574]  [<ffffffff81b765c0>] ? gs_change+0x13/0x13

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
  2012-10-10 10:07 ` [RFC v3 09/13] vfs: add one wq to update map info periodically zwu.kernel
@ 2012-10-16  0:27   ` Dave Chinner
  2012-10-17  6:34     ` Zhi Yong Wu
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2012-10-16  0:27 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:31PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
>   Add a per-superblock workqueue and a work_struct
>  to run periodic work to update map info on each superblock.
> 
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
>  fs/hot_tracking.c            |   94 ++++++++++++++++++++++++++++++++++++++++++
>  fs/hot_tracking.h            |    3 +
>  include/linux/hot_tracking.h |    2 +
>  3 files changed, 99 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
> index a8dc599..f333c47 100644
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -15,6 +15,8 @@
>  #include <linux/module.h>
>  #include <linux/spinlock.h>
>  #include <linux/hardirq.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
>  #include <linux/fs.h>
>  #include <linux/blkdev.h>
>  #include <linux/types.h>
> @@ -623,6 +625,88 @@ static void hot_map_array_exit(struct hot_info *root)
>  }
>  
>  /*
> + * Update temperatures for each hot inode item and
> + * hot range item for aging purposes
> + */
> +static void hot_temperature_update_work(struct work_struct *work)
> +{
> +	struct hot_update_work *hot_work =
> +			container_of(work, struct hot_update_work, work);
> +	struct hot_info *root = hot_work->hot_info;
> +	struct hot_inode_item *hi_nodes[8];
> +	unsigned long delay = HZ * HEAT_UPDATE_DELAY;
> +	u64 ino = 0;
> +	int i, n;
> +
> +	do {
> +		while (1) {
> +			spin_lock(&root->lock);
> +			n = radix_tree_gang_lookup(&root->hot_inode_tree,
> +					   (void **)hi_nodes, ino,
> +					   ARRAY_SIZE(hi_nodes));
> +			if (!n) {
> +				spin_unlock(&root->lock);
> +				break;
> +			}
> +
> +			ino = hi_nodes[n - 1]->i_ino + 1;
> +			for (i = 0; i < n; i++) {
> +				kref_get(&hi_nodes[i]->hot_inode.refs);
> +				hot_map_array_update(
> +					&hi_nodes[i]->hot_inode.hot_freq_data, root);
> +				hot_range_update(hi_nodes[i], root);
> +				hot_inode_item_put(hi_nodes[i]);
> +			}
> +			spin_unlock(&root->lock);

This is a lot of work to do under a spin lock. Perhaps you should
get a reference on all the nodes, then drop the root->lock and then
update all the nodes in a separate loop.
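
Something along these lines, written as a single-threaded userspace toy
rather than real kernel code. The types, toy_lock()/toy_unlock() and
gang_lookup() are made-up stand-ins for root->lock, the kref and
radix_tree_gang_lookup(); only the two-step shape matters: pin a batch
under the lock, then do the updates with the lock dropped.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 8

/* Hypothetical stand-ins for the kernel objects; not the patch's types. */
struct item {
	unsigned long ino;
	int refs;	/* models the kref */
	int updates;	/* how many times the temperature was recomputed */
};

struct tracker {
	int locked;		/* models root->lock in this toy */
	struct item *items;	/* sorted by ino; models the radix tree */
	size_t nr_items;
};

static void toy_lock(struct tracker *t)
{
	assert(!t->locked);
	t->locked = 1;
}

static void toy_unlock(struct tracker *t)
{
	assert(t->locked);
	t->locked = 0;
}

/* Models a gang lookup: up to max items with ino >= first. */
static size_t gang_lookup(struct tracker *t, struct item **out,
			  unsigned long first, size_t max)
{
	size_t i, n = 0;

	for (i = 0; i < t->nr_items && n < max; i++)
		if (t->items[i].ino >= first)
			out[n++] = &t->items[i];
	return n;
}

/* The expensive part; the assert checks it never runs under the lock. */
static void update_item(struct tracker *t, struct item *it)
{
	assert(!t->locked);
	it->updates++;
}

/* One full pass over all items; returns the number of items updated. */
static size_t update_pass(struct tracker *t)
{
	struct item *batch[BATCH];
	unsigned long ino = 0;
	size_t i, n, total = 0;

	for (;;) {
		/* Step 1: under the lock, only look up and pin the items. */
		toy_lock(t);
		n = gang_lookup(t, batch, ino, BATCH);
		for (i = 0; i < n; i++)
			batch[i]->refs++;
		toy_unlock(t);
		if (!n)
			break;

		ino = batch[n - 1]->ino + 1;

		/* Step 2: lock dropped; do the real work, then unpin. */
		for (i = 0; i < n; i++) {
			update_item(t, batch[i]);
			batch[i]->refs--;
		}
		total += n;
	}
	return total;
}
```

The lock hold time is then bounded by the lookup of one batch, not by
however long the per-item updates take.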

> +		}
> +
> +		if (unlikely(freezing(current))) {
> +			__refrigerator(true);
> +		} else {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			if (!kthread_should_stop()) {
> +				schedule_timeout(delay);
> +			}
> +			__set_current_state(TASK_RUNNING);
> +		}
> +	} while (!kthread_should_stop());

I don't think you understand workqueues fully. A work queue worker
function is not something that executes endlessly. It is a
"one-shot" function that does the work once, not an endless loop
that has to delay its execution for periodic work.

If you need periodic work, then you should use a struct delayed_work
and queue the next work iteration to be run at a later time. See, for
example, xfs_syncd_worker() and xfs_syncd_queue_sync() and how that
reschedules itself for periodic work. It also means you don't have
to handle kthread freezing, as the WQ infrastructure takes care of
that for you.

This is why unmount is hanging for me - this work never completes,
so flush_workqueue() will never return.
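
To illustrate the one-shot-plus-requeue shape, here is a userspace toy
with a virtual clock. Everything prefixed sim_ is invented for the
example, and the 300s period is just an assumed value; in the kernel the
corresponding calls would be INIT_DELAYED_WORK(), queue_delayed_work()
and cancel_delayed_work_sync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_UPDATE_DELAY 300	/* seconds; assumed period for the example */

struct sim_delayed_work {
	void (*fn)(struct sim_delayed_work *dw);
	void *data;		/* owner of this work item */
	unsigned long expires;	/* virtual time at which fn should run */
	bool pending;
};

struct sim_wq {
	unsigned long now;		/* virtual clock, in seconds */
	struct sim_delayed_work *dw;	/* one queued item is enough here */
};

struct sim_hot_info {
	struct sim_wq *wq;
	unsigned long passes;		/* completed update passes */
	struct sim_delayed_work update_work;
};

static void sim_queue_delayed_work(struct sim_wq *wq,
				   struct sim_delayed_work *dw,
				   unsigned long delay)
{
	dw->expires = wq->now + delay;
	dw->pending = true;
	wq->dw = dw;
}

/* Models the unmount path: after this the work will not run again. */
static void sim_cancel_delayed_work(struct sim_delayed_work *dw)
{
	dw->pending = false;
}

/* Advance the virtual clock to 'until', running the work as it expires. */
static void sim_run_until(struct sim_wq *wq, unsigned long until)
{
	while (wq->dw && wq->dw->pending && wq->dw->expires <= until) {
		struct sim_delayed_work *dw = wq->dw;

		wq->now = dw->expires;
		dw->pending = false;
		dw->fn(dw);	/* one shot: the function must return */
	}
	wq->now = until;
}

/* One pass of work, then re-arm for the next period. No loop, no sleep. */
static void sim_update_worker(struct sim_delayed_work *dw)
{
	struct sim_hot_info *info = dw->data;

	assert(info != NULL);
	info->passes++;		/* stands in for the temperature update pass */
	sim_queue_delayed_work(info->wq, dw, SIM_UPDATE_DELAY);
}

static void sim_hot_track_init(struct sim_hot_info *info, struct sim_wq *wq)
{
	info->wq = wq;
	info->passes = 0;
	info->update_work.fn = sim_update_worker;
	info->update_work.data = info;
	sim_queue_delayed_work(wq, &info->update_work, SIM_UPDATE_DELAY);
}

static void sim_hot_track_exit(struct sim_hot_info *info)
{
	sim_cancel_delayed_work(&info->update_work);
}
```

Because the worker returns after every pass, teardown only has to cancel
the pending item; there is never a work function that is still sleeping
inside the queue.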

> +}
> +
> +static int hot_wq_init(struct hot_info *root)
> +{
> +	struct hot_update_work *hot_work;
> +	int ret = 0;
> +
> +	root->update_wq = alloc_workqueue(
> +		"hot_temperature_update", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> +	if (!root->update_wq) {
> +		printk(KERN_ERR "%s: failed to create "
> +			"temperature update workqueue\n",
> +			__func__);
> +		return 1;
> +	}
> +
> +	hot_work = kmalloc(sizeof(*hot_work), GFP_NOFS);
> +	if (hot_work) {
> +		hot_work->hot_info = root;
> +		INIT_WORK(&hot_work->work, hot_temperature_update_work);
> +		queue_work(root->update_wq, &hot_work->work);
> +	} else {
> +		printk(KERN_ERR "%s: failed to create update work\n",
> +				__func__);
> +		ret = 1;
> +	}

I don't understand why you need a separate "hot_work" structure.
just embed a struct delayed_work in the struct hot_info and use
container_of() to get the struct hot_info from the work structure.
As such, there's no need for a separate function just for this
initialisation - just put it in line.
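
A minimal userspace sketch of the embedding, assuming nothing about the
real structures: toy_work and toy_hot_info are hypothetical, and the
container_of() here is a plain offsetof() version of the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for a work item. */
struct toy_work {
	void (*func)(struct toy_work *work);
};

/* Hypothetical stand-in for struct hot_info. */
struct toy_hot_info {
	int nr_updates;
	struct toy_work update_work;	/* embedded, no separate allocation */
};

static void toy_update_func(struct toy_work *work)
{
	struct toy_hot_info *root;

	assert(work != NULL);
	/* Recover the owning structure from the embedded member. */
	root = container_of(work, struct toy_hot_info, update_work);
	root->nr_updates++;
}

/* Initialisation is two assignments, so it can live in the caller. */
static void toy_hot_info_init(struct toy_hot_info *root)
{
	root->nr_updates = 0;
	root->update_work.func = toy_update_func;
}
```

With the work item embedded there is no kmalloc() to fail and nothing
extra to free on teardown.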

> +
> +	return ret;
> +}
> +
> +static void hot_wq_exit(struct workqueue_struct *wq)
> +{
> +	flush_workqueue(wq);

flush_workqueue_sync().

> +	destroy_workqueue(wq);
> +}

And there's no need for a separate function for this - put it in
line.

FWIW, it also leaks the hot_work structure, but you're going to
remove that anyway. ;)

> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
> index d19e64a..7a79a6d 100644
> --- a/fs/hot_tracking.h
> +++ b/fs/hot_tracking.h
> @@ -36,6 +36,9 @@
>   */
>  #define TIME_TO_KICK 400
>  
> +/* set how often to update temperatures (seconds) */
> +#define HEAT_UPDATE_DELAY 400

FWIW, 400 seconds is an unusual time period. It's expected that
periodic work might take place at intervals of 5 minutes, 10
minutes, etc, not 6m40s. It's much easier to predict and understand
behaviour if it's at an interval of whole units like minutes,
especially when looking at timestamped event traces. Hence 300s (5
minutes) makes a lot more sense as a period for updates...

>  /*
>   * The following comments explain what exactly comprises a unit of heat.
>   *
> diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
> index 7114179..b37e0f8 100644
> --- a/include/linux/hot_tracking.h
> +++ b/include/linux/hot_tracking.h
> @@ -84,6 +84,8 @@ struct hot_info {
>  
>  	/* map of range temperature */
>  	struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
> +
> +	struct workqueue_struct *update_wq;

Add the struct delayed_work here, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
  2012-10-10 10:07 ` [RFC v3 11/13] vfs: add 3 new ioctl interfaces zwu.kernel
  2012-10-15  7:48   ` Dave Chinner
@ 2012-10-16  3:17   ` Dave Chinner
  2012-10-16  4:18     ` Zhi Yong Wu
  2012-10-19  8:21     ` Zhi Yong Wu
  1 sibling, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2012-10-16  3:17 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
>   FS_IOC_GET_HEAT_INFO: return a struct containing the various
> metrics collected in btrfs_freq_data structs, and also return a

I think you mean hot_freq_data :P

> calculated data temperature based on those metrics. Optionally, retrieve
> the temperature from the hot data hash list instead of recalculating it.

To get the heat info for a specific file you have to know what file
you want to get that info for, right?  I can see the usefulness of
asking for the heat data on a specific file, but how do you find the
hot files in the first place? i.e. the big question the user
interface needs to answer is "what files are hot?".

Once userspace knows what the hottest files are, it can open them
and query the data via the above ioctl, but expecting userspace to
iterate millions of inodes in a filesystem to find hot files is very
inefficient.

FWIW, if you were to return file handles to the hottest files, then
the application could open and query them without even needing to
know the path name to them. This would be exceedingly useful for
defragmentation programs, especially as that is the way xfs_fsr
already operates on candidate files.(*)

IOWs, sometimes the pathname is irrelevant to the operations that
applications want to perform - all they care about is having an
efficient method of finding the inode they want and getting a file
descriptor that points to the file. Given the heat map info fits
right in to the sort of operations defrag and data mover tools
already do, it kind of makes sense to optimise the interface towards
those uses....

(*) i.e. finds them via bulkstat which returns handle information
along with all the other inode data, then opens the file by handle
to do the defrag work....

>   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
> state of hot data tracking and migration:
> 
> 0 = do nothing
> 1 = track frequency of access
> 
>   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
> migration, as described above.

I can't see how this is a manageable interface. It is not
persistent, so after every filesystem mount you'd have to set the
flag on all your inodes again. Hence, for the moment, I'd suggest
dropping per-inode tracking control until all the core issues
are sorted out....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
  2012-10-16  3:17   ` Dave Chinner
@ 2012-10-16  4:18     ` Zhi Yong Wu
  2012-10-19  8:21     ` Zhi Yong Wu
  1 sibling, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-16  4:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   FS_IOC_GET_HEAT_INFO: return a struct containing the various
>> metrics collected in btrfs_freq_data structs, and also return a
>
> I think you mean hot_freq_data :P
Yeah, sorry.
>
>> calculated data temperature based on those metrics. Optionally, retrieve
>> the temperature from the hot data hash list instead of recalculating it.
>
> To get the heat info for a specific file you have to know what file
> you want to get that info for, right?  I can see the usefulness of
Yes.
> asking for the heat data on a specific file, but how do you find the
> hot files in the first place? i.e. the big question the user
> interface needs to answer is "what files are hot?".
We only tell the user what the files' temperatures are, not which files
are hot. The temperatures are exposed in the debugfs output.
>
> Once userspace knows what the hottest files are, it can open them
If the user needs to know this type of info, it is easy for us to
provide it. But I don't know which interface the user would expect to
get it through.
> and query the data via the above ioctl, but expecting userspace to
> iterate millions of inodes in a filesystem to find hot files is very
> inefficient.
>
> FWIW, if you were to return file handles to the hottest files, then
> the application could open and query them without even needing to
> know the path name to them. This woul dbe exceedingly useful for
> defragmentation programs, especially as that is the way xfs_fsr
> already operates on candidate files.(*)
ah.
>
> IOWs, sometimes the pathname is irrelevant to the operations that
> applications want to perform - all they care about having an
> efficient method of finding the inode they want and getting a file
> descriptor that points to the file. Given the heat map info fits
> right in to the sort of operations defrag and data mover tools
> already do, it kind of makes sense to optimise the interface towards
> those uses....
>
> (*) i.e. finds them via bulkstat which returns handle information
> along with all the other inode data, then opens the file by handle
> to do the defrag work....
OK.
>
>>   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
>> state of hot data tracking and migration:
>>
>> 0 = do nothing
>> 1 = track frequency of access
>>
>>   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
>> migration, as described above.
>
> I can't see how this is a manageable interface. It is not
> persistent, so after every filesystem mount you'd have to set the
> flag on all your inodes again. Hence, for the moment, I'd suggest
> that dropping per-inode tracking control until all the core issues
> are sorted out....
OK.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
  2012-10-16  0:27   ` Dave Chinner
@ 2012-10-17  6:34     ` Zhi Yong Wu
  2012-10-18  2:25       ` Zheng Liu
  0 siblings, 1 reply; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-17  6:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Tue, Oct 16, 2012 at 8:27 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:31PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   Add a per-superblock workqueue and a work_struct
>>  to run periodic work to update map info on each superblock.
>>
>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> ---
>>  fs/hot_tracking.c            |   94 ++++++++++++++++++++++++++++++++++++++++++
>>  fs/hot_tracking.h            |    3 +
>>  include/linux/hot_tracking.h |    2 +
>>  3 files changed, 99 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>> index a8dc599..f333c47 100644
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -15,6 +15,8 @@
>>  #include <linux/module.h>
>>  #include <linux/spinlock.h>
>>  #include <linux/hardirq.h>
>> +#include <linux/kthread.h>
>> +#include <linux/freezer.h>
>>  #include <linux/fs.h>
>>  #include <linux/blkdev.h>
>>  #include <linux/types.h>
>> @@ -623,6 +625,88 @@ static void hot_map_array_exit(struct hot_info *root)
>>  }
>>
>>  /*
>> + * Update temperatures for each hot inode item and
>> + * hot range item for aging purposes
>> + */
>> +static void hot_temperature_update_work(struct work_struct *work)
>> +{
>> +     struct hot_update_work *hot_work =
>> +                     container_of(work, struct hot_update_work, work);
>> +     struct hot_info *root = hot_work->hot_info;
>> +     struct hot_inode_item *hi_nodes[8];
>> +     unsigned long delay = HZ * HEAT_UPDATE_DELAY;
>> +     u64 ino = 0;
>> +     int i, n;
>> +
>> +     do {
>> +             while (1) {
>> +                     spin_lock(&root->lock);
>> +                     n = radix_tree_gang_lookup(&root->hot_inode_tree,
>> +                                        (void **)hi_nodes, ino,
>> +                                        ARRAY_SIZE(hi_nodes));
>> +                     if (!n) {
>> +                             spin_unlock(&root->lock);
>> +                             break;
>> +                     }
>> +
>> +                     ino = hi_nodes[n - 1]->i_ino + 1;
>> +                     for (i = 0; i < n; i++) {
>> +                             kref_get(&hi_nodes[i]->hot_inode.refs);
>> +                             hot_map_array_update(
>> +                                     &hi_nodes[i]->hot_inode.hot_freq_data, root);
>> +                             hot_range_update(hi_nodes[i], root);
>> +                             hot_inode_item_put(hi_nodes[i]);
>> +                     }
>> +                     spin_unlock(&root->lock);
>
> This is a lot of work to do under a spin lock. Perhaps you should
> get a reference on all the nodes, then drop the root->lock and then
> update all the nodes in a separate loop.
OK, done
>
>> +             }
>> +
>> +             if (unlikely(freezing(current))) {
>> +                     __refrigerator(true);
>> +             } else {
>> +                     set_current_state(TASK_INTERRUPTIBLE);
>> +                     if (!kthread_should_stop()) {
>> +                             schedule_timeout(delay);
>> +                     }
>> +                     __set_current_state(TASK_RUNNING);
>> +             }
>> +     } while (!kthread_should_stop());
>
> I don't think you understand workqueues fully. A work queue worker
> function is not something that executes endlessly. It is a
> "one-shot" function that does the work once, not an endless loop
> that has to delay it's execution for periodic work.
Ah, I have done this based on your suggestions below, thanks.
>
> If you need periodic work, then you should use a struct delayed_work
> and queue the next work iteration to be run a later time. See, for
> example, xfs_syncd_worker() and xfs_syncd_queue_sync() and how that
> reschedules itself for periodic work. It also means you don't have
> to handle kthread freezing, as the WQ infrastructure takes care of
> that for you.
ditto.
>
> This is why unmount is hanging for me - this work never completes,
> so flush_workqueue() will never return.
got it, thanks.
>
>> +}
>> +
>> +static int hot_wq_init(struct hot_info *root)
>> +{
>> +     struct hot_update_work *hot_work;
>> +     int ret = 0;
>> +
>> +     root->update_wq = alloc_workqueue(
>> +             "hot_temperature_update", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
>> +     if (!root->update_wq) {
>> +             printk(KERN_ERR "%s: failed to create "
>> +                     "temperature update workqueue\n",
>> +                     __func__);
>> +             return 1;
>> +     }
>> +
>> +     hot_work = kmalloc(sizeof(*hot_work), GFP_NOFS);
>> +     if (hot_work) {
>> +             hot_work->hot_info = root;
>> +             INIT_WORK(&hot_work->work, hot_temperature_update_work);
>> +             queue_work(root->update_wq, &hot_work->work);
>> +     } else {
>> +             printk(KERN_ERR "%s: failed to create update work\n",
>> +                             __func__);
>> +             ret = 1;
>> +     }
>
> I don't understand why you need a separate "hot_work" structure.
> just embed a struct delayed_work in the struct hot_info and use
> container_of() to get the struct hot_info from the work structure.
> As such, there's no need for a separate function just for this
> initialisation - just put it in line.
OK, done.
>
>> +
>> +     return ret;
>> +}
>> +
>> +static void hot_wq_exit(struct workqueue_struct *wq)
>> +{
>> +     flush_workqueue(wq);
>
> flush_workqueue_sync().
done, thanks
>
>> +     destroy_workqueue(wq);
>> +}
>
> And there's not need for separate function for this - put it in
> line.
ditto.
>
> FWIW, it also leaks the hot_work structure, but you're going to
> remove that anyway. ;)
>
>> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
>> index d19e64a..7a79a6d 100644
>> --- a/fs/hot_tracking.h
>> +++ b/fs/hot_tracking.h
>> @@ -36,6 +36,9 @@
>>   */
>>  #define TIME_TO_KICK 400
>>
>> +/* set how often to update temperatures (seconds) */
>> +#define HEAT_UPDATE_DELAY 400
>
> FWIW, 400 seconds is an unusual time period. It's expected that
> periodic work might take place at intervals of 5 minutes, 10
> minutes, etc, not 6m40s. It's much easier to predict and understand
> behaviour if it's at a interval of whole units like minutes,
> especially when looking at timestamped event traces. Hence 300s (5
> minutes) makes a lot more sense as a period for updates...
got it. thanks.
>
>>  /*
>>   * The following comments explain what exactly comprises a unit of heat.
>>   *
>> diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
>> index 7114179..b37e0f8 100644
>> --- a/include/linux/hot_tracking.h
>> +++ b/include/linux/hot_tracking.h
>> @@ -84,6 +84,8 @@ struct hot_info {
>>
>>       /* map of range temperature */
>>       struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
>> +
>> +     struct workqueue_struct *update_wq;
>
> Add the struct delayed_work here, too.
ditto
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-15 20:42 ` Dave Chinner
@ 2012-10-17  8:57   ` Zhi Yong Wu
  2012-10-18  4:29     ` Dave Chinner
  2012-10-19  8:29   ` Zhi Yong Wu
  1 sibling, 1 reply; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-17  8:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>> NOTE:
>>
>>   The patchset is currently post out mainly to make sure
>> it is going in the correct direction and hope to get some
>> helpful comments from other guys.
>>   For more infomation, please check hot_tracking.txt in Documentation
>>
>> TODO List:
>
> 1) Fix OOM issues - the hot inode tracking caches grow very large
> and don't get trimmed under memory pressure. From slabtop, after
> creating roughly 24 million single byte files(*) on a machine with
> 8GB RAM:
>
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 23859510 23859476  99%    0.12K 795317       30   3181268K hot_range_item
> 23859441 23859439  99%    0.16K 1037367       23   4149468K hot_inode_item
> 572530 572530 100%    0.55K  81790        7    327160K radix_tree_node
> 241706 241406  99%    0.22K  14218       17     56872K xfs_ili
> 241206 241204  99%    1.06K  80402        3    321608K xfs_inode
>
> The inode tracking is trying to track all 24 million inodes even
> though they have been written only once, and there are only 240,000
> inodes in the cache at this point in time. That was the last update
> that slabtop got, so it is indicative of the impending OOM situation
> that occurred.
>
>> Changelog from v2:
>>  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
>>  2.) Added memory shrinker [Dave Chinner]
>
> I haven't looked at the shrinker, but clearly it is not working,
> otherwise the above OOM situation would not be occurring.
>
> Cheers,
>
> Dave.
>
> (*) Tested on an empty 17TB XFS filesystem with:
>
> $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
> meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=4563402735, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
> $ sudo chmod 777 /mnt/scratch
> $ fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d \
> /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
> /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
> /mnt/scratch/6  -d  /mnt/scratch/7
> .....
>      0     21600000            1      16679.3         12552262
>      0     22400000            1      15412.4         12588587
>      0     23200000            1      16367.6         14199322
>      0     24000000            1      15680.4         15741205
> <hangs here w/ OOM>
^^^^In this test, I don't see you enabling the hot_track function via
mount, so why did it hit OOM?
>
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
  2012-10-17  6:34     ` Zhi Yong Wu
@ 2012-10-18  2:25       ` Zheng Liu
  2012-10-18  2:26         ` Zhi Yong Wu
  0 siblings, 1 reply; 55+ messages in thread
From: Zheng Liu @ 2012-10-18  2:25 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Dave Chinner, linux-fsdevel, linux-ext4, linux-btrfs,
	linux-kernel, linuxram, viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 17, 2012 at 02:34:15PM +0800, Zhi Yong Wu wrote:
> >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
> >> index d19e64a..7a79a6d 100644
> >> --- a/fs/hot_tracking.h
> >> +++ b/fs/hot_tracking.h
> >> @@ -36,6 +36,9 @@
> >>   */
> >>  #define TIME_TO_KICK 400
> >>
> >> +/* set how often to update temperatures (seconds) */
> >> +#define HEAT_UPDATE_DELAY 400
> >
> > FWIW, 400 seconds is an unusual time period. It's expected that
> > periodic work might take place at intervals of 5 minutes, 10
> > minutes, etc, not 6m40s. It's much easier to predict and understand
> > behaviour if it's at a interval of whole units like minutes,
> > especially when looking at timestamped event traces. Hence 300s (5
> > minutes) makes a lot more sense as a period for updates...
> got it. thanks.

Hi Zhi Yong,

IMHO we'd better make this value tunable, so that the user can adjust
it dynamically.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
  2012-10-18  2:25       ` Zheng Liu
@ 2012-10-18  2:26         ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-18  2:26 UTC (permalink / raw)
  To: Zhi Yong Wu, Dave Chinner, linux-fsdevel, linux-ext4,
	linux-btrfs, linux-kernel, linuxram, viro, dave, tytso, cmm,
	Zhi Yong Wu

On Thu, Oct 18, 2012 at 10:25 AM, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> On Wed, Oct 17, 2012 at 02:34:15PM +0800, Zhi Yong Wu wrote:
>> >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
>> >> index d19e64a..7a79a6d 100644
>> >> --- a/fs/hot_tracking.h
>> >> +++ b/fs/hot_tracking.h
>> >> @@ -36,6 +36,9 @@
>> >>   */
>> >>  #define TIME_TO_KICK 400
>> >>
>> >> +/* set how often to update temperatures (seconds) */
>> >> +#define HEAT_UPDATE_DELAY 400
>> >
>> > FWIW, 400 seconds is an unusual time period. It's expected that
>> > periodic work might take place at intervals of 5 minutes, 10
>> > minutes, etc, not 6m40s. It's much easier to predict and understand
>> > behaviour if it's at a interval of whole units like minutes,
>> > especially when looking at timestamped event traces. Hence 300s (5
>> > minutes) makes a lot more sense as a period for updates...
>> got it. thanks.
>
> Hi Zhi Yong,
>
> IMHO we'd better to make this value parameterized, and then the user
> can adjust this value dynamically.
Yeah, this is already on my TODO list, but I want to get the core
functionality working first.

>
> Regards,
> Zheng



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-17  8:57   ` Zhi Yong Wu
@ 2012-10-18  4:29     ` Dave Chinner
  2012-10-18  4:44       ` Zhi Yong Wu
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2012-10-18  4:29 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote:
> On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
....
> > (*) Tested on an empty 17TB XFS filesystem with:
> >
> > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
> > meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
> >          =                       sectsz=512   attr=2, projid32bit=0
> > data     =                       bsize=4096   blocks=4563402735, imaxpct=5
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0
> > log      =internal log           bsize=4096   blocks=131072, version=2
> >          =                       sectsz=512   sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
> > $ sudo chmod 777 /mnt/scratch
> > $ fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d \
> > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
> > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
> > /mnt/scratch/6  -d  /mnt/scratch/7
> > .....
> >      0     21600000            1      16679.3         12552262
> >      0     22400000            1      15412.4         12588587
> >      0     23200000            1      16367.6         14199322
> >      0     24000000            1      15680.4         15741205
> > <hangs here w/ OOM>
> ^^^^In this test, i haven't see you enable hot_track function via
> mount, why did it meet OOM?

I copied the wrong mount command. It was definitely enabled.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-18  4:29     ` Dave Chinner
@ 2012-10-18  4:44       ` Zhi Yong Wu
  2012-10-18  5:17         ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-18  4:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote:
>> On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
>> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ....
>> > (*) Tested on an empty 17TB XFS filesystem with:
>> >
>> > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
>> > meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
>> >          =                       sectsz=512   attr=2, projid32bit=0
>> > data     =                       bsize=4096   blocks=4563402735, imaxpct=5
>> >          =                       sunit=0      swidth=0 blks
>> > naming   =version 2              bsize=4096   ascii-ci=0
>> > log      =internal log           bsize=4096   blocks=131072, version=2
>> >          =                       sectsz=512   sunit=1 blks, lazy-count=1
>> > realtime =none                   extsz=4096   blocks=0, rtextents=0
>> > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
>> > $ sudo chmod 777 /mnt/scratch
>> > $ fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d \
>> > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
>> > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
>> > /mnt/scratch/6  -d  /mnt/scratch/7
>> > .....
>> >      0     21600000            1      16679.3         12552262
>> >      0     22400000            1      15412.4         12588587
>> >      0     23200000            1      16367.6         14199322
>> >      0     24000000            1      15680.4         15741205
>> > <hangs here w/ OOM>
>> ^^^^In this test, i haven't see you enable hot_track function via
>> mount, why did it meet OOM?
>
> I copied the wrong mount command. It was definitely enabled.
OK. BTW, is fs_mark a script you wrote? I can't find this command
after installing xfsprogs.

>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-18  4:44       ` Zhi Yong Wu
@ 2012-10-18  5:17         ` Dave Chinner
  2012-10-18  5:24           ` Zhi Yong Wu
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2012-10-18  5:17 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Thu, Oct 18, 2012 at 12:44:47PM +0800, Zhi Yong Wu wrote:
> On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote:
> >> On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
> >> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> > ....
> >> > (*) Tested on an empty 17TB XFS filesystem with:
> >> >
> >> > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
> >> > meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
> >> >          =                       sectsz=512   attr=2, projid32bit=0
> >> > data     =                       bsize=4096   blocks=4563402735, imaxpct=5
> >> >          =                       sunit=0      swidth=0 blks
> >> > naming   =version 2              bsize=4096   ascii-ci=0
> >> > log      =internal log           bsize=4096   blocks=131072, version=2
> >> >          =                       sectsz=512   sunit=1 blks, lazy-count=1
> >> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >> > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
> >> > $ sudo chmod 777 /mnt/scratch
> >> > $ fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d \
> >> > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
> >> > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
> >> > /mnt/scratch/6  -d  /mnt/scratch/7
> >> > .....
> >> >      0     21600000            1      16679.3         12552262
> >> >      0     22400000            1      15412.4         12588587
> >> >      0     23200000            1      16367.6         14199322
> >> >      0     24000000            1      15680.4         15741205
> >> > <hangs here w/ OOM>
> >> ^^^^In this test, i haven't see you enable hot_track function via
> >> mount, why did it meet OOM?
> >
> > I copied the wrong mount command. It was definitely enabled.
> OK, BTW: fs_mark is the script written by you? After xfsprogs is
> installed, i haven't found this command.

# apt-get install fsmark

Or get the source here:

http://sourceforge.net/projects/fsmark/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-18  5:17         ` Dave Chinner
@ 2012-10-18  5:24           ` Zhi Yong Wu
  0 siblings, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-18  5:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Thu, Oct 18, 2012 at 1:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 18, 2012 at 12:44:47PM +0800, Zhi Yong Wu wrote:
>> On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote:
>> >> On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
>> >> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> > ....
>> >> > (*) Tested on an empty 17TB XFS filesystem with:
>> >> >
>> >> > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
>> >> > meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
>> >> >          =                       sectsz=512   attr=2, projid32bit=0
>> >> > data     =                       bsize=4096   blocks=4563402735, imaxpct=5
>> >> >          =                       sunit=0      swidth=0 blks
>> >> > naming   =version 2              bsize=4096   ascii-ci=0
>> >> > log      =internal log           bsize=4096   blocks=131072, version=2
>> >> >          =                       sectsz=512   sunit=1 blks, lazy-count=1
>> >> > realtime =none                   extsz=4096   blocks=0, rtextents=0
>> >> > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
>> >> > $ sudo chmod 777 /mnt/scratch
>> >> > $ fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d \
>> >> > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
>> >> > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
>> >> > /mnt/scratch/6  -d  /mnt/scratch/7
>> >> > .....
>> >> >      0     21600000            1      16679.3         12552262
>> >> >      0     22400000            1      15412.4         12588587
>> >> >      0     23200000            1      16367.6         14199322
>> >> >      0     24000000            1      15680.4         15741205
>> >> > <hangs here w/ OOM>
>> >> ^^^^In this test, i haven't see you enable hot_track function via
>> >> mount, why did it meet OOM?
>> >
>> > I copied the wrong mount command. It was definitely enabled.
>> OK, BTW: fs_mark is the script written by you? After xfsprogs is
>> installed, i haven't found this command.
>
> # apt-get install fsmark
Thanks, let me try.
>
> Or get the source here:
>
> http://sourceforge.net/projects/fsmark/
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

* Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
  2012-10-16  3:17   ` Dave Chinner
  2012-10-16  4:18     ` Zhi Yong Wu
@ 2012-10-19  8:21     ` Zhi Yong Wu
  1 sibling, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-19  8:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   FS_IOC_GET_HEAT_INFO: return a struct containing the various
>> metrics collected in btrfs_freq_data structs, and also return a
>
> I think you mean hot_freq_data :P
>
>> calculated data temperature based on those metrics. Optionally, retrieve
>> the temperature from the hot data hash list instead of recalculating it.
>
> To get the heat info for a specific file you have to know what file
> you want to get that info for, right?  I can see the usefulness of
> asking for the heat data on a specific file, but how do you find the
> hot files in the first place? i.e. the big question the user
> interface needs to answer is "what files are hot?".
>
> Once userspace knows what the hottest files are, it can open them
> and query the data via the above ioctl, but expecting userspace to
> iterate millions of inodes in a filesystem to find hot files is very
> inefficient.
>
> FWIW, if you were to return file handles to the hottest files, then
Good idea, but I am not clear on how to implement it. By file handles,
do you mean file_handle? How should they be returned to the application,
via debugfs? And how many of the hottest files should be returned, the
top 100?
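
To make the "top N" question concrete, here is a minimal userspace sketch
of picking the N hottest entries from a flat array. The hot_entry record
and the hot_top_n() helper are hypothetical and not taken from the
patchset; sorting by a single temperature value is an assumption about
how "hottest" would be ranked.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record; not a structure from the patchset. */
struct hot_entry {
	uint64_t ino;          /* inode number */
	uint32_t temperature;  /* larger means hotter */
};

/* qsort comparator: descending temperature, ties broken by ascending ino. */
static int hot_entry_cmp(const void *a, const void *b)
{
	const struct hot_entry *x = a, *y = b;

	if (x->temperature != y->temperature)
		return x->temperature < y->temperature ? 1 : -1;
	if (x->ino != y->ino)
		return x->ino < y->ino ? -1 : 1;
	return 0;
}

/*
 * Sort entries[] in place, hottest first, and return how many of the
 * leading entries make up the requested top-N (at most count).
 */
static size_t hot_top_n(struct hot_entry *entries, size_t count, size_t n)
{
	qsort(entries, count, sizeof(*entries), hot_entry_cmp);
	return n < count ? n : count;
}
```

Sorting every tracked inode is only reasonable for a small illustration.
A real interface would presumably walk whatever temperature-ordered
structure the kernel already maintains and stop after N entries, with N
supplied by the caller rather than fixed at 100.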

> the application could open and query them without even needing to
> know the path name to them. This would be exceedingly useful for
> defragmentation programs, especially as that is the way xfs_fsr
> already operates on candidate files.(*)
>
> IOWs, sometimes the pathname is irrelevant to the operations that
> applications want to perform - all they care about having an
> efficient method of finding the inode they want and getting a file
> descriptor that points to the file. Given the heat map info fits
> right in to the sort of operations defrag and data mover tools
> already do, it kind of makes sense to optimise the interface towards
> those uses....
>
> (*) i.e. finds them via bulkstat which returns handle information
> along with all the other inode data, then opens the file by handle
> to do the defrag work....
>
>>   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
>> state of hot data tracking and migration:
>>
>> 0 = do nothing
>> 1 = track frequency of access
>>
>>   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
>> migration, as described above.
>
> I can't see how this is a manageable interface. It is not
> persistent, so after every filesystem mount you'd have to set the
> flag on all your inodes again. Hence, for the moment, I'd suggest
> that dropping per-inode tracking control until all the core issues
> are sorted out....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

* Re: [RFC v3 00/13] vfs: hot data tracking
  2012-10-15 20:42 ` Dave Chinner
  2012-10-17  8:57   ` Zhi Yong Wu
@ 2012-10-19  8:29   ` Zhi Yong Wu
  1 sibling, 0 replies; 55+ messages in thread
From: Zhi Yong Wu @ 2012-10-19  8:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>> NOTE:
>>
>>   The patchset is currently post out mainly to make sure
>> it is going in the correct direction and hope to get some
>> helpful comments from other guys.
>>   For more information, please check hot_tracking.txt in Documentation
>>
>> TODO List:
>
> 1) Fix OOM issues - the hot inode tracking caches grow very large
> and don't get trimmed under memory pressure. From slabtop, after
> creating roughly 24 million single byte files(*) on a machine with
> 8GB RAM:
>
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 23859510 23859476  99%    0.12K 795317       30   3181268K hot_range_item
> 23859441 23859439  99%    0.16K 1037367       23   4149468K hot_inode_item
> 572530 572530 100%    0.55K  81790        7    327160K radix_tree_node
> 241706 241406  99%    0.22K  14218       17     56872K xfs_ili
> 241206 241204  99%    1.06K  80402        3    321608K xfs_inode
>
> The inode tracking is trying to track all 24 million inodes even
> though they have been written only once, and there are only 240,000
> inodes in the cache at this point in time. That was the last update
> that slabtop got, so it is indicative of the impending OOM situation
> that occurred.
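
The slabtop figures quoted above can be cross-checked with a little
arithmetic. The sketch below, with the two rows hard-coded from that
output, confirms that SLABS * OBJ/SLAB matches OBJS for both hot-tracking
caches and adds up their sizes: 7330736 KiB in total, just under 7 GiB.
Treating the machine's 8GB of RAM as 8 GiB is an assumption, and the
helper names are made up for illustration.

```c
#include <stdint.h>

/* One slabtop row; values are copied from the output quoted above. */
struct slab_row {
	uint64_t objs;          /* OBJS */
	uint64_t slabs;         /* SLABS */
	uint64_t objs_per_slab; /* OBJ/SLAB */
	uint64_t cache_kib;     /* CACHE SIZE, in KiB */
};

static const struct slab_row hot_range_item = { 23859510, 795317, 30, 3181268 };
static const struct slab_row hot_inode_item = { 23859441, 1037367, 23, 4149468 };

/* Object capacity implied by SLABS * OBJ/SLAB. */
static uint64_t slab_capacity(const struct slab_row *r)
{
	return r->slabs * r->objs_per_slab;
}

/* Combined size of both hot-tracking caches, in KiB. */
static uint64_t hot_cache_total_kib(void)
{
	return hot_range_item.cache_kib + hot_inode_item.cache_kib;
}

/* Percentage (rounded down) of ram_kib consumed by the hot-tracking caches. */
static unsigned int hot_cache_percent_of(uint64_t ram_kib)
{
	return (unsigned int)(hot_cache_total_kib() * 100 / ram_kib);
}
```

With ram_kib set to 8 * 1024 * 1024, hot_cache_percent_of() returns 87,
which is consistent with the machine running out of memory while only
about 240,000 inodes were actually cached.
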
>
>> Changelog from v2:
>>  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
>>  2.) Added memory shrinker [Dave Chinner]
>
> I haven't looked at the shrinker, but clearly it is not working,
Hi Dave,
Some people have suggested that when the inode slab cache is shrunk, the
hot_inode_item and hot_range_item slabs should be shrunk along with it,
so that hot tracking would not need to register its own shrinker. What
do you think of that?
If you don't like that idea, do you have any good suggestion on how
to remove hot_inode_item and hot_range_item?
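
The first option can be pictured with a small userspace model: tracking
state is allocated on first access and dropped when the owning inode is
evicted, so the amount of tracking memory can never exceed the number of
cached inodes. Everything below (hot_item, hot_track_access,
hot_inode_evicted, the fixed-size table) is hypothetical and only models
the idea; it is not code from the patchset.

```c
#include <stdint.h>
#include <stdlib.h>

#define NR_SLOTS 64   /* tiny fixed-size table, for illustration only */

/* Hypothetical stand-in for per-inode heat state. */
struct hot_item {
	uint64_t ino;
	unsigned int nr_accesses;
};

static struct hot_item *slots[NR_SLOTS];
static unsigned int nr_tracked;

static struct hot_item **slot_for(uint64_t ino)
{
	return &slots[ino % NR_SLOTS];
}

/* Record one access; allocates tracking state on first use. Returns 0 or -1. */
static int hot_track_access(uint64_t ino)
{
	struct hot_item **s = slot_for(ino);

	if (*s && (*s)->ino != ino)
		return -1;      /* slot collision: not handled in this sketch */
	if (!*s) {
		*s = calloc(1, sizeof(**s));
		if (!*s)
			return -1;
		(*s)->ino = ino;
		nr_tracked++;
	}
	(*s)->nr_accesses++;
	return 0;
}

/* Called when the inode leaves the inode cache: drop its heat state too. */
static void hot_inode_evicted(uint64_t ino)
{
	struct hot_item **s = slot_for(ino);

	if (*s && (*s)->ino == ino) {
		free(*s);
		*s = NULL;
		nr_tracked--;
	}
}
```

The cost of this coupling is that an inode's heat history is lost as soon
as the inode itself is evicted, which may or may not be acceptable
depending on how long-lived the temperature data is meant to be.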

> otherwise the above OOM situation would not be occurring.
>
> Cheers,
>
> Dave.
>
> (*) Tested on an empty 17TB XFS filesystem with:
>
> $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
> meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=4563402735, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
> $ sudo chmod 777 /mnt/scratch
> $ fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d \
> /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
> /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
> /mnt/scratch/6  -d  /mnt/scratch/7
> .....
>      0     21600000            1      16679.3         12552262
>      0     22400000            1      15412.4         12588587
>      0     23200000            1      16367.6         14199322
>      0     24000000            1      15680.4         15741205
> <hangs here w/ OOM>
>
> --
> Dave Chinner
> david@fromorbit.com



-- 
Regards,

Zhi Yong Wu

* Re: [PATCH] xfs: add hot tracking support.
  2012-10-16  0:04 ` [PATCH] xfs: add hot tracking support Dave Chinner
@ 2012-11-07  8:38   ` Zhi Yong Wu
  2012-11-08  5:13     ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: Zhi Yong Wu @ 2012-11-07  8:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

Hi Dave,

I guess that some hot tracking handling should also be added to one of the
xfs_show_xxx functions, right?

On Tue, Oct 16, 2012 at 8:04 AM, Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Connect up the VFS hot tracking support so XFS filesystems can make
> use of it.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.h |    1 +
>  fs/xfs/xfs_super.c |    9 +++++++++
>  2 files changed, 10 insertions(+)
>
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index a631ca3..d5e7277 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -215,6 +215,7 @@ typedef struct xfs_mount {
>  #define XFS_MOUNT_WSYNC                (1ULL << 0)     /* for nfs - all metadata ops
>                                                    must be synchronous except
>                                                    for space allocations */
> +#define XFS_MOUNT_HOTTRACK     (1ULL << 1)     /* hot inode tracking */
>  #define XFS_MOUNT_WAS_CLEAN    (1ULL << 3)
>  #define XFS_MOUNT_FS_SHUTDOWN  (1ULL << 4)     /* atomic stop of all filesystem
>                                                    operations, typically for
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 56c2537..17786ff 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -61,6 +61,7 @@
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
>  #include <linux/parser.h>
> +#include <linux/hot_tracking.h>
>
>  static const struct super_operations xfs_super_operations;
>  static kmem_zone_t *xfs_ioend_zone;
> @@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool;
>  #define MNTOPT_NODELAYLOG  "nodelaylog"        /* Delayed logging disabled */
>  #define MNTOPT_DISCARD    "discard"    /* Discard unused blocks */
>  #define MNTOPT_NODISCARD   "nodiscard" /* Do not discard unused blocks */
> +#define MNTOPT_HOTTRACK        "hot_track"     /* hot inode tracking */
>
>  /*
>   * Table driven mount option parser.
> @@ -371,6 +373,8 @@ xfs_parseargs(
>                         mp->m_flags |= XFS_MOUNT_DISCARD;
>                 } else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
>                         mp->m_flags &= ~XFS_MOUNT_DISCARD;
> +               } else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
> +                       mp->m_flags |= XFS_MOUNT_HOTTRACK;
>                 } else if (!strcmp(this_char, "ihashsize")) {
>                         xfs_warn(mp,
>         "ihashsize no longer used, option is deprecated.");
> @@ -1040,6 +1044,9 @@ xfs_fs_put_super(
>  {
>         struct xfs_mount        *mp = XFS_M(sb);
>
> +       if (mp->m_flags & XFS_MOUNT_HOTTRACK)
> +               hot_track_exit(sb);
> +
>         xfs_filestream_unmount(mp);
>         xfs_unmountfs(mp);
>
> @@ -1470,6 +1477,8 @@ xfs_fs_fill_super(
>                 error = ENOMEM;
>                 goto out_unmount;
>         }
> +       if (mp->m_flags & XFS_MOUNT_HOTTRACK)
> +               hot_track_init(sb);
>
>         return 0;
>



-- 
Regards,

Zhi Yong Wu

* Re: [PATCH] xfs: add hot tracking support.
  2012-11-07  8:38   ` Zhi Yong Wu
@ 2012-11-08  5:13     ` Dave Chinner
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Chinner @ 2012-11-08  5:13 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, linux-kernel, linuxram,
	viro, dave, tytso, cmm, Zhi Yong Wu

On Wed, Nov 07, 2012 at 04:38:23PM +0800, Zhi Yong Wu wrote:
> HI, Dave,
> 
> I guess that you should add some hot tracking stuff in some
> xfs_show_xxx function, right?

Yes, it should - I thought I did that. I recall seeing it in
/proc/mounts, but maybe I was just hallucinating. I'll send an
updated version when I get to fixing it.
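
For illustration, reporting a mount flag back as an option string can be
done with a small flag-to-name table. This is a userspace sketch of that
general idea, not the XFS code: the two flag values and the "hot_track"
string come from the patch quoted earlier in this thread, the "wsync"
string is assumed, and show_options() and opt_names are hypothetical
names. The kernel writes through a seq_file rather than into a buffer.

```c
#include <stdint.h>
#include <stdio.h>

/* Flag values as they appear in the quoted xfs_mount.h hunk. */
#define XFS_MOUNT_WSYNC     (1ULL << 0)
#define XFS_MOUNT_HOTTRACK  (1ULL << 1)

struct opt_name {
	uint64_t flag;
	const char *name;
};

static const struct opt_name opt_names[] = {
	{ XFS_MOUNT_WSYNC,    "wsync" },      /* option string assumed */
	{ XFS_MOUNT_HOTTRACK, "hot_track" },  /* MNTOPT_HOTTRACK in the patch */
};

/*
 * Write ",name" for every flag set in m_flags into buf, which is always
 * left NUL-terminated. Returns 0 on success, -1 if buf is too small.
 */
static int show_options(uint64_t m_flags, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(opt_names) / sizeof(opt_names[0]); i++) {
		int n;

		if (!(m_flags & opt_names[i].flag))
			continue;
		n = snprintf(buf + used, len - used, ",%s", opt_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

With only XFS_MOUNT_HOTTRACK set, the buffer ends up holding ",hot_track",
which is the form in which the option would be expected to show up in the
options column of /proc/mounts.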

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 55+ messages
2012-10-10 10:07 [RFC v3 00/13] vfs: hot data tracking zwu.kernel
2012-10-10 10:07 ` [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' zwu.kernel
     [not found]   ` <5075632c.03cc440a.1b33.7805SMTPIN_ADDED@mx.google.com>
2012-10-10 12:21     ` Zhi Yong Wu
2012-10-10 12:21       ` Zhi Yong Wu
2012-10-10 13:11       ` Lukáš Czerner
2012-10-10 13:16         ` Zhi Yong Wu
2012-10-10 16:28   ` David Sterba
2012-10-11 13:41     ` Zhi Yong Wu
2012-10-11 14:35     ` Zhi Yong Wu
2012-10-11 14:41       ` David Sterba
2012-10-11 14:46         ` Zhi Yong Wu
2012-10-10 10:07 ` [RFC v3 02/13] vfs: introduce private radix tree structures zwu.kernel
2012-10-10 15:34   ` David Sterba
2012-10-11 13:35     ` Zhi Yong Wu
2012-10-10 10:07 ` [RFC v3 03/13] vfs: Initialize and free main data structures zwu.kernel
2012-10-10 10:07 ` [RFC v3 04/13] vfs: add function for collecting raw access info zwu.kernel
2012-10-10 10:07 ` [RFC v3 05/13] vfs: add two map arrays zwu.kernel
2012-10-10 10:07 ` [RFC v3 06/13] vfs: add hooks to enable hot data tracking zwu.kernel
2012-10-10 10:07 ` [RFC v3 07/13] vfs: add function for updating map arrays zwu.kernel
2012-10-10 10:07 ` [RFC v3 08/13] vfs: add aging function for old map info zwu.kernel
2012-10-10 10:07 ` [RFC v3 09/13] vfs: add one wq to update map info periodically zwu.kernel
2012-10-16  0:27   ` Dave Chinner
2012-10-17  6:34     ` Zhi Yong Wu
2012-10-18  2:25       ` Zheng Liu
2012-10-18  2:26         ` Zhi Yong Wu
2012-10-10 10:07 ` [RFC v3 10/13] vfs: register one memory shrinker zwu.kernel
2012-10-10 10:07 ` [RFC v3 11/13] vfs: add 3 new ioctl interfaces zwu.kernel
2012-10-15  7:48   ` Dave Chinner
2012-10-15  7:57     ` Zhi Yong Wu
2012-10-16  3:17   ` Dave Chinner
2012-10-16  4:18     ` Zhi Yong Wu
2012-10-19  8:21     ` Zhi Yong Wu
2012-10-10 10:07 ` [RFC v3 12/13] vfs: add debugfs support zwu.kernel
2012-10-10 16:53   ` David Sterba
2012-10-10 21:05   ` David Sterba
2012-10-15  7:55   ` Dave Chinner
2012-10-15  8:15     ` Zhi Yong Wu
2012-10-15  8:04   ` Dave Chinner
2012-10-15  8:47     ` Zhi Yong Wu
2012-10-10 10:07 ` [RFC v3 13/13] vfs: add documentation zwu.kernel
2012-10-15  0:35   ` Zheng Liu
2012-10-15  7:04     ` Zhi Yong Wu
2012-10-15  0:39 ` [RFC v3 00/13] vfs: hot data tracking Zheng Liu
2012-10-15  7:05   ` Zhi Yong Wu
2012-10-15 20:42 ` Dave Chinner
2012-10-17  8:57   ` Zhi Yong Wu
2012-10-18  4:29     ` Dave Chinner
2012-10-18  4:44       ` Zhi Yong Wu
2012-10-18  5:17         ` Dave Chinner
2012-10-18  5:24           ` Zhi Yong Wu
2012-10-19  8:29   ` Zhi Yong Wu
2012-10-16  0:04 ` [PATCH] xfs: add hot tracking support Dave Chinner
2012-11-07  8:38   ` Zhi Yong Wu
2012-11-08  5:13     ` Dave Chinner
2012-10-16  0:11 ` [RFC v3 00/13] vfs: hot data tracking Dave Chinner
