* [PATCH v6 00/11] VFS hot tracking
@ 2013-11-06 13:45 Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 01/11] VFS hot tracking: Define basic data structures and functions Zhi Yong Wu
                   ` (13 more replies)
  0 siblings, 14 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  This patchset introduces a hot tracking function in the VFS
layer, which keeps track of real disk I/O in memory. With it,
you can easily learn more details about disk I/O and detect
where the disk I/O hot spots are. Specific filesystems can also
make use of it for accurate defragmentation, hot relocation
support, etc.

  V6 is now ready for external review, and any comments or
ideas are appreciated, thanks.

NOTE:

  The patchset can be obtained via my kernel dev git on github:
git://github.com/wuzhy/kernel.git hot_tracking
  If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking

  For usage, further information and the performance report,
please see hot_tracking.txt in Documentation and the following
links:
  1.) http://lwn.net/Articles/525651/
  2.) https://lkml.org/lkml/2012/12/20/199

  This patchset has been through scalability and performance
testing with fs_mark, ffsb and compilebench.

  The perf tests were run on Linux 3.12.0-rc7 on an IBM,8231-E2C
big-endian PPC64 machine with 64 CPUs, 2 NUMA nodes, 250G RAM and a
1.50 TiB test hard disk; each test file is 20G or 100G in size.
Architecture:          ppc64
Byte Order:            Big Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    4
Core(s) per socket:    1
Socket(s):             16
NUMA node(s):          2
Model:                 IBM,8231-E2C
Hypervisor vendor:     pHyp
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     32-63

  Below is the perf testing report:

  Please focus on two key points:
  - The overall overhead introduced by the patchset
  - The stability of the perf results

1. fio tests

                            w/o hot tracking                               w/ hot tracking

RAM size                            32G          32G         16G           8G           4G           2G          250G  

sequential-8k-1jobs-read         61260KB/s    60918KB/s    60901KB/s    62610KB/s    60992KB/s    60213KB/s    60948KB/s

sequential-8k-1jobs-write         1329KB/s     1329KB/s     1328KB/s     1329KB/s     1328KB/s     1329KB/s     1329KB/s

sequential-8k-8jobs-read         91139KB/s    92614KB/s    90907KB/s    89895KB/s    92022KB/s    90851KB/s    91877KB/s

sequential-8k-8jobs-write         2523KB/s     2522KB/s     2516KB/s     2521KB/s     2516KB/s     2518KB/s     2521KB/s

sequential-256k-1jobs-read      151432KB/s   151403KB/s   151406KB/s   151422KB/s   151344KB/s   151446KB/s   151372KB/s

sequential-256k-1jobs-write      33451KB/s    33470KB/s    33481KB/s    33470KB/s    33459KB/s    33472KB/s    33477KB/s

sequential-256k-8jobs-read      235291KB/s   234555KB/s   234251KB/s   233656KB/s   234927KB/s   236380KB/s   235535KB/s

sequential-256k-8jobs-write      62419KB/s    62402KB/s    62191KB/s    62859KB/s    62629KB/s    62720KB/s    62523KB/s

random-io-mix-8k-1jobs  [READ]    2929KB/s     2942KB/s     2946KB/s     2929KB/s     2934KB/s     2947KB/s     2946KB/s
                        [WRITE]   1262KB/s     1266KB/s     1257KB/s     1262KB/s     1257KB/s     1257KB/s     1265KB/s

random-io-mix-8k-8jobs  [READ]    2444KB/s     2442KB/s     2436KB/s     2416KB/s     2353KB/s     2441KB/s     2442KB/s
                        [WRITE]   1047KB/s     1044KB/s     1047KB/s     1028KB/s     1017KB/s     1034KB/s     1049KB/s

random-io-mix-8k-16jobs [READ]    2182KB/s     2184KB/s     2169KB/s     2178KB/s     2190KB/s     2184KB/s     2180KB/s
                        [WRITE]    932KB/s      930KB/s      943KB/s      936KB/s      937KB/s      929KB/s      931KB/s

The figures above are the aggregate bandwidth of the threads in each group.
If you would like to see other perf parameters or the raw fio results, please let me know, thanks.

2. Locking stat - Contention & Cacheline Bouncing

RAM size         class name         con-bounces  contentions  acq-bounces   acquisitions   cacheline bouncing  locking contention
                                                                                                 ratio              ratio

              &(&root->t_lock)->rlock:  1508        1592         157834      374639292           0.96%              0.00%
250G          &(&root->m_lock)->rlock:  1469        1484         119221       43077842           1.23%              0.00%
              &(&he->i_lock)->rlock:       0           0         101879      376755218           0.00%              0.00%

              &(&root->t_lock)->rlock:  2912        2985         342575      374691186           0.85%              0.00%
32G           &(&root->m_lock)->rlock:   188         193         307765        8803163           0.00%              0.00%
              &(&he->i_lock)->rlock:       0           0         291860      376756084           0.00%              0.00%

              &(&root->t_lock)->rlock:  3863        3948         298041      374727038           1.30%              0.00%
16G           &(&root->m_lock)->rlock:   220         228         254451        8687057           0.00%              0.00%
              &(&he->i_lock)->rlock:       0           0         235027      376756830           0.00%              0.00%

              &(&root->t_lock)->rlock:  3283        3409         233790      374722064           1.40%              0.00%
8G            &(&root->m_lock)->rlock:   136         139         203917        8684313           0.00%              0.00%
              &(&he->i_lock)->rlock:       0           0         193746      376756438           0.00%              0.00%

              &(&root->t_lock)->rlock: 15090       15705         283460      374889666           5.32%              0.00%
4G            &(&root->m_lock)->rlock:   172         173         222480        8555052           0.00%              0.00%
              &(&he->i_lock)->rlock:       0           0         206431      376759452           0.00%              0.00%

              &(&root->t_lock)->rlock: 25515       27368         305129      375394828           8.36%              0.00%
2G            &(&root->m_lock)->rlock:   100         101         216516        6752265           0.00%              0.00%
              &(&he->i_lock)->rlock:       0           0         214713      376765169           0.00%              0.00%
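
The two ratio columns are not defined in the report. Judging from the
t_lock rows, the cacheline bouncing ratio appears to be con-bounces divided
by acq-bounces, and the locking contention ratio contentions divided by
acquisitions. The sketch below shows that arithmetic under this assumption;
the helper names ratio_percent() and print_lock_ratios() are hypothetical
and not part of the patchset.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: part/whole expressed as a percentage. */
double ratio_percent(unsigned long long part, unsigned long long whole)
{
	assert(whole != 0);
	return 100.0 * (double)part / (double)whole;
}

/*
 * Print both ratios for one lock_stat row, assuming
 *   bouncing ratio   = con-bounces / acq-bounces
 *   contention ratio = contentions / acquisitions
 */
void print_lock_ratios(const char *name,
		unsigned long long con_bounces,
		unsigned long long contentions,
		unsigned long long acq_bounces,
		unsigned long long acquisitions)
{
	printf("%s bouncing %.2f%% contention %.2f%%\n", name,
	       ratio_percent(con_bounces, acq_bounces),
	       ratio_percent(contentions, acquisitions));
}
```

For the 250G t_lock row, ratio_percent(1508, 157834) comes to a little
under 0.96, which matches the 0.96% shown in the table.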

3. Perf test - Cacheline Ping-pong

                      w/o hot tracking                                                        w/ hot tracking

RAM size                    32G                  32G                 16G                  8G                   4G                    2G                  250G  

cache-references    1,264,996,437,581    1,401,504,955,577    1,398,308,614,801    1,396,525,544,527    1,384,793,467,410    1,432,042,560,409    1,571,627,148,771

cache-misses           45,424,567,057       58,432,749,807       59,200,504,032       59,762,030,933       58,104,156,576       57,283,962,840       61,963,839,419

seconds time elapsed  22956.327674298      23035.457069488      23017.232397085      23012.397142967      23008.420970731      23057.245578767      23342.456015188

cache-misses ratio            3.591 %              4.169 %              4.234 %              4.279 %              4.196 %              4.000 %              3.943 %
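
To read the overhead off this table, each "w/ hot tracking" column can be
compared with the "w/o hot tracking" 32G baseline. The sketch below shows
the arithmetic for the relative change and for the cache-misses ratio; the
helper names overhead_percent() and miss_ratio_percent() are hypothetical
and only illustrate how the figures relate.

```c
#include <assert.h>

/* Hypothetical helper: percentage change of patched relative to base. */
double overhead_percent(double base, double patched)
{
	assert(base > 0.0);
	return 100.0 * (patched - base) / base;
}

/* Hypothetical helper: cache misses as a percentage of cache references. */
double miss_ratio_percent(double misses, double references)
{
	assert(references > 0.0);
	return 100.0 * misses / references;
}
```

For the two 32G columns, overhead_percent(22956.327674298, 23035.457069488)
gives an elapsed-time increase of roughly 0.34%, and
miss_ratio_percent(45424567057.0, 1264996437581.0) reproduces the 3.591%
baseline ratio.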

Changelog from v5:
 - Also added the hook hot_freqs_update() in the page cache I/O path,
   not only in real disk I/O path [viro]
 - Don't export the stuff until it's used by a module [viro]
 - Split hot_inode_item_lookup() [viro]
 - Prevented hot items from being re-created after the inode was unlinked. [viro]
 - Made hot_freqs_update() inline and adopted one private hot flag [viro]
 - Killed hot_bit_shift() [viro]
 - Used file_inode() instead of file->f_dentry->d_inode [viro]
 - Introduced one new file hot_tracking.h in include/uapi/linux/ [viro]
 - Made the checks for ->i_nlink protected by ->i_mutex [viro]

v5:
 - Added all kinds of perf testing reports [viro]
 - Covered mmap() now [viro]
 - Removed list_sort() in hot_update_worker() to avoid locking contention
   and cacheline bouncing [viro]
 - Removed a /proc interface to control low memory usage [Chandra]
 - Adjusted shrinker support due to the change of public shrinker APIs [zwu]
 - Fixed the missing locking when hot_inode_item_put() is called
   in ioctl_heat_info() [viro]
 - Fixed some locking contention issues [zwu]

v4:
 - Removed debugfs support, but left it on the TODO list [viro, Chandra]
 - Killed HOT_DELETING and HOT_IN_LIST flags [viro]
 - Fixed unlink issues [viro]
 - Fixed the leak of lookups (both for inode and range)
   on a race with unlink [viro]
 - Killed hot_comm_item and split the functions which take it [viro]
 - Fixed some other issues [zwu, Chandra]

v3:
 - Added memory capping function for hot items [Zhiyong]
 - Cleaned up aging function [Zhiyong]

v2:
 - Refactored to be under RCU [Chandra Seetharaman]
 - Merged some code changes [Chandra Seetharaman]
 - Fixed some issues [Chandra Seetharaman]

v1:
 - Solved 64 bits inode number issue. [David Sterba]
 - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
 - Cleaned up some issues [David Sterba]
 - Use a static hot debugfs root [Greg KH]

rfcv4:
 - Introduce hot func registering framework [Zhiyong]
 - Remove global variable for hot tracking [Zhiyong]
 - Add btrfs hot tracking support [Zhiyong]

rfcv3:
 1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
 2.) Refactored workqueue support. [Dave Chinner]
 3.) Made some macros tunable [Zhiyong, Liu Zheng]:
     TIME_TO_KICK and HEAT_UPDATE_DELAY
 4.) Cleaned up a lot of other issues [Dave Chinner]


rfcv2:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
                                        [Marco Stornelli, Dave Chinner]


Dave Chinner (1):
  VFS hot tracking, xfs: Add hot tracking support

Zhi Yong Wu (10):
  VFS hot tracking: Define basic data structures and functions
  VFS hot tracking: Track IO and record heat information
  VFS hot tracking: Add a workqueue to move items between hot maps
  VFS hot tracking: Add shrinker functionality to curtail memory usage
  VFS hot tracking: Add an ioctl to get hot tracking information
  VFS hot tracking: Add a /proc interface to make the interval tunable
  VFS hot tracking: Add a /proc interface to control memory usage
  VFS hot tracking: Add documentation
  VFS hot tracking, btrfs: Add hot tracking support
  MAINTAINERS: add the maintainers for VFS hot tracking

 Documentation/filesystems/00-INDEX         |   2 +
 Documentation/filesystems/hot_tracking.txt | 207 ++++++++
 MAINTAINERS                                |  12 +
 fs/Makefile                                |   2 +-
 fs/btrfs/ctree.h                           |   1 +
 fs/btrfs/super.c                           |  22 +-
 fs/compat_ioctl.c                          |   5 +
 fs/dcache.c                                |   2 +
 fs/hot_tracking.c                          | 816 +++++++++++++++++++++++++++++
 fs/hot_tracking.h                          |  72 +++
 fs/ioctl.c                                 |  71 +++
 fs/namei.c                                 |   4 +
 fs/xfs/xfs_mount.h                         |   1 +
 fs/xfs/xfs_super.c                         |  18 +
 include/linux/fs.h                         |   4 +
 include/linux/hot_tracking.h               | 107 ++++
 include/uapi/linux/fs.h                    |   1 +
 include/uapi/linux/hot_tracking.h          |  33 ++
 kernel/sysctl.c                            |  14 +
 mm/filemap.c                               |  24 +-
 mm/readahead.c                             |   6 +
 21 files changed, 1420 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h
 create mode 100644 include/uapi/linux/hot_tracking.h

-- 
1.7.11.7



* [PATCH v6 01/11] VFS hot tracking: Define basic data structures and functions
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 02/11] VFS hot tracking: Track IO and record heat information Zhi Yong Wu
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This patch adds the basic data structures and functions needed for
VFS hot tracking.

It adds a hot_inode_tree struct, keyed by {inode, offset}, to keep track
of frequently accessed files. The trees contain hot_inode_items
representing those files and hot_range_items representing ranges within
each file.

It defines a data structure hot_info, which is associated with a mounted
filesystem, and will be used to store the inode tree and range tree for
hot items pertaining to that filesystem.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/Makefile                  |   2 +-
 fs/dcache.c                  |   2 +
 fs/hot_tracking.c            | 227 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |  23 +++++
 include/linux/fs.h           |   4 +
 include/linux/hot_tracking.h |  66 +++++++++++++
 include/uapi/linux/fs.h      |   1 +
 7 files changed, 324 insertions(+), 1 deletion(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..5f9b8f1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o statfs.o
+		stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index ae6ebb8..40dfd63 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -40,6 +40,7 @@
 #include <linux/list_lru.h>
 #include "internal.h"
 #include "mount.h"
+#include "hot_tracking.h"
 
 /*
  * Usage:
@@ -3437,4 +3438,5 @@ void __init vfs_caches_init(unsigned long mempages)
 	mnt_init();
 	bdev_cache_init();
 	chrdev_init();
+	hot_cache_init();
 }
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 0000000..25e7858
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,227 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/spinlock.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+static void hot_range_item_init(struct hot_range_item *hr,
+			struct hot_inode_item *he, loff_t start)
+{
+	kref_init(&hr->refs);
+	hr->start = start;
+	hr->len = 1 << RANGE_BITS;
+	hr->hot_inode = he;
+}
+
+static void hot_range_item_free_cb(struct rcu_head *head)
+{
+	struct hot_range_item *hr = container_of(head,
+				struct hot_range_item, rcu);
+
+	kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+static void hot_range_item_free(struct kref *kref)
+{
+	struct hot_range_item *hr = container_of(kref,
+				struct hot_range_item, refs);
+
+	rb_erase(&hr->rb_node, &hr->hot_inode->hot_range_tree);
+
+	call_rcu(&hr->rcu, hot_range_item_free_cb);
+}
+
+static void hot_range_item_get(struct hot_range_item *hr)
+{
+        kref_get(&hr->refs);
+}
+
+/*
+ * Drops the reference out on hot_range_item by one
+ * and free the structure if the reference count hits zero
+ */
+static void hot_range_item_put(struct hot_range_item *hr)
+{
+        kref_put(&hr->refs, hot_range_item_free);
+}
+
+/*
+ * Free the entire hot_range_tree.
+ */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+	struct rb_node *node;
+	struct hot_range_item *hr;
+
+	/* Free hot inode and range trees on fs root */
+	spin_lock(&he->i_lock);
+	node = rb_first(&he->hot_range_tree);
+	while (node) {
+		hr = rb_entry(node, struct hot_range_item, rb_node);
+		node = rb_next(node);
+		hot_range_item_put(hr);
+	}
+	spin_unlock(&he->i_lock);
+}
+
+static void hot_inode_item_init(struct hot_inode_item *he,
+			struct hot_info *root, u64 ino)
+{
+	kref_init(&he->refs);
+	he->ino = ino;
+	he->hot_root = root;
+	spin_lock_init(&he->i_lock);
+}
+
+static void hot_inode_item_free_cb(struct rcu_head *head)
+{
+	struct hot_inode_item *he = container_of(head,
+				struct hot_inode_item, rcu);
+
+	kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+	struct hot_inode_item *he = container_of(kref,
+				struct hot_inode_item, refs);
+
+	rb_erase(&he->rb_node, &he->hot_root->hot_inode_tree);
+	hot_range_tree_free(he);
+
+	call_rcu(&he->rcu, hot_inode_item_free_cb);
+}
+
+static void hot_inode_item_get(struct hot_inode_item *he)
+{
+        kref_get(&he->refs);
+}
+
+/*
+ * Drops the reference out on hot_inode_item by one
+ * and free the structure if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+        kref_put(&he->refs, hot_inode_item_free);
+}
+
+/*
+ * Initialize kmem cache for hot_inode_item and hot_range_item.
+ */
+void __init hot_cache_init(void)
+{
+	hot_inode_item_cachep = KMEM_CACHE(hot_inode_item,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD);
+	if (!hot_inode_item_cachep)
+		return;
+
+	hot_range_item_cachep = KMEM_CACHE(hot_range_item,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD);
+	if (!hot_range_item_cachep)
+		kmem_cache_destroy(hot_inode_item_cachep);
+}
+
+static struct hot_info *hot_tree_init(struct super_block *sb)
+{
+	struct hot_info *root;
+	int i, j;
+
+	root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
+	if (!root) {
+		printk(KERN_ERR "%s: Failed to malloc memory for "
+				"hot_info\n", __func__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	root->hot_inode_tree = RB_ROOT;
+	spin_lock_init(&root->t_lock);
+
+	return root;
+}
+
+/*
+ * Frees the entire hot tree.
+ */
+static void hot_tree_exit(struct hot_info *root)
+{
+	struct hot_inode_item *he;
+	struct rb_node *node;
+
+	spin_lock(&root->t_lock);
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		he = rb_entry(node, struct hot_inode_item, rb_node);
+		node = rb_next(node);
+		hot_inode_item_put(he);
+	}
+	spin_unlock(&root->t_lock);
+}
+
+/*
+ * Initialize the data structures for hot tracking.
+ * This function will be called by *_fill_super()
+ * when filesystem is mounted.
+ */
+int hot_track_init(struct super_block *sb)
+{
+	struct hot_info *root;
+	int ret = 0;
+
+	if (!hot_inode_item_cachep || !hot_range_item_cachep) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	root = hot_tree_init(sb);
+	if (IS_ERR(root)) {
+		ret = PTR_ERR(root);
+		goto err;
+	}
+
+	sb->s_hot_root = root;
+	sb->s_flags |= MS_HOTTRACK;
+
+	printk(KERN_INFO "VFS: Turning on hot tracking\n");
+
+	return ret;
+
+err:
+	sb->s_hot_root = NULL;
+
+	printk(KERN_ERR "VFS: Fail to turn on hot tracking\n");
+
+	return ret;
+}
+EXPORT_SYMBOL(hot_track_init);
+
+/*
+ * This function will be called by *_put_super()
+ * when filesystem is umounted, or also by *_fill_super()
+ * in some exceptional cases.
+ */
+void hot_track_exit(struct super_block *sb)
+{
+	struct hot_info *root = sb->s_hot_root;
+
+	sb->s_hot_root = NULL;
+	sb->s_flags &= ~MS_HOTTRACK;
+	hot_tree_exit(root);
+	rcu_barrier();
+	kfree(root);
+}
+EXPORT_SYMBOL(hot_track_exit);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
new file mode 100644
index 0000000..51d829e
--- /dev/null
+++ b/fs/hot_tracking.h
@@ -0,0 +1,23 @@
+/*
+ * fs/hot_tracking.h
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_TRACKING__
+#define __HOT_TRACKING__
+
+#include <linux/hot_tracking.h>
+
+/* size of sub-file ranges */
+#define RANGE_BITS 20
+
+void __init hot_cache_init(void);
+void hot_inode_item_put(struct hot_inode_item *he);
+
+#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f40547..8c8c40d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -29,6 +29,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/hot_tracking.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1324,6 +1325,9 @@ struct super_block {
 	/* AIO completions deferred from interrupt context */
 	struct workqueue_struct *s_dio_done_wq;
 
+	/* Hot data tracking*/
+	struct hot_info *s_hot_root;
+
 	/*
 	 * Keep the lru lists last in the structure so they always sit on their
 	 * own individual cachelines.
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
new file mode 100644
index 0000000..91633db
--- /dev/null
+++ b/include/linux/hot_tracking.h
@@ -0,0 +1,66 @@
+/*
+ *  include/linux/hot_tracking.h
+ *
+ * This file has definitions for VFS hot tracking
+ * structures etc.
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_HOTTRACK_H
+#define _LINUX_HOTTRACK_H
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/fs.h>
+
+#define MAP_BITS 8
+#define MAP_SIZE (1 << MAP_BITS)
+
+/* values for hot_freq flags */
+enum {
+	TYPE_INODE = 0,
+	TYPE_RANGE,
+	MAX_TYPES,
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+	struct kref refs;
+	struct rb_node rb_node;         /* rbtree index */
+	struct rcu_head rcu;
+	struct rb_root hot_range_tree;	/* tree of ranges */
+	spinlock_t i_lock;		/* protect above tree */
+	struct hot_info *hot_root;	/* associated hot_info */
+	u64 ino;			/* inode number from inode */
+};
+
+/*
+ * An item representing a range inside of
+ * an inode whose frequency is being tracked
+ */
+struct hot_range_item {
+	struct kref refs;
+	struct rb_node rb_node;                 /* rbtree index */
+	struct rcu_head rcu;
+	struct hot_inode_item *hot_inode;	/* associated hot_inode_item */
+	loff_t start;				/* offset in bytes */
+	size_t len;				/* length in bytes */
+};
+
+struct hot_info {
+	struct rb_root hot_inode_tree;
+	spinlock_t t_lock;				/* protect above tree */
+};
+
+extern int hot_track_init(struct super_block *sb);
+extern void hot_track_exit(struct super_block *sb);
+
+#endif  /* _LINUX_HOTTRACK_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 6c28b61..d105d8d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -86,6 +86,7 @@ struct inodes_stat_t {
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
+#define MS_HOTTRACK	(1<<25) /* Enable VFS hot tracking */
 
 /* These sb flags are internal to the kernel */
 #define MS_NOSEC	(1<<28)
-- 
1.7.11.7



* [PATCH v6 02/11] VFS hot tracking: Track IO and record heat information
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 01/11] VFS hot tracking: Define basic data structures and functions Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 03/11] VFS hot tracking: Add a workqueue to move items between hot maps Zhi Yong Wu
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This patch hooks the read/write code paths, including read_pages(),
do_writepages(), do_generic_file_read() and __blockdev_direct_IO(),
to record heat information.

When real disk I/O is done for an inode, its hot_inode_item is created
or updated in the per-filesystem RB tree, and the I/O frequency of each
of its extents is likewise created or updated in the per-inode RB tree.

Each of the two structures hot_inode_item and hot_range_item
contains a hot_freq_data struct with its access frequency metrics
(number of {reads, writes}, last {read,write} time, frequency of
{reads,writes}).
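
The frequency metric is maintained by hot_freq_calc() in the diff below
with shift arithmetic. The following user-space sketch reproduces that
update on plain nanosecond counts. FREQ_POWER is not defined in this part
of the series, so the value 4 here is an assumption, and the helper name
avg_delta_update() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the real FREQ_POWER is defined outside this excerpt. */
#define FREQ_POWER 4

/*
 * Fold one inter-access gap (in nanoseconds) into the running average,
 * using the same shifts as hot_freq_calc():
 *   new_delta = delta >> FREQ_POWER
 *   avg       = ((avg << FREQ_POWER) - avg + new_delta) >> FREQ_POWER
 * Each call keeps (2^FREQ_POWER - 1)/2^FREQ_POWER of the old value and
 * adds 1/2^FREQ_POWER of the scaled new gap.
 */
uint64_t avg_delta_update(uint64_t avg, uint64_t delta_ns)
{
	uint64_t new_delta = delta_ns >> FREQ_POWER;

	avg = (avg << FREQ_POWER) - avg + new_delta;
	return avg >> FREQ_POWER;
}
```

With a constant gap the value settles at delta_ns >> FREQ_POWER: under the
assumed FREQ_POWER of 4, avg_delta_update(1600, 25600) stays at 1600, while
a larger starting value such as 3200 moves towards it.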

Each hot_inode_item contains one hot_range_tree struct which is keyed by
{inode, offset, length} and used to keep track of all the ranges in this file.
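
As a rough illustration of how one I/O request maps onto range items, the
sketch below mirrors the shift arithmetic of hot_freqs_update() in this
patch, with RANGE_BITS taken from fs/hot_tracking.h (1 MiB ranges). It runs
in user space with uint64_t in place of loff_t, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_BITS 20	/* size of sub-file ranges, as in fs/hot_tracking.h */

/* Index of the first range touched by an I/O starting at byte `start`. */
uint64_t first_range(uint64_t start)
{
	return start >> RANGE_BITS;
}

/* One past the index of the last range touched by [start, start + len). */
uint64_t end_range(uint64_t start, uint64_t len)
{
	uint64_t range_size = UINT64_C(1) << RANGE_BITS;

	assert(len > 0);
	return (start + len + range_size - 1) >> RANGE_BITS;
}

/* Byte offset at which range index `idx` begins. */
uint64_t range_start(uint64_t idx)
{
	return idx << RANGE_BITS;
}
```

A 2-byte access that straddles the first 1 MiB boundary (start 1048575,
len 2) therefore touches range indices 0 and 1, so two hot_range_items
would be looked up or allocated for it.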

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 259 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   3 +
 fs/namei.c                   |   4 +
 include/linux/hot_tracking.h |  19 ++++
 mm/filemap.c                 |  24 +++-
 mm/readahead.c               |   6 +
 6 files changed, 313 insertions(+), 2 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 25e7858..d68c458 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -22,6 +22,8 @@ static void hot_range_item_init(struct hot_range_item *hr,
 			struct hot_inode_item *he, loff_t start)
 {
 	kref_init(&hr->refs);
+	hr->freq.avg_delta_reads = (u64) -1;
+	hr->freq.avg_delta_writes = (u64) -1;
 	hr->start = start;
 	hr->len = 1 << RANGE_BITS;
 	hr->hot_inode = he;
@@ -59,6 +61,62 @@ static void hot_range_item_put(struct hot_range_item *hr)
         kref_put(&hr->refs, hot_range_item_free);
 }
 
+static struct hot_range_item
+*hot_range_item_alloc(struct hot_inode_item *he, loff_t start)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct hot_range_item *hr, *hr_new = NULL;
+
+	start = start << RANGE_BITS;
+
+	/* walk tree to find insertion point */
+redo:
+	spin_lock(&he->i_lock);
+	p = &he->hot_range_tree.rb_node;
+	while (*p) {
+		parent = *p;
+		hr = rb_entry(parent, struct hot_range_item, rb_node);
+		if (start < hr->start)
+			p = &(*p)->rb_left;
+		else if (start > (hr->start + hr->len - 1))
+			p = &(*p)->rb_right;
+		else {
+			hot_range_item_get(hr);
+			if (hr_new) {
+				/*
+				 * Lost the race. Somebody else inserted
+				 * the item for the range. Free the
+				 * newly allocated item.
+				 */
+				kmem_cache_free(hot_range_item_cachep, hr_new);
+			}
+			spin_unlock(&he->i_lock);
+
+			return hr;
+		}
+	}
+
+	if (hr_new) {
+		rb_link_node(&hr_new->rb_node, parent, p);
+		rb_insert_color(&hr_new->rb_node, &he->hot_range_tree);
+		hot_range_item_get(hr_new); /* For the caller */
+		spin_unlock(&he->i_lock);
+		return hr_new;
+	}
+        spin_unlock(&he->i_lock);
+
+	hr_new = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS);
+	if (!hr_new)
+		return ERR_PTR(-ENOMEM);
+
+	hot_range_item_init(hr_new, he, start);
+
+	cond_resched();
+
+	goto redo;
+}
+
 /*
  * Free the entire hot_range_tree.
  */
@@ -82,6 +140,8 @@ static void hot_inode_item_init(struct hot_inode_item *he,
 			struct hot_info *root, u64 ino)
 {
 	kref_init(&he->refs);
+	he->freq.avg_delta_reads = (u64) -1;
+	he->freq.avg_delta_writes = (u64) -1;
 	he->ino = ino;
 	he->hot_root = root;
 	spin_lock_init(&he->i_lock);
@@ -120,6 +180,153 @@ void hot_inode_item_put(struct hot_inode_item *he)
         kref_put(&he->refs, hot_inode_item_free);
 }
 
+static struct hot_inode_item
+*hot_inode_item_alloc(struct hot_info *root, u64 ino)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *he, *he_new = NULL;
+
+	/* walk tree to find insertion point */
+redo:
+	spin_lock(&root->t_lock);
+	p = &root->hot_inode_tree.rb_node;
+	while (*p) {
+		parent = *p;
+		he = rb_entry(parent, struct hot_inode_item, rb_node);
+		if (ino < he->ino)
+			p = &(*p)->rb_left;
+		else if (ino > he->ino)
+			p = &(*p)->rb_right;
+		else {
+			hot_inode_item_get(he);
+			if (he_new) {
+				/*
+				 * Lost the race. Somebody else inserted
+				 * the item for the inode. Free the
+				 * newly allocated item.
+				 */
+				kmem_cache_free(hot_inode_item_cachep, he_new);
+			}
+			spin_unlock(&root->t_lock);
+
+			return he;
+		}
+	}
+
+	if (he_new) {
+		rb_link_node(&he_new->rb_node, parent, p);
+		rb_insert_color(&he_new->rb_node, &root->hot_inode_tree);
+		hot_inode_item_get(he_new); /* For the caller */
+		spin_unlock(&root->t_lock);
+		return he_new;
+	}
+	spin_unlock(&root->t_lock);
+
+	he_new = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS);
+	if (!he_new)
+		return ERR_PTR(-ENOMEM);
+
+	hot_inode_item_init(he_new, root, ino);
+
+	cond_resched();
+
+	goto redo;
+}
+
+struct hot_inode_item
+*hot_inode_item_lookup(struct hot_info *root, u64 ino)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *he;
+
+	/* walk tree to find insertion point */
+	spin_lock(&root->t_lock);
+	p = &root->hot_inode_tree.rb_node;
+	while (*p) {
+		parent = *p;
+		he = rb_entry(parent, struct hot_inode_item, rb_node);
+		if (ino < he->ino)
+			p = &(*p)->rb_left;
+		else if (ino > he->ino)
+			p = &(*p)->rb_right;
+		else {
+			hot_inode_item_get(he);
+			spin_unlock(&root->t_lock);
+
+			return he;
+		}
+	}
+	spin_unlock(&root->t_lock);
+
+	return ERR_PTR(-ENOENT);
+}
+
+void hot_inode_item_unlink(struct inode *inode)
+{
+	struct hot_info *root = inode->i_sb->s_hot_root;
+	struct hot_inode_item *he;
+
+	if (!(inode->i_sb->s_flags & MS_HOTTRACK)
+		|| !S_ISREG(inode->i_mode))
+		return;
+
+	he = hot_inode_item_lookup(root, inode->i_ino);
+	if (IS_ERR(he))
+                return;
+
+	spin_lock(&root->t_lock);
+	hot_inode_item_put(he);
+	hot_inode_item_put(he); /* For the caller */
+	spin_unlock(&root->t_lock);
+}
+
+/*
+ * This function does the actual work of updating
+ * the frequency numbers.
+ *
+ * avg_delta_{reads,writes} are indeed a kind of simple moving
+ * average of the time difference between each of the last
+ * 2^(FREQ_POWER) reads/writes. If there have not yet been that
+ * many reads or writes, it's likely that the values will be very
+ * large; They are initialized to the largest possible value for the
+ * data type. Simply, we don't want a few fast access to a file to
+ * automatically make it appear very hot.
+ */
+static void hot_freq_calc(struct timespec old_atime,
+		struct timespec cur_time, u64 *avg)
+{
+	struct timespec delta_ts;
+	u64 new_delta;
+
+	delta_ts = timespec_sub(cur_time, old_atime);
+	new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+	*avg = (*avg << FREQ_POWER) - *avg + new_delta;
+	*avg = *avg >> FREQ_POWER;
+}
+
+static void hot_freq_update(struct hot_info *root,
+		struct hot_freq *freq, bool write)
+{
+	struct timespec cur_time = current_kernel_time();
+
+	if (write) {
+		freq->nr_writes += 1;
+		hot_freq_calc(freq->last_write_time,
+				cur_time,
+				&freq->avg_delta_writes);
+		freq->last_write_time = cur_time;
+	} else {
+		freq->nr_reads += 1;
+		hot_freq_calc(freq->last_read_time,
+				cur_time,
+				&freq->avg_delta_reads);
+		freq->last_read_time = cur_time;
+	}
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -136,6 +343,58 @@ void __init hot_cache_init(void)
 		kmem_cache_destroy(hot_inode_item_cachep);
 }
 
+/*
+ * Main function to update i/o access frequencies, and it will be called
+ * from read/writepages() hooks, which are read_pages(), do_writepages(),
+ * do_generic_file_read(), and __blockdev_direct_IO().
+ */
+inline void hot_freqs_update(struct inode *inode, loff_t start,
+			size_t len, int rw)
+{
+	struct hot_info *root = inode->i_sb->s_hot_root;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	u64 range_size;
+	loff_t cur, end;
+
+	if (!(inode->i_sb->s_flags & MS_HOTTRACK) || (len == 0)
+		|| !S_ISREG(inode->i_mode) || !inode->i_nlink)
+		return;
+
+	he = hot_inode_item_alloc(root, inode->i_ino);
+	if (IS_ERR(he))
+		return;
+
+	hot_freq_update(root, &he->freq, rw);
+
+	/*
+	 * Align ranges on range size boundary
+	 * to prevent proliferation of range structs
+	 */
+	range_size  = 1 << RANGE_BITS;
+	end = (start + len + range_size - 1) >> RANGE_BITS;
+	cur = start >> RANGE_BITS;
+	for (; cur < end; cur++) {
+		hr = hot_range_item_alloc(he, cur);
+		if (IS_ERR(hr)) {
+			WARN(1, "hot_range_item_alloc returns %ld\n",
+				PTR_ERR(hr));
+			return;
+		}
+
+		hot_freq_update(root, &hr->freq, rw);
+
+		spin_lock(&he->i_lock);
+		hot_range_item_put(hr);
+		spin_unlock(&he->i_lock);
+	}
+
+	spin_lock(&root->t_lock);
+	hot_inode_item_put(he);
+	spin_unlock(&root->t_lock);
+}
+EXPORT_SYMBOL(hot_freqs_update);
+
 static struct hot_info *hot_tree_init(struct super_block *sb)
 {
 	struct hot_info *root;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 51d829e..56ab13c 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -16,8 +16,11 @@
 
 /* size of sub-file ranges */
 #define RANGE_BITS 20
+#define FREQ_POWER 4
 
 void __init hot_cache_init(void);
 void hot_inode_item_put(struct hot_inode_item *he);
+struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root, u64 ino);
+void hot_inode_item_unlink(struct inode *inode);
 
 #endif /* __HOT_TRACKING__ */
diff --git a/fs/namei.c b/fs/namei.c
index caa2805..e50af1e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -38,6 +38,7 @@
 
 #include "internal.h"
 #include "mount.h"
+#include "hot_tracking.h"
 
 /* [Feb-1997 T. Schoebel-Theuer]
  * Fundamental changes in the pathname lookup mechanisms (namei)
@@ -3668,6 +3669,9 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
 				dont_mount(dentry);
 		}
 	}
+
+	if (!error && !dentry->d_inode->i_nlink)
+		hot_inode_item_unlink(dentry->d_inode);
 	mutex_unlock(&dentry->d_inode->i_mutex);
 
 	/* We don't d_delete() NFS sillyrenamed files--they still exist. */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 91633db..5f02025 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -31,8 +31,24 @@ enum {
 	MAX_TYPES,
 };
 
+/*
+ * A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */
+struct hot_freq {
+	struct timespec last_read_time;
+	struct timespec last_write_time;
+	u32 nr_reads;
+	u32 nr_writes;
+	u64 avg_delta_reads;
+	u64 avg_delta_writes;
+	u32 last_temp;
+};
+
 /* An item representing an inode and its access frequency */
 struct hot_inode_item {
+	struct hot_freq freq;           /* frequency data */
 	struct kref refs;
 	struct rb_node rb_node;         /* rbtree index */
 	struct rcu_head rcu;
@@ -47,6 +63,7 @@ struct hot_inode_item {
  * an inode whose frequency is being tracked
  */
 struct hot_range_item {
+	struct hot_freq freq;                   /* frequency data */
 	struct kref refs;
 	struct rb_node rb_node;                 /* rbtree index */
 	struct rcu_head rcu;
@@ -62,5 +79,7 @@ struct hot_info {
 
 extern int hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
+extern void hot_freqs_update(struct inode *inode, loff_t start,
+			size_t len, int rw);
 
 #endif  /* _LINUX_HOTTRACK_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index ae4846f..d70939d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1196,6 +1197,10 @@ page_ok:
 			mark_page_accessed(page);
 		prev_index = index;
 
+		/* Hot tracking */
+		hot_freqs_update(inode, page->index << PAGE_CACHE_SHIFT,
+				PAGE_CACHE_SIZE, 0);
+
 		/*
 		 * Ok, we have the page, and it's up-to-date, so
 		 * now we can copy it to user space...
@@ -1514,9 +1519,13 @@ static int page_cache_read(struct file *file, pgoff_t offset)
 			return -ENOMEM;
 
 		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
-		if (ret == 0)
+		if (ret == 0) {
+			/* Hot tracking */
+			hot_freqs_update(mapping->host,
+					page->index << PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE, 0);
 			ret = mapping->a_ops->readpage(file, page);
-		else if (ret == -EEXIST)
+		} else if (ret == -EEXIST)
 			ret = 0; /* losing race to add is OK */
 
 		page_cache_release(page);
@@ -1711,6 +1720,11 @@ page_not_uptodate:
 	 * and we need to check for errors.
 	 */
 	ClearPageError(page);
+
+	/* Hot tracking */
+	hot_freqs_update(inode, page->index << PAGE_CACHE_SHIFT,
+			PAGE_CACHE_SIZE, 0);
+
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {
 		wait_on_page_locked(page);
@@ -2249,6 +2263,9 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 	}
 
 	if (written > 0) {
+		/* Hot tracking */
+		hot_freqs_update(inode, pos, written, 1);
+
 		pos += written;
 		if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
 			i_size_write(inode, pos);
@@ -2404,6 +2421,9 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 	status = generic_perform_write(file, &i, pos);
 
 	if (likely(status >= 0)) {
+		/* Hot tracking */
+		hot_freqs_update(file_inode(file), pos, status, 1);
+
 		written += status;
 		*ppos = pos + status;
   	}
diff --git a/mm/readahead.c b/mm/readahead.c
index e4ed041..51f0e88 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/hot_tracking.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -115,6 +116,11 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	unsigned page_idx;
 	int ret;
 
+	/* Hot tracking */
+	hot_freqs_update(mapping->host,
+			list_to_page(pages)->index << PAGE_CACHE_SHIFT,
+			(size_t)nr_pages * PAGE_CACHE_SIZE, 0);
+
 	blk_start_plug(&plug);
 
 	if (mapping->a_ops->readpages) {
-- 
1.7.11.7
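
A note on the averaging described in the hot_freq_calc() comment above:
avg_delta_{reads,writes} is meant to behave like a moving average over
roughly the last 2^FREQ_POWER accesses. The sketch below shows one
common, overflow-safe way to write such an average in userspace C. It is
an illustration only, not the arithmetic used in the patch; the name
toy_avg_update() is hypothetical, and only FREQ_POWER is taken from
fs/hot_tracking.h.

```c
#include <stdint.h>

#define FREQ_POWER 4	/* same value as in fs/hot_tracking.h */

/*
 * Hypothetical helper: exponential moving average with weight
 * 1/2^FREQ_POWER for the new sample. Written as
 * avg - avg/16 + delta/16 so that it cannot overflow even when
 * avg starts out at UINT64_MAX.
 */
static inline uint64_t toy_avg_update(uint64_t avg, uint64_t delta_ns)
{
	return avg - (avg >> FREQ_POWER) + (delta_ns >> FREQ_POWER);
}
```

Starting from the "largest possible value" initialisation, a single fast
access only pulls the average down by one sixteenth, which matches the
stated goal that a few fast accesses should not make a file look hot.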



* [PATCH v6 03/11] VFS hot tracking: Add a workqueue to move items between hot maps
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 01/11] VFS hot tracking: Define basic data structures and functions Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 02/11] VFS hot tracking: Track IO and record heat information Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 04/11] VFS hot tracking: Add shrinker functionality to curtail memory usage Zhi Yong Wu
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add a workqueue per superblock and a delayed_work
to run periodic work to update map info on each superblock.

Two arrays of map lists are defined: one for hot inode
items and the other for hot range items.

Each hot item in the RB-tree is first distilled into one
temperature in the range [0, 255]. It is then linked into
the map list array entry that uses this temperature as
its index.
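
As a rough illustration of the indexing described above, the following
userspace sketch shows how a 32-bit temperature can be reduced to one of
256 map slots by keeping its top MAP_BITS bits, which is the expression
the diff below uses. The helper names temp_to_bucket() and
bucket_changed() are hypothetical; MAP_BITS is taken from
include/linux/hot_tracking.h.

```c
#include <stdint.h>

#define MAP_BITS 8			/* as in include/linux/hot_tracking.h */
#define MAP_SIZE (1u << MAP_BITS)	/* 256 lists per item type */

/* Keep only the top MAP_BITS bits of a 32-bit temperature. */
static inline uint8_t temp_to_bucket(uint32_t temp)
{
	return (uint8_t)(temp >> (32 - MAP_BITS));
}

/*
 * Hypothetical helper: an item only needs to move to another list
 * when its bucket index differs from the previously recorded one.
 */
static inline int bucket_changed(uint32_t last_temp, uint32_t cur_temp)
{
	return temp_to_bucket(last_temp) != temp_to_bucket(cur_temp);
}
```

Small temperature changes that stay within the same top byte leave the
item on its current list, so the periodic work only relinks items whose
bucket actually changed.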

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 208 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |  25 ++++++
 include/linux/hot_tracking.h |   8 +-
 3 files changed, 240 insertions(+), 1 deletion(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index d68c458..35d3b83 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -12,6 +12,7 @@
 #include <linux/list.h>
 #include <linux/err.h>
 #include <linux/spinlock.h>
+#include <linux/sched.h>
 #include "hot_tracking.h"
 
 /* kmem_cache pointers for slab caches */
@@ -22,6 +23,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 			struct hot_inode_item *he, loff_t start)
 {
 	kref_init(&hr->refs);
+	INIT_LIST_HEAD(&hr->track_list);
 	hr->freq.avg_delta_reads = (u64) -1;
 	hr->freq.avg_delta_writes = (u64) -1;
 	hr->start = start;
@@ -41,8 +43,13 @@ static void hot_range_item_free(struct kref *kref)
 {
 	struct hot_range_item *hr = container_of(kref,
 				struct hot_range_item, refs);
+	struct hot_info *root = hr->hot_inode->hot_root;
 
 	rb_erase(&hr->rb_node, &hr->hot_inode->hot_range_tree);
+	spin_lock(&root->m_lock);
+	if (!list_empty(&hr->track_list))
+		list_del_init(&hr->track_list);
+	spin_unlock(&root->m_lock);
 
 	call_rcu(&hr->rcu, hot_range_item_free_cb);
 }
@@ -67,6 +74,8 @@ static struct hot_range_item
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct hot_range_item *hr, *hr_new = NULL;
+	u32 temp;
+	u8 temp_cur;
 
 	start = start << RANGE_BITS;
 
@@ -100,6 +109,12 @@ redo:
 	if (hr_new) {
 		rb_link_node(&hr_new->rb_node, parent, p);
 		rb_insert_color(&hr_new->rb_node, &he->hot_range_tree);
+		temp = hot_temp_calc(&hr_new->freq);
+		temp_cur = (u8)(temp >> (32 - MAP_BITS));
+		spin_lock(&he->hot_root->m_lock);
+		list_add_tail(&hr_new->track_list,
+			&he->hot_root->hot_map[TYPE_RANGE][temp_cur]);
+		spin_unlock(&he->hot_root->m_lock);
 		hot_range_item_get(hr_new); /* For the caller */
 		spin_unlock(&he->i_lock);
 		return hr_new;
@@ -136,10 +151,49 @@ static void hot_range_tree_free(struct hot_inode_item *he)
 	spin_unlock(&he->i_lock);
 }
 
+static void hot_range_map_update(struct hot_info *root,
+			struct hot_range_item *hr)
+{
+	u32 temp = hot_temp_calc(&hr->freq);
+	u8 temp_cur = (u8)(temp >> (32 - MAP_BITS));
+	u8 temp_prev = (u8)(hr->freq.last_temp >> (32 - MAP_BITS));
+
+	spin_lock(&root->m_lock);
+	if (!list_empty(&hr->track_list)
+		&& (temp_cur != temp_prev)) {
+		hr->freq.last_temp = temp;
+		list_del_init(&hr->track_list);
+		list_add_tail(&hr->track_list,
+			&root->hot_map[TYPE_RANGE][temp_cur]);
+	}
+	spin_unlock(&root->m_lock);
+}
+
+/*
+ * Update temperatures for each range item for aging purposes.
+ * If one hot range item is old, it will be aged out.
+ */
+static void hot_range_tree_update(struct hot_inode_item *he,
+				struct hot_info *root)
+{
+	struct rb_node *node;
+	struct hot_range_item *hr;
+
+	rcu_read_lock();
+	node = rb_first(&he->hot_range_tree);
+	while (node) {
+		hr = rb_entry(node, struct hot_range_item, rb_node);
+		node = rb_next(node);
+		hot_range_map_update(root, hr);
+	}
+	rcu_read_unlock();
+}
+
 static void hot_inode_item_init(struct hot_inode_item *he,
 			struct hot_info *root, u64 ino)
 {
 	kref_init(&he->refs);
+	INIT_LIST_HEAD(&he->track_list);
 	he->freq.avg_delta_reads = (u64) -1;
 	he->freq.avg_delta_writes = (u64) -1;
 	he->ino = ino;
@@ -161,6 +215,8 @@ static void hot_inode_item_free(struct kref *kref)
 				struct hot_inode_item, refs);
 
 	rb_erase(&he->rb_node, &he->hot_root->hot_inode_tree);
+	if (!list_empty(&he->track_list))
+		list_del_init(&he->track_list);
 	hot_range_tree_free(he);
 
 	call_rcu(&he->rcu, hot_inode_item_free_cb);
@@ -186,6 +242,8 @@ static struct hot_inode_item
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct hot_inode_item *he, *he_new = NULL;
+	u32 temp;
+	u8 temp_cur;
 
 	/* walk tree to find insertion point */
 redo:
@@ -217,6 +275,10 @@ redo:
 	if (he_new) {
 		rb_link_node(&he_new->rb_node, parent, p);
 		rb_insert_color(&he_new->rb_node, &root->hot_inode_tree);
+		temp = hot_temp_calc(&he_new->freq);
+		temp_cur = (u8)(temp >> (32 - MAP_BITS));
+		list_add_tail(&he_new->track_list,
+			&root->hot_map[TYPE_INODE][temp_cur]);
 		hot_inode_item_get(he_new); /* For the caller */
 		spin_unlock(&root->t_lock);
 		return he_new;
@@ -283,6 +345,29 @@ void hot_inode_item_unlink(struct inode *inode)
 }
 
 /*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature.
+ */
+static void hot_inode_map_update(struct hot_info *root,
+			struct hot_inode_item *he)
+{
+	u32 temp = hot_temp_calc(&he->freq);
+	u8 temp_cur = (u8)(temp >> (32 - MAP_BITS));
+	u8 temp_prev = (u8)(he->freq.last_temp >> (32 - MAP_BITS));
+
+	spin_lock(&root->t_lock);
+	if (!list_empty(&he->track_list)
+		&& (temp_cur != temp_prev)) {
+		he->freq.last_temp = temp;
+		list_del_init(&he->track_list);
+		list_add_tail(&he->track_list,
+			&root->hot_map[TYPE_INODE][temp_cur]);
+	}
+	spin_unlock(&root->t_lock);
+}
+
+/*
  * This function does the actual work of updating
  * the frequency numbers.
  *
@@ -328,6 +413,106 @@ static void hot_freq_update(struct hot_info *root,
 }
 
 /*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria down into a single temperature value for the data,
+ * which is an integer between 0 and HEAT_MAX_VALUE.
+ *
+ * With the six values, we first do some very rudimentary
+ * "normalizations" to each metric such that they affect the
+ * final temperature calculation exactly the right way. It's
+ * important to note that we still weren't really sure that
+ * these six adjustments were exactly right.
+ * They could definitely use more tweaking and adjustment,
+ * especially in terms of the memory footprint they consume.
+ *
+ * Next, we take the adjusted values and shift them down to
+ * a manageable size, whereafter they are weighted using the
+ * *_COEFF_POWER values and combined to a single temperature
+ * value.
+ */
+u32 hot_temp_calc(struct hot_freq *freq)
+{
+	u32 result = 0;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+	u32 nrr_heat, nrw_heat;
+	u64 ltr_heat, ltw_heat, avr_heat, avw_heat;
+
+	nrr_heat = (u32)(freq->nr_reads << NRR_MULTIPLIER_POWER);
+	nrw_heat = (u32)(freq->nr_writes << NRW_MULTIPLIER_POWER);
+
+	ltr_heat = (cur_time - timespec_to_ns(&freq->last_read_time))
+			>> LTR_DIVIDER_POWER;
+	ltw_heat = (cur_time - timespec_to_ns(&freq->last_write_time))
+			>> LTW_DIVIDER_POWER;
+
+	avr_heat = (((u64) -1) - freq->avg_delta_reads)
+			>> AVR_DIVIDER_POWER;
+	avw_heat = (((u64) -1) - freq->avg_delta_writes)
+			>> AVW_DIVIDER_POWER;
+
+	/* ltr_heat is now guaranteed to be u32 safe */
+	if (ltr_heat >= ((u64)1 << 32))
+		ltr_heat = 0;
+	else
+		ltr_heat = ((u64)1 << 32) - ltr_heat;
+
+	/* ltw_heat is now guaranteed to be u32 safe */
+	if (ltw_heat >= ((u64)1 << 32))
+		ltw_heat = 0;
+	else
+		ltw_heat = ((u64)1 << 32) - ltw_heat;
+
+	/* avr_heat is now guaranteed to be u32 safe */
+	if (avr_heat >= ((u64)1 << 32))
+		avr_heat = (u32)-1;
+
+	/* avw_heat is now guaranteed to be u32 safe */
+	if (avw_heat >= ((u64)1 << 32))
+		avw_heat = (u32)-1;
+
+	nrr_heat = (u32)((u64)nrr_heat >> (3 - NRR_COEFF_POWER));
+	nrw_heat = (u32)((u64)nrw_heat >> (3 - NRW_COEFF_POWER));
+	ltr_heat = (ltr_heat >> (3 - LTR_COEFF_POWER));
+	ltw_heat = (ltw_heat >> (3 - LTW_COEFF_POWER));
+	avr_heat = (avr_heat >> (3 - AVR_COEFF_POWER));
+	avw_heat = (avw_heat >> (3 - AVW_COEFF_POWER));
+
+	result = nrr_heat + nrw_heat + (u32) ltr_heat +
+		(u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+	return result;
+}
+
+/*
+ * Every sync period we update temperatures for
+ * each hot inode item and hot range item for aging
+ * purposes.
+ */
+static void hot_update_worker(struct work_struct *work)
+{
+	struct hot_info *root = container_of(to_delayed_work(work),
+					struct hot_info, update_work);
+	struct hot_inode_item *he;
+	struct rb_node *node;
+
+	rcu_read_lock();
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		he = rb_entry(node, struct hot_inode_item, rb_node);
+		node = rb_next(node);
+		hot_inode_map_update(root, he);
+		hot_range_tree_update(he, root);
+	}
+	rcu_read_unlock();
+
+	/* Insert next delayed work */
+	queue_delayed_work(root->update_wq, &root->update_work,
+		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 void __init hot_cache_init(void)
@@ -409,6 +594,26 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 
 	root->hot_inode_tree = RB_ROOT;
 	spin_lock_init(&root->t_lock);
+	spin_lock_init(&root->m_lock);
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		for (j = 0; j < MAX_TYPES; j++)
+			INIT_LIST_HEAD(&root->hot_map[j][i]);
+	}
+
+	root->update_wq = alloc_workqueue(
+			"hot_update_wq", WQ_NON_REENTRANT, 0);
+	if (!root->update_wq) {
+		printk(KERN_ERR "%s: Failed to create "
+				"hot update workqueue\n", __func__);
+		kfree(root);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Initialize hot tracking wq and arm one delayed work */
+	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
+	queue_delayed_work(root->update_wq, &root->update_work,
+		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 
 	return root;
 }
@@ -421,6 +626,9 @@ static void hot_tree_exit(struct hot_info *root)
 	struct hot_inode_item *he;
 	struct rb_node *node;
 
+	cancel_delayed_work_sync(&root->update_work);
+	destroy_workqueue(root->update_wq);
+
 	spin_lock(&root->t_lock);
 	node = rb_first(&root->hot_inode_tree);
 	while (node) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 56ab13c..4a89fdb 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -12,15 +12,40 @@
 #ifndef __HOT_TRACKING__
 #define __HOT_TRACKING__
 
+#include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
+#define HOT_UPDATE_INTERVAL 150
+
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
 
+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20 /* NRW - number of writes since mount */
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30 /* LTR - time elapsed since last read(ns) */
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30 /* LTW - time elapsed since last write(ns) */
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ */
+#define AVR_DIVIDER_POWER 40 /* AVR - average delta between recent reads(ns) */
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
+#define AVW_COEFF_POWER 0
+
 void __init hot_cache_init(void);
 void hot_inode_item_put(struct hot_inode_item *he);
 struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root, u64 ino);
 void hot_inode_item_unlink(struct inode *inode);
+u32 hot_temp_calc(struct hot_freq *freq);
 
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 5f02025..3f82610 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -52,6 +52,7 @@ struct hot_inode_item {
 	struct kref refs;
 	struct rb_node rb_node;         /* rbtree index */
 	struct rcu_head rcu;
+	struct list_head track_list;    /* link to *_map[] */
 	struct rb_root hot_range_tree;	/* tree of ranges */
 	spinlock_t i_lock;		/* protect above tree */
 	struct hot_info *hot_root;	/* associated hot_info */
@@ -67,6 +68,7 @@ struct hot_range_item {
 	struct kref refs;
 	struct rb_node rb_node;                 /* rbtree index */
 	struct rcu_head rcu;
+	struct list_head track_list;            /* link to *_map[] */
 	struct hot_inode_item *hot_inode;	/* associated hot_inode_item */
 	loff_t start;				/* offset in bytes */
 	size_t len;				/* length in bytes */
@@ -74,7 +76,11 @@ struct hot_range_item {
 
 struct hot_info {
 	struct rb_root hot_inode_tree;
-	spinlock_t t_lock;				/* protect above tree */
+	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
+	spinlock_t t_lock;		/* protect tree and map for inode item */
+	spinlock_t m_lock;		/* protect map for range item */
+	struct workqueue_struct *update_wq;
+	struct delayed_work update_work;
 };
 
 extern int hot_track_init(struct super_block *sb);
-- 
1.7.11.7
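
For readers following hot_temp_calc() above, here is a small userspace
sketch of how one of the six inputs, the time since the last read (LTR),
appears to be turned into a heat contribution: the age is scaled down,
inverted so that a smaller age means more heat, clamped, and then
weighted. The function name toy_age_to_heat() is hypothetical and this
is my reading of the LTR path only; the constants are copied from
fs/hot_tracking.h in this patch.

```c
#include <stdint.h>

#define LTR_DIVIDER_POWER 30	/* heat unit = 2^30 ns of age */
#define LTR_COEFF_POWER    1	/* weight of this metric */

/* Hypothetical helper mirroring the LTR steps of hot_temp_calc(). */
static inline uint32_t toy_age_to_heat(uint64_t age_ns)
{
	uint64_t units = age_ns >> LTR_DIVIDER_POWER;
	uint64_t heat;

	/* Very old items contribute no heat at all. */
	if (units >= ((uint64_t)1 << 32))
		heat = 0;
	else
		heat = ((uint64_t)1 << 32) - units;

	/* Weighting: shift right by (3 - COEFF_POWER). */
	return (uint32_t)(heat >> (3 - LTR_COEFF_POWER));
}
```

With these constants, an item read just now contributes 2^30, and each
additional 2^30 ns (about one second) of age lowers the pre-weighting
value by one unit.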



* [PATCH v6 04/11] VFS hot tracking: Add shrinker functionality to curtail memory usage
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (2 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 03/11] VFS hot tracking: Add a workqueue to move items between hot maps Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 05/11] VFS hot tracking: Add an ioctl to get hot tracking information Zhi Yong Wu
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Register a shrinker to control the amount of memory that
is used in tracking hot regions. If we are throwing inodes
out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking
code is using, even if it means losing useful information.
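
The eviction policy can be pictured with the toy model below: walk the
temperature-indexed map from the coldest slot upward and drop idle items
until the requested budget is used up. This is a simplified userspace
sketch under stated assumptions, not the kernel code; struct toy_map and
toy_evict() are hypothetical names, and the model ignores the fact that
freeing an inode item in the patch also frees its range items.

```c
#include <stddef.h>

#define MAP_SIZE 256

/*
 * Hypothetical stand-in for the temperature-indexed map:
 * idle[i] is the number of unreferenced items in slot i.
 */
struct toy_map {
	unsigned long idle[MAP_SIZE];
};

/* Evict up to 'budget' idle items, coldest slot first. */
static unsigned long toy_evict(struct toy_map *map, unsigned long budget)
{
	unsigned long freed = 0;
	size_t i;

	for (i = 0; i < MAP_SIZE && freed < budget; i++) {
		unsigned long take = map->idle[i];

		if (take > budget - freed)
			take = budget - freed;
		map->idle[i] -= take;
		freed += take;
	}

	return freed;	/* number of items actually released */
}
```

Scanning from index 0 means the least recently useful information is
given up first, which fits the intent of losing as little useful heat
data as possible under memory pressure.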

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 91 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h |  2 +
 2 files changed, 93 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 35d3b83..ac0cdda 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -29,6 +29,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 	hr->start = start;
 	hr->len = 1 << RANGE_BITS;
 	hr->hot_inode = he;
+	atomic_long_inc(&he->hot_root->hot_cnt);
 }
 
 static void hot_range_item_free_cb(struct rcu_head *head)
@@ -51,6 +52,7 @@ static void hot_range_item_free(struct kref *kref)
 		list_del_init(&hr->track_list);
 	spin_unlock(&root->m_lock);
 
+	atomic_long_dec(&root->hot_cnt);
 	call_rcu(&hr->rcu, hot_range_item_free_cb);
 }
 
@@ -98,6 +100,7 @@ redo:
 				 * the item for the range. Free the
 				 * newly allocated item.
 				 */
+				atomic_long_dec(&he->hot_root->hot_cnt);
 				kmem_cache_free(hot_range_item_cachep, hr_new);
 			}
 			spin_unlock(&he->i_lock);
@@ -199,6 +202,7 @@ static void hot_inode_item_init(struct hot_inode_item *he,
 	he->ino = ino;
 	he->hot_root = root;
 	spin_lock_init(&he->i_lock);
+	atomic_long_inc(&root->hot_cnt);
 }
 
 static void hot_inode_item_free_cb(struct rcu_head *head)
@@ -219,6 +223,7 @@ static void hot_inode_item_free(struct kref *kref)
 		list_del_init(&he->track_list);
 	hot_range_tree_free(he);
 
+	atomic_long_dec(&he->hot_root->hot_cnt);
 	call_rcu(&he->rcu, hot_inode_item_free_cb);
 }
 
@@ -264,6 +269,7 @@ redo:
 				 * the item for the inode. Free the
 				 * newly allocated item.
 				 */
+				atomic_long_dec(&root->hot_cnt);
 				kmem_cache_free(hot_inode_item_cachep, he_new);
 			}
 			spin_unlock(&root->t_lock);
@@ -485,6 +491,47 @@ u32 hot_temp_calc(struct hot_freq *freq)
 	return result;
 }
 
+static unsigned long hot_item_evict(struct hot_info *root, unsigned long work,
+			unsigned long (*work_get)(struct hot_info *root))
+{
+	long budget = work;
+	unsigned long freed = 0;
+	int i;
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		struct hot_inode_item *he, *next;
+
+		spin_lock(&root->t_lock);
+		if (list_empty(&root->hot_map[TYPE_INODE][i])) {
+			spin_unlock(&root->t_lock);
+			continue;
+		}
+
+		list_for_each_entry_safe(he, next,
+			&root->hot_map[TYPE_INODE][i], track_list) {
+			long work_prev, delta;
+
+			if (atomic_read(&he->refs.refcount) > 1)
+				continue;
+			work_prev = work_get(root);
+			hot_inode_item_put(he);
+			delta = work_prev - work_get(root);
+			budget -= delta;
+			freed += delta;
+			if (unlikely(budget <= 0))
+				break;
+		}
+		spin_unlock(&root->t_lock);
+
+		if (unlikely(budget <= 0))
+			break;
+
+		cond_resched();
+	}
+
+	return freed;
+}
+
 /*
  * Every sync period we update temperatures for
  * each hot inode item and hot range item for aging
@@ -528,6 +575,41 @@ void __init hot_cache_init(void)
 		kmem_cache_destroy(hot_inode_item_cachep);
 }
 
+static unsigned long hot_track_shrink_count(struct shrinker *shrink,
+			struct shrink_control *sc)
+{
+	struct hot_info *root =
+		container_of(shrink, struct hot_info, hot_shrink);
+
+	return (unsigned long)atomic_long_read(&root->hot_cnt);
+}
+
+static inline unsigned long hot_cnt_get(struct hot_info *root)
+{
+	return (unsigned long)atomic_long_read(&root->hot_cnt);
+}
+
+static unsigned long hot_prune_map(struct hot_info *root, unsigned long nr)
+{
+	return hot_item_evict(root, nr, hot_cnt_get);
+}
+
+/* The shrinker callback function */
+static unsigned long hot_track_shrink_scan(struct shrinker *shrink,
+			struct shrink_control *sc)
+{
+	struct hot_info *root =
+		container_of(shrink, struct hot_info, hot_shrink);
+	unsigned long freed;
+
+	if (!(sc->gfp_mask & __GFP_FS))
+		return SHRINK_STOP;
+
+	freed = hot_prune_map(root, sc->nr_to_scan);
+
+	return freed;
+}
+
 /*
  * Main function to update i/o access frequencies, and it will be called
  * from read/writepages() hooks, which are read_pages(), do_writepages(),
@@ -595,6 +677,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	root->hot_inode_tree = RB_ROOT;
 	spin_lock_init(&root->t_lock);
 	spin_lock_init(&root->m_lock);
+	atomic_long_set(&root->hot_cnt, 0);
 
 	for (i = 0; i < MAP_SIZE; i++) {
 		for (j = 0; j < MAX_TYPES; j++)
@@ -615,6 +698,13 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	queue_delayed_work(root->update_wq, &root->update_work,
 		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 
+	/* Register a shrinker callback */
+	root->hot_shrink.count_objects = hot_track_shrink_count;
+	root->hot_shrink.scan_objects = hot_track_shrink_scan;
+	root->hot_shrink.seeks = DEFAULT_SEEKS;
+	root->hot_shrink.flags = SHRINKER_NUMA_AWARE;
+	register_shrinker(&root->hot_shrink);
+
 	return root;
 }
 
@@ -626,6 +716,7 @@ static void hot_tree_exit(struct hot_info *root)
 	struct hot_inode_item *he;
 	struct rb_node *node;
 
+	unregister_shrinker(&root->hot_shrink);
 	cancel_delayed_work_sync(&root->update_work);
 	destroy_workqueue(root->update_wq);
 
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 3f82610..67468a51 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -79,8 +79,10 @@ struct hot_info {
 	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
 	spinlock_t t_lock;		/* protect tree and map for inode item */
 	spinlock_t m_lock;		/* protect map for range item */
+	atomic_long_t hot_cnt;
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
+	struct shrinker hot_shrink;
 };
 
 extern int hot_track_init(struct super_block *sb);
-- 
1.7.11.7



* [PATCH v6 05/11] VFS hot tracking: Add an ioctl to get hot tracking information
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (3 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 04/11] VFS hot tracking: Add shrinker functionality to curtail memory usage Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 06/11] VFS hot tracking: Add a /proc interface to make the interval tunable Zhi Yong Wu
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq structs, and also return a
calculated data temperature based on those metrics.

Optionally, retrieve the temperature from the map list
instead of recalculating it.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/compat_ioctl.c                 |  5 +++
 fs/ioctl.c                        | 71 +++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h      | 10 +++++-
 include/uapi/linux/hot_tracking.h | 33 ++++++++++++++++++
 4 files changed, 118 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/hot_tracking.h

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5d19acf..9026b8a 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1399,6 +1400,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
    but we don't want warnings on other file systems. So declare
    them as compatible here. */
@@ -1578,6 +1582,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
+	case FS_IOC_GET_HEAT_INFO:
 		if (S_ISREG(file_inode(f.file)->i_mode))
 			break;
 		/*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index fd507fb..b3693ea 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include "hot_tracking.h"
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,73 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given inode.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the map list, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *inode = file_inode(file);
+	struct hot_info *root = inode->i_sb->s_hot_root;
+	struct hot_heat_info heat_info;
+	struct hot_inode_item *he;
+	int ret = 0;
+
+	/* The 'live' field needs to be read from user space */
+	if (copy_from_user((void *)&heat_info,
+			argp,
+			sizeof(struct hot_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	he = hot_inode_item_lookup(root, inode->i_ino);
+	if (IS_ERR(he)) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	heat_info.avg_delta_reads =
+		(__u64) he->freq.avg_delta_reads;
+	heat_info.avg_delta_writes =
+		(__u64) he->freq.avg_delta_writes;
+	heat_info.last_read_time =
+		(__u64) timespec_to_ns(&he->freq.last_read_time);
+	heat_info.last_write_time =
+		(__u64) timespec_to_ns(&he->freq.last_write_time);
+	heat_info.num_reads = (__u32) he->freq.nr_reads;
+	heat_info.num_writes = (__u32) he->freq.nr_writes;
+
+	if (heat_info.live > 0) {
+		/*
+		 * got a request for live temperature,
+		 * call hot_temp_calc() to recalculate
+		 */
+		heat_info.temp = hot_temp_calc(&he->freq);
+	} else {
+		/* not live temperature, get it from the map list */
+		heat_info.temp = he->freq.last_temp;
+	}
+
+	spin_lock(&root->t_lock);
+	hot_inode_item_put(he);
+	spin_unlock(&root->t_lock);
+
+	if (copy_to_user(argp, (void *)&heat_info,
+			sizeof(struct hot_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+err:
+	return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +659,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FS_IOC_GET_HEAT_INFO:
+		return ioctl_heat_info(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 67468a51..0ee9ca2 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -15,11 +15,11 @@
 #ifndef _LINUX_HOTTRACK_H
 #define _LINUX_HOTTRACK_H
 
-#include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/kref.h>
 #include <linux/fs.h>
+#include <uapi/linux/hot_tracking.h>
 
 #define MAP_BITS 8
 #define MAP_SIZE (1 << MAP_BITS)
@@ -85,6 +85,14 @@ struct hot_info {
 	struct shrinker hot_shrink;
 };
 
+/*
+ * Hot data tracking ioctls:
+ *
+ * FS_IOC_GET_HEAT_INFO - retrieve info on frequency of access
+ */
+#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, \
+			struct hot_heat_info)
+
 extern int hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
 extern void hot_freqs_update(struct inode *inode, loff_t start,
diff --git a/include/uapi/linux/hot_tracking.h b/include/uapi/linux/hot_tracking.h
new file mode 100644
index 0000000..09dfc00
--- /dev/null
+++ b/include/uapi/linux/hot_tracking.h
@@ -0,0 +1,33 @@
+/*
+ *  include/uapi/linux/hot_tracking.h
+ *
+ * This file has definitions for VFS hot tracking
+ * structures etc.
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef _UAPI_HOTTRACK_H
+#define _UAPI_HOTTRACK_H
+
+#include <linux/types.h>
+
+struct hot_heat_info {
+	__u8 live;
+	__u8 resv[3];
+	__u32 temp;
+	__u64 avg_delta_reads;
+	__u64 avg_delta_writes;
+	__u64 last_read_time;
+	__u64 last_write_time;
+	__u32 num_reads;
+	__u32 num_writes;
+	__u64 future[4]; /* For future expansions */
+};
+
+#endif /* _UAPI_HOTTRACK_H */
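
A rough userspace sketch of how a caller might use the new ioctl, for
reviewers who want to try it. It assumes the structure layout and the
_IOR('f', 17, ...) encoding shown in this patch; the helper name
get_heat_info() and the use of <stdint.h> types in place of __u8/__u32/__u64
are illustrative assumptions, not part of the series.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Userspace mirror of struct hot_heat_info from this patch (80 bytes). */
struct hot_heat_info {
	uint8_t  live;		/* in: nonzero asks for a recalculated temp */
	uint8_t  resv[3];
	uint32_t temp;
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
	uint64_t last_read_time;	/* ns */
	uint64_t last_write_time;	/* ns */
	uint32_t num_reads;
	uint32_t num_writes;
	uint64_t future[4];
};

/* Same encoding as the patch. */
#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, struct hot_heat_info)

/*
 * Hypothetical helper: returns 0 and fills *info on success,
 * or a negative errno value on failure.
 */
static int get_heat_info(const char *path, int live,
			 struct hot_heat_info *info)
{
	int fd, ret = 0;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	memset(info, 0, sizeof(*info));
	info->live = live ? 1 : 0;
	if (ioctl(fd, FS_IOC_GET_HEAT_INFO, info) < 0)
		ret = -errno;	/* saved before close() can change errno */

	close(fd);
	return ret;
}
```

Per the handler above, a file with no tracking data yet should yield
-ENODATA. Since the kernel both reads 'live' from and writes results to the
structure, _IOWR might describe the direction more accurately than _IOR;
that is an observation, not something this sketch depends on.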
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 06/11] VFS hot tracking: Add a /proc interface to make the interval tunable
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (4 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 05/11] VFS hot tracking: Add an ioctl to get hot tracking information Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage Zhi Yong Wu
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add a /proc interface, hot-update-interval, under /proc/sys/fs/
in order to turn HOT_UPDATE_INTERVAL into a tunable parameter.
The value is in seconds and defaults to 150.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 6 ++++--
 fs/hot_tracking.h            | 2 --
 include/linux/hot_tracking.h | 3 +++
 kernel/sysctl.c              | 7 +++++++
 4 files changed, 14 insertions(+), 4 deletions(-)
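
A note on value ranges: proc_dointvec does no range checking, so zero or a
negative number can be written to hot-update-interval. The userspace sketch
below shows the kind of check that could guard the
"interval * MSEC_PER_SEC" computation; the helper name and the chosen bounds
are assumptions for illustration, not part of this patch. In the kernel, a
similar effect is usually obtained with proc_dointvec_minmax and .extra1.

```c
#include <limits.h>
#include <stdbool.h>

#define MSEC_PER_SEC 1000U

/*
 * Hypothetical check: accept an interval only if it is at least one
 * second and seconds * MSEC_PER_SEC still fits in an unsigned int.
 */
static bool hot_interval_to_msecs(int seconds, unsigned int *msecs)
{
	if (seconds < 1 || (unsigned int)seconds > UINT_MAX / MSEC_PER_SEC)
		return false;

	*msecs = (unsigned int)seconds * MSEC_PER_SEC;
	return true;
}
```

With the default of 150 seconds this yields 150000 ms; 0, negative values
and very large values are rejected.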

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index ac0cdda..7a9bd4f 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,8 @@
 #include <linux/sched.h>
 #include "hot_tracking.h"
 
+int sysctl_hot_update_interval __read_mostly = 150;
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -556,7 +558,7 @@ static void hot_update_worker(struct work_struct *work)
 
 	/* Instert next delayed work */
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 }
 
 /*
@@ -696,7 +698,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	/* Initialize hot tracking wq and arm one delayed work */
 	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 
 	/* Register a shrinker callback */
 	root->hot_shrink.count_objects = hot_track_shrink_count;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 4a89fdb..6a6971e 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -15,8 +15,6 @@
 #include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
-#define HOT_UPDATE_INTERVAL 150
-
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 0ee9ca2..43df1b9 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -85,6 +85,9 @@ struct hot_info {
 	struct shrinker hot_shrink;
 };
 
+/* set how often to update temperatures (seconds) */
+extern int sysctl_hot_update_interval;
+
 /*
  * Hot data tracking ioctls:
  *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..e0b062a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1631,6 +1631,13 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= &pipe_proc_fn,
 		.extra1		= &pipe_min_size,
 	},
+	{
+		.procname	= "hot-update-interval",
+		.data		= &sysctl_hot_update_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (5 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 06/11] VFS hot tracking: Add a /proc interface to make the interval tunable Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-11 22:15   ` Dave Hansen
  2013-12-11 15:44   ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 08/11] VFS hot tracking: Add documentation Zhi Yong Wu
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce a /proc interface, hot-mem-high-thresh, to cap the
memory consumed by hot_inode_item and hot_range_item objects.
The value is in units of 1 MiB; the default of 0 disables
the cap.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 29 +++++++++++++++++++++++++++++
 fs/hot_tracking.h            | 23 +++++++++++++++++++++++
 include/linux/hot_tracking.h |  3 +++
 kernel/sysctl.c              |  7 +++++++
 4 files changed, 62 insertions(+)
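
For reference, the threshold arithmetic in hot_mem_evict() can be sketched in
userspace as below. This is only an illustration under stated assumptions:
the helper name is made up, a 64-bit type is used so that thresholds of
4096 MiB and above cannot wrap (the patch uses unsigned long, which is 32 bits
wide on 32-bit kernels), and negative values are treated like 0, which the
patch itself does not do.

```c
#include <stdint.h>

/*
 * Bytes by which tracked memory exceeds the cap, or 0 when usage is
 * under the cap or the cap is disabled. thresh_mib plays the role of
 * sysctl_hot_mem_high_thresh (units of 1 MiB).
 */
static uint64_t hot_mem_excess(uint64_t sum_bytes, int thresh_mib)
{
	uint64_t thresh_bytes;

	if (thresh_mib <= 0)	/* assumption: negative behaves like 0 */
		return 0;

	thresh_bytes = (uint64_t)thresh_mib * 1024 * 1024;
	if (sum_bytes <= thresh_bytes)
		return 0;

	return sum_bytes - thresh_bytes;
}
```

In the patch, the corresponding difference (sum - thresh) is what gets passed
to hot_item_evict() as the amount of work.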

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 7a9bd4f..2c5a7fd 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,7 @@
 #include <linux/sched.h>
 #include "hot_tracking.h"
 
+int sysctl_hot_mem_high_thresh __read_mostly = 0;
 int sysctl_hot_update_interval __read_mostly = 150;
 
 /* kmem_cache pointers for slab caches */
@@ -32,6 +33,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 	hr->len = 1 << RANGE_BITS;
 	hr->hot_inode = he;
 	atomic_long_inc(&he->hot_root->hot_cnt);
+	hot_mem_limit_add(he->hot_root, sizeof(struct hot_range_item));
 }
 
 static void hot_range_item_free_cb(struct rcu_head *head)
@@ -55,6 +57,7 @@ static void hot_range_item_free(struct kref *kref)
 	spin_unlock(&root->m_lock);
 
 	atomic_long_dec(&root->hot_cnt);
+	hot_mem_limit_sub(root, sizeof(struct hot_range_item));
 	call_rcu(&hr->rcu, hot_range_item_free_cb);
 }
 
@@ -103,6 +106,8 @@ redo:
 				 * newly allocated item.
 				 */
 				atomic_long_dec(&he->hot_root->hot_cnt);
+				hot_mem_limit_sub(he->hot_root,
+						sizeof(struct hot_range_item));
 				kmem_cache_free(hot_range_item_cachep, hr_new);
 			}
 			spin_unlock(&he->i_lock);
@@ -205,6 +210,7 @@ static void hot_inode_item_init(struct hot_inode_item *he,
 	he->hot_root = root;
 	spin_lock_init(&he->i_lock);
 	atomic_long_inc(&root->hot_cnt);
+	hot_mem_limit_add(root, sizeof(struct hot_inode_item));
 }
 
 static void hot_inode_item_free_cb(struct rcu_head *head)
@@ -226,6 +232,7 @@ static void hot_inode_item_free(struct kref *kref)
 	hot_range_tree_free(he);
 
 	atomic_long_dec(&he->hot_root->hot_cnt);
+	hot_mem_limit_sub(he->hot_root, sizeof(struct hot_inode_item));
 	call_rcu(&he->rcu, hot_inode_item_free_cb);
 }
 
@@ -272,6 +279,8 @@ redo:
 				 * newly allocated item.
 				 */
 				atomic_long_dec(&root->hot_cnt);
+				hot_mem_limit_sub(root,
+						sizeof(struct hot_inode_item));
 				kmem_cache_free(hot_inode_item_cachep, he_new);
 			}
 			spin_unlock(&root->t_lock);
@@ -534,6 +543,23 @@ static unsigned long hot_item_evict(struct hot_info *root, unsigned long work,
 	return freed;
 }
 
+static void hot_mem_evict(struct hot_info *root)
+{
+	unsigned long sum, thresh;
+
+	if (sysctl_hot_mem_high_thresh == 0)
+		return;
+
+	sum = hot_mem_limit_sum(root);
+	/* Note: sysctl_hot_mem_high_thresh is in units of 1 MiB */
+	thresh = sysctl_hot_mem_high_thresh;
+	thresh *= 1024 * 1024;
+	if (sum <= thresh)
+		return;
+
+	hot_item_evict(root, sum - thresh, hot_mem_limit_sum);
+}
+
 /*
  * Every sync period we update temperatures for
  * each hot inode item and hot range item for aging
@@ -546,6 +572,8 @@ static void hot_update_worker(struct work_struct *work)
 	struct hot_inode_item *he;
 	struct rb_node *node;
 
+	hot_mem_evict(root);
+
 	rcu_read_lock();
 	node = root->hot_inode_tree.rb_node;
 	while (node) {
@@ -753,6 +781,7 @@ int hot_track_init(struct super_block *sb)
 		goto err;
 	}
 
+	hot_mem_limit_init(root);
 	sb->s_hot_root = root;
 	sb->s_flags |= MS_HOTTRACK;
 
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 6a6971e..4ee0b90 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -46,4 +46,27 @@ struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root, u64 ino);
 void hot_inode_item_unlink(struct inode *inode);
 u32 hot_temp_calc(struct hot_freq *freq);
 
+/* Memory Tracking Functions. */
+static inline unsigned long hot_mem_limit_sum(struct hot_info *root)
+{
+	return atomic_long_read(&root->mem);
+}
+
+static inline void hot_mem_limit_sub(struct hot_info *root,
+				unsigned long count)
+{
+	atomic_long_sub(count, &root->mem);
+}
+
+static inline void hot_mem_limit_add(struct hot_info *root,
+				unsigned long count)
+{
+	atomic_long_add(count, &root->mem);
+}
+
+static inline void hot_mem_limit_init(struct hot_info *root)
+{
+	atomic_long_set(&root->mem, 0);
+}
+
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 43df1b9..5c2c247 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -83,10 +83,13 @@ struct hot_info {
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
 	struct shrinker hot_shrink;
+	atomic_long_t mem;
 };
 
 /* set how often to update temperatures (seconds) */
 extern int sysctl_hot_update_interval;
+/* note: sysctl_hot_mem_high_thresh is in units of 1 MiB */
+extern int sysctl_hot_mem_high_thresh;
 
 /*
  * Hot data tracking ioctls:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e0b062a..fde8bc2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1632,6 +1632,13 @@ static struct ctl_table fs_table[] = {
 		.extra1		= &pipe_min_size,
 	},
 	{
+		.procname	= "hot-mem-high-thresh",
+		.data		= &sysctl_hot_mem_high_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "hot-update-interval",
 		.data		= &sysctl_hot_update_interval,
 		.maxlen		= sizeof(int),
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 08/11] VFS hot tracking: Add documentation
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (6 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 09/11] VFS hot tracking, btrfs: Add hot tracking support Zhi Yong Wu
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add documentation for the VFS hot tracking feature.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 Documentation/filesystems/00-INDEX         |   2 +
 Documentation/filesystems/hot_tracking.txt | 207 +++++++++++++++++++++++++++++
 2 files changed, 209 insertions(+)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
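
Section 4 of the new document describes the frequency update as a cheap
stand-in for a weighted average controlled by FREQ_POWER. As a reading aid,
here is a textbook shift-based moving average with the same flavour. It is an
assumption-laden sketch, not the arithmetic used by hot_freq_calc(): only the
FREQ_POWER value of 4 comes from fs/hot_tracking.h, and the function name is
hypothetical.

```c
#include <stdint.h>

#define FREQ_POWER 4	/* value taken from fs/hot_tracking.h */

/*
 * Shift-based moving average: the newest delta contributes 2^-FREQ_POWER
 * of its value and the old average keeps the remainder. Low bits of both
 * terms are shifted out, which is the precision loss the text mentions.
 */
static uint64_t hot_avg_update(uint64_t old_avg, uint64_t delta_ns)
{
	return old_avg - (old_avg >> FREQ_POWER) + (delta_ns >> FREQ_POWER);
}
```

For example, an average of 1600 ns stays at 1600 when the next delta is also
1600 ns, and decays to 1500 when the next delta is 0. A delta below 16 ns
contributes nothing at all.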

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8042050..46b2f6f 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -122,3 +122,5 @@ xfs.txt
 	- info and mount options for the XFS filesystem.
 xip.txt
 	- info on execute-in-place for file mappings.
+hot_tracking.txt
+	- info on hot tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..df184b9
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,207 @@
+Hot Data Tracking
+
+April, 2013		Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+  This feature adds support for tracking data temperature information
+in the VFS layer. Essentially, this means maintaining some key stats
+(such as the number of reads/writes, the last read/write time, and the
+frequency of reads/writes), then distilling those numbers down to a
+single "temperature" value that reflects which data is "hot". A
+filesystem can use this information to move hot data from slow devices
+to fast devices.
+
+  The long-term goal of the feature is to allow some filesystems,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project was motivated by
+the Project Ideas page on the Btrfs wiki.
+
+
+2. Motivation
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+<https://btrfs.wiki.kernel.org/index.php/Project_ideas>.
+The work divides into two parts: the VFS provides hot data tracking,
+while a specific filesystem provides hot data relocation.
+As the first step toward this goal, this feature provides the first
+part of the functionality.
+
+
+3. The Design
+
+The design includes the following parts:
+
+    * Hooks in existing vfs functions to track data access frequency
+
+    * New rb-trees for tracking access frequency of inodes and sub-file
+ranges
+    The super_block reaches the rb-trees through
+super_block.s_hot_root (a hot_info), which holds hot_inode_tree.
+    Each FS instance finds its hot tracking info via s_hot_root.
+    hot_info has hot_inode_tree, which holds per-inode hot information;
+each inode item has a hot_range_tree, which holds per-range information.
+
+    * Lists of hot inodes and hot ranges, indexed by temperature
+
+    * A work queue for updating inode heat info
+
+    * A mount option for enabling temperature tracking (-o hot_track;
+disabled by default)
+    * An ioctl to retrieve the frequency information collected for a certain
+inode
+
+Their relationships are as follows:
+
+    * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+    * hot_inode_item contains access frequency data for that inode
+
+    * hot_inode_item holds a track list node to link the access frequency
+data for that inode
+
+    * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+    * hot_range_item contains access frequency data for that range
+
+    * hot_range_item holds a track list node to link the access frequency
+data for that range
+
+    * hot_info.hot_map[TYPE_INODE] indexes per-inode track list nodes
+
+    * hot_info.hot_map[TYPE_RANGE] indexes per-range track list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+                          super_block
+                              |
+                              V
+                           hot_info
+                              |
+    +-------------------------+----------------------------------------+
+    |                         |                                        |
+    |                         |                                        |
+    V                         V                                        V
+heat_inode_map           hot_inode_tree                         heat_range_map
+    |                         |                                        |
+    |                         V hot_inode_item                         |
+    |           +----------list_head---------+                         |
+    |           |       frequency data       |                         |
++---+           |                            |                         |
+|               V hot_inode_item             V hot_inode_item          |
+|....<-----list-head--->...      ...<----list_head---->...             |
+        frequency data                 frequency data                  |
+         hot_range_tree                hot_range_tree                  |
+                                             |                         |
+                                             V hot_range_item          |
+                               +---------list_head----------+          |
+                               |       frequency data       |          |
+                               |            ^               |          +---+
+                hot_range_item V            | |             Vhot_range_item|
+                        <--list_head-->...  | |  ...<--list_head-->....... |
+                        frequency data               frequency data
+
+
+4. How to Calc Frequency of Reads/Writes & Temperature
+
+1.) hot_freq_calc()
+
+  This function does the actual work of updating the frequency numbers.
+FREQ_POWER determines how many atime deltas we keep track of (as a power of 2).
+So, setting it to anything above 16ish is probably overkill. Also,
+the higher the power, the more bits get right shifted out of the timestamp,
+reducing precision, so take note of that as well.
+
+  FREQ_POWER, defined in fs/hot_tracking.h, determines how heavily to weight
+the current frequency numbers against the newest access. For example, a value
+of 4 means that the new access information will be weighted 1/16th (i.e. 2^-4)
+as heavily as the existing frequency info. In essence, this is a kludged-
+together version of a weighted average, since we can't afford to keep all of
+the information that it would take to get a _real_ weighted average.
+
+2.) hot_temp_calc()
+
+  The following explains what exactly comprises a unit of heat.
+Each of six heat values is calculated and combined in order to form an
+overall temperature for the data:
+
+    * NRR - number of reads since mount
+    * NRW - number of writes since mount
+    * LTR - time elapsed since last read (ns)
+    * LTW - time elapsed since last write (ns)
+    * AVR - average delta between recent reads (ns)
+    * AVW - average delta between recent writes (ns)
+
+  These values are divided (right-shifted) according to the *_DIVIDER_POWER
+values in hot_tracking.h to bring the numbers into a reasonable range. You can
+modify these values to fit your needs. However, each heat unit is a u32 and
+thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+carefully or else they could max out or be stuck at zero quite easily.
+(E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+delta would bring the temperature above zero, ever.)
+
+  Finally, each value is added to the overall temperature between 0 and 8
+times, depending on its *_COEFF_POWER value. Note that the coefficients are
+also actually implemented with shifts, so take care to treat these values
+as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+
+    * AVR/AVW cold unit = 2^X ns of average delta
+    * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+
+  E.g., data with an average delta between 0 and 2^X ns will have a cold
+value of 0, which means a heat value equal to HEAT_MAX_VALUE.
+
+  This function is responsible for distilling the six heat
+criteria (which are described in detail in hot_tracking.h) down into a single
+temperature value for the data, which is an integer between 0
+and HEAT_MAX_VALUE.
+
+  To accomplish this, the raw values from the hot_freq structure
+are shifted in order to make the temperature calculation more
+or less sensitive to each value.
+
+  Once this calibration has happened, we do some additional normalization and
+make sure that everything fits nicely in a u32. From there, we take a very
+rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+values act as weights for the average.
+
+  Finally, we use the MAP_BITS value, which determines the size of the
+heat list array, to normalize the temperature to the proper granularity.
+
+
+5. Git Development Tree
+
+  This feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+
+  https://github.com/wuzhy/kernel.git hot_tracking
+  git://github.com/wuzhy/kernel.git hot_tracking
+
+
+6. Usage Example
+
+1.) To use hot tracking, you should mount like this:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] VFS: Turning on hot tracking
+
+2.) Retrieve hot tracking info for a specific file with ioctl().
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 09/11] VFS hot tracking, btrfs: Add hot tracking support
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (7 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 08/11] VFS hot tracking: Add documentation Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 10/11] VFS hot tracking, xfs: " Zhi Yong Wu
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce one new mount option '-o hot_track',
and add its parsing support.

Its usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/super.c | 22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)
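
The usage lines above all rely on "hot_track" being matched as a whole
option, wherever it sits in the comma-separated list. Btrfs does this through
its match_token() table and the XFS patch through strcmp() on each token. A
small userspace sketch of the same whole-token matching follows; the function
name has_hot_track() is hypothetical and the code is not taken from either
filesystem.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the comma-separated option string contains the exact
 * token "hot_track". A NULL or empty string contains no options.
 */
static bool has_hot_track(const char *options)
{
	static const char opt[] = "hot_track";
	const size_t optlen = sizeof(opt) - 1;

	while (options && *options) {
		/* length of the current token, up to the next comma */
		size_t len = strcspn(options, ",");

		if (len == optlen && strncmp(options, opt, optlen) == 0)
			return true;

		options += len;
		if (*options == ',')
			options++;
	}
	return false;
}
```

So "nouser,hot_track,loop" and "hot_track,nouser" both match, while a longer
word such as "hot_tracking" does not.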

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0506f40..b8d8982 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1990,6 +1990,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
 #define BTRFS_MOUNT_RESCAN_UUID_TREE	(1 << 23)
+#define BTRFS_MOUNT_HOT_TRACK		(1 << 24)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index e913328..69fe31d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -42,6 +42,7 @@
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
 #include <linux/btrfs.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -310,6 +311,10 @@ static void btrfs_put_super(struct super_block *sb)
 	 * last process that kept it busy.  Or segfault in the aforementioned
 	 * process...  Whom would you report that to?
 	 */
+
+	/* Hot data tracking */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+		hot_track_exit(sb);
 }
 
 enum {
@@ -323,7 +328,7 @@ enum {
 	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
 	Opt_check_integrity, Opt_check_integrity_including_extent_data,
 	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
-	Opt_commit_interval,
+	Opt_commit_interval, Opt_hot_track,
 	Opt_err,
 };
 
@@ -366,6 +371,7 @@ static match_table_t tokens = {
 	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_commit_interval, "commit=%d"},
+	{Opt_hot_track, "hot_track"},
 	{Opt_err, NULL},
 };
 
@@ -676,6 +682,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 			}
 			break;
+		case Opt_hot_track:
+			btrfs_set_opt(info->mount_opt, HOT_TRACK);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -898,11 +907,20 @@ static int btrfs_fill_super(struct super_block *sb,
 		goto fail_close;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+		err = hot_track_init(sb);
+		if (err)
+			goto fail_hot;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_hot:
+	dput(sb->s_root);
+	sb->s_root = NULL;
 fail_close:
 	close_ctree(fs_info->tree_root);
 	return err;
@@ -1014,6 +1032,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",fatal_errors=panic");
 	if (info->commit_interval != BTRFS_DEFAULT_COMMIT_INTERVAL)
 		seq_printf(seq, ",commit=%d", info->commit_interval);
+	if (btrfs_test_opt(root, HOT_TRACK))
+		seq_puts(seq, ",hot_track");
 	return 0;
 }
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 10/11] VFS hot tracking, xfs: Add hot tracking support
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (8 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 09/11] VFS hot tracking, btrfs: Add hot tracking support Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-06 13:45 ` [PATCH v6 11/11] MAINTAINERS: add the maintainers for VFS hot tracking Zhi Yong Wu
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Dave Chinner, Zhi Yong Wu

From: Dave Chinner <dchinner@redhat.com>

Connect up the VFS hot tracking support so that XFS filesystems
can make use of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/xfs/xfs_mount.h |  1 +
 fs/xfs/xfs_super.c | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 1fa0584..c6bbf31 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -184,6 +184,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC		(1ULL << 0)	/* for nfs - all metadata ops
 						   must be synchronous except
 						   for space allocations */
+#define XFS_MOUNT_HOTTRACK      (1ULL << 1)     /* hot tracking */
 #define XFS_MOUNT_WAS_CLEAN	(1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 15188cc..a2667f9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -62,6 +62,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/parser.h>
+#include <linux/hot_tracking.h>
 
 static const struct super_operations xfs_super_operations;
 static kmem_zone_t *xfs_ioend_zone;
@@ -115,6 +116,7 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_NODELAYLOG  "nodelaylog"	/* Delayed logging disabled */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
+#define MNTOPT_HOTTRACK    "hot_track"  /* hot tracking */
 
 /*
  * Table driven mount option parser.
@@ -381,6 +383,8 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+			mp->m_flags |= XFS_MOUNT_HOTTRACK;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -504,6 +508,7 @@ xfs_showargs(
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
+		{ XFS_MOUNT_HOTTRACK,		"," MNTOPT_HOTTRACK },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
@@ -1046,6 +1051,9 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+		hot_track_exit(sb);
+
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
 
@@ -1501,8 +1509,18 @@ xfs_fs_fill_super(
 		goto out_unmount;
 	}
 
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK) {
+		error = hot_track_init(sb);
+		if (error)
+			goto out_free_root;
+	}
+
 	return 0;
 
+ out_free_root:
+	dput(sb->s_root);
+	sb->s_root = NULL;
+
  out_filestream_unmount:
 	xfs_filestream_unmount(mp);
  out_free_sb:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 11/11] MAINTAINERS: add the maintainers for VFS hot tracking
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (9 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 10/11] VFS hot tracking, xfs: " Zhi Yong Wu
@ 2013-11-06 13:45 ` Zhi Yong Wu
  2013-11-11 15:43 ` [PATCH v6 00/11] " Zhi Yong Wu
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-06 13:45 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This patch adds maintainer information for VFS hot tracking
into the MAINTAINERS file.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 MAINTAINERS | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ffcaf97..49ff6cd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3387,6 +3387,18 @@ L:	linux-fsdevel@vger.kernel.org
 S:	Maintained
 F:	fs/*
 
+VFS HOT TRACKING
+M:	Zhi Yong Wu <zwu.kernel@gmail.com>
+M:	Chandra Seetharaman <sekharan@us.ibm.com>
+L:	linux-fsdevel@vger.kernel.org
+T:	git git://github.com/wuzhy/kernel.git
+S:	Maintained
+F:	Documentation/filesystems/hot_tracking.txt
+F:	fs/hot_tracking.c
+F:	fs/hot_tracking.h
+F:	include/linux/hot_tracking.h
+F:	include/uapi/linux/hot_tracking.h
+
 FINTEK F75375S HARDWARE MONITOR AND FAN CONTROLLER DRIVER
 M:	Riku Voipio <riku.voipio@iki.fi>
 L:	lm-sensors@lm-sensors.org
-- 
1.7.11.7



* Re: [PATCH v6 00/11] VFS hot tracking
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (10 preceding siblings ...)
  2013-11-06 13:45 ` [PATCH v6 11/11] MAINTAINERS: add the maintainers for VFS hot tracking Zhi Yong Wu
@ 2013-11-11 15:43 ` Zhi Yong Wu
  2013-11-13 18:33 ` Zhi Yong Wu
  2013-12-11 15:45 ` Zhi Yong Wu
  13 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-11 15:43 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel mlist, Zhi Yong Wu

Ping? Is there any plan to review this?

On Wed, Nov 6, 2013 at 9:45 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
>   The patchset is trying to introduce hot tracking function in
> VFS layer, which will keep track of real disk I/O in memory.
> By it, you will easily know more details about disk I/O, and
> then detect where disk I/O hot spots are. Also, specific FS
> can take use of it to do accurate defragment, and hot relocation
> support, etc.
>
>   Now it's time to send out its V6 for external review, and
> any comments or ideas are appreciated, thanks.
>
> NOTE:
>
>   The patchset can be obtained via my kernel dev git on github:
> git://github.com/wuzhy/kernel.git hot_tracking
>   If you're interested, you can also review them via
> https://github.com/wuzhy/kernel/commits/hot_tracking
>
>   For how to use and more other info and performance report,
> please check hot_tracking.txt in Documentation and following
> links:
>   1.) http://lwn.net/Articles/525651/
>   2.) https://lkml.org/lkml/2012/12/20/199
>
>   This patchset has been done scalability or performance tests
> by fs_mark, ffsb and compilebench.
>
>   The perf testings were done on Linux 3.12.0-rc7 with Model IBM,8231-E2C
> Big Endian PPC64 with 64 CPUs and 2 NUMA nodes, 250G RAM and 1.50 TiB
> test hard disk where each test file size is 20G or 100G.
> Architecture:          ppc64
> Byte Order:            Big Endian
> CPU(s):                64
> On-line CPU(s) list:   0-63
> Thread(s) per core:    4
> Core(s) per socket:    1
> Socket(s):             16
> NUMA node(s):          2
> Model:                 IBM,8231-E2C
> Hypervisor vendor:     pHyp
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              4096K
> NUMA node0 CPU(s):     0-31
> NUMA node1 CPU(s):     32-63
>
>   Below is the perf testing report:
>
>   Please focus on the two key points:
>   - The overall overhead which is injected by the patchset
>   - The stability of the perf results
>
> 1. fio tests
>
>                             w/o hot tracking                               w/ hot tracking
>
> RAM size                            32G          32G         16G           8G           4G           2G          250G
>
> sequential-8k-1jobs-read         61260KB/s    60918KB/s    60901KB/s    62610KB/s    60992KB/s    60213KB/s    60948KB/s
>
> sequential-8k-1jobs-write         1329KB/s     1329KB/s     1328KB/s     1329KB/s     1328KB/s     1329KB/s     1329KB/s
>
> sequential-8k-8jobs-read         91139KB/s    92614KB/s    90907KB/s    89895KB/s    92022KB/s    90851KB/s    91877KB/s
>
> sequential-8k-8jobs-write         2523KB/s     2522KB/s     2516KB/s     2521KB/s     2516KB/s     2518KB/s     2521KB/s
>
> sequential-256k-1jobs-read      151432KB/s   151403KB/s   151406KB/s   151422KB/s   151344KB/s   151446KB/s   151372KB/s
>
> sequential-256k-1jobs-write      33451KB/s    33470KB/s    33481KB/s    33470KB/s    33459KB/s    33472KB/s    33477KB/s
>
> sequential-256k-8jobs-read      235291KB/s   234555KB/s   234251KB/s   233656KB/s   234927KB/s   236380KB/s   235535KB/s
>
> sequential-256k-8jobs-write      62419KB/s    62402KB/s    62191KB/s    62859KB/s    62629KB/s    62720KB/s    62523KB/s
>
> random-io-mix-8k-1jobs  [READ]    2929KB/s     2942KB/s     2946KB/s     2929KB/s     2934KB/s     2947KB/s     2946KB/s
>                         [WRITE]   1262KB/s     1266KB/s     1257KB/s     1262KB/s     1257KB/s     1257KB/s     1265KB/s
>
> random-io-mix-8k-8jobs  [READ]    2444KB/s     2442KB/s     2436KB/s     2416KB/s     2353KB/s     2441KB/s     2442KB/s
>                         [WRITE]   1047KB/s     1044KB/s     1047KB/s     1028KB/s     1017KB/s     1034KB/s     1049KB/s
>
> random-io-mix-8k-16jobs [READ]    2182KB/s     2184KB/s     2169KB/s     2178KB/s     2190KB/s     2184KB/s     2180KB/s
>                         [WRITE]    932KB/s      930KB/s      943KB/s      936KB/s      937KB/s      929KB/s      931KB/s
>
> The above perf parameter is the aggregate bandwidth of threads in the group;
> If you hope to know how about other perf parameters, or fio raw results, please let me know, thanks.
>
> 2. Locking stat - Contention & Cacheline Bouncing
>
> RAM size         class name         con-bounces  contentions  acq-bounces   acquisitions   cacheline bouncing  locking contention
>                                                                                                  ratio              ratio
>
>               &(&root->t_lock)->rlock:  1508        1592         157834      374639292           0.96%              0.00%
> 250G          &(&root->m_lock)->rlock:  1469        1484         119221       43077842           1.23%              0.00%
>               &(&he->i_lock)->rlock:       0           0         101879      376755218           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  2912        2985         342575      374691186           0.85%              0.00%
> 32G           &(&root->m_lock)->rlock:   188         193         307765        8803163           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         291860      376756084           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  3863        3948         298041      374727038           1.30%              0.00%
> 16G           &(&root->m_lock)->rlock:   220         228         254451        8687057           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         235027      376756830           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  3283        3409         233790      374722064           1.40%              0.00%
> 8G            &(&root->m_lock)->rlock:   136         139         203917        8684313           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         193746      376756438           0.00%              0.00%
>
>               &(&root->t_lock)->rlock: 15090       15705         283460      374889666           5.32%              0.00%
> 4G            &(&root->m_lock)->rlock:   172         173         222480        8555052           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         206431      376759452           0.00%              0.00%
>
>               &(&root->t_lock)->rlock: 25515       27368         305129      375394828           8.36%              0.00%
> 2G            &(&root->m_lock)->rlock:   100         101         216516        6752265           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         214713      376765169           0.00%              0.00%
>
> 3. Perf test - Cacheline Ping-pong
>
>                       w/o hot tracking                                                        w/ hot tracking
>
> RAM size                    32G                  32G                 16G                  8G                   4G                    2G                  250G
>
> cache-references    1,264,996,437,581    1,401,504,955,577    1,398,308,614,801    1,396,525,544,527    1,384,793,467,410    1,432,042,560,409    1,571,627,148,771
>
> cache-misses           45,424,567,057       58,432,749,807       59,200,504,032       59,762,030,933       58,104,156,576       57,283,962,840       61,963,839,419
>
> seconds time elapsed  22956.327674298      23035.457069488      23017.232397085      23012.397142967      23008.420970731      23057.245578767      23342.456015188
>
> cache-misses ratio            3.591 %              4.169 %              4.234 %              4.279 %              4.196 %              4.000 %              3.943 %
>
> Changelog from v5:
>  - Also added the hook hot_freqs_update() in the page cache I/O path,
>    not only in real disk I/O path [viro]
>  - Don't export the stuff until it's used by a module [viro]
>  - Split hot_inode_item_lookup() [viro]
>  - Prevented hot items from being re-created after the inode was unlinked. [viro]
>  - Made hot_freqs_update() to be inline and adopt one private hot flag [viro]
>  - Killed hot_bit_shift() [viro]
>  - Used file_inode() instead of file->f_dentry->d_inode [viro]
>  - Introduced one new file hot_tracking.h in include/uapi/linux/ [viro]
>  - Made the checks for ->i_nlink to be protected by ->i_mutex [viro]
>
> v5:
>  - Added all kinds of perf testing report [viro]
>  - Covered mmap() now [viro]
>  - Removed list_sort() in hot_update_worker() to avoid locking contention
>    and cacheline bouncing [viro]
>  - Removed a /proc interface to control low memory usage [Chandra]
>  - Adjusted shrinker support due to the change of public shrinker APIs [zwu]
>  - Fixed the locking missing issue when hot_inode_item_put() is called
>    in ioctl_heat_info() [viro]
>  - Fixed some locking contention issues [zwu]
>
> v4:
>  - Removed debugfs support, but leave it to TODO list [viro, Chandra]
>  - Killed HOT_DELETING and HOT_IN_LIST flag [viro]
>  - Fixed unlink issues [viro]
>  - Fixed the issue on lookups (both for inode and range)
>    leak on race with unlink  [viro]
>  - Killed hot_comm_item and split the functions which take it [viro]
>  - Fixed some other issues [zwu, Chandra]
>
> v3:
>  - Added memory capping function for hot items [Zhiyong]
>  - Cleanup aging function [Zhiyong]
>
> v2:
>  - Refactored to be under RCU [Chandra Seetharaman]
>  - Merged some code changes [Chandra Seetharaman]
>  - Fixed some issues [Chandra Seetharaman]
>
> v1:
>  - Solved 64 bits inode number issue. [David Sterba]
>  - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
>  - Cleanup Some issues [David Sterba]
>  - Use a static hot debugfs root [Greg KH]
>
> rfcv4:
>  - Introduce hot func registering framework [Zhiyong]
>  - Remove global variable for hot tracking [Zhiyong]
>  - Add btrfs hot tracking support [Zhiyong]
>
> rfcv3:
>  1.) Rewritten debugfs support based seq_file operation. [Dave Chinner]
>  2.) Refactored workqueue support. [Dave Chinner]
>  3.) Turned some macros into tunables [Zhiyong, Liu Zheng]
>      TIME_TO_KICK, and HEAT_UPDATE_DELAY
>  4.) Cleaned up a lot of other issues [Dave Chinner]
>
>
> rfcv2:
>  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
>  2.) Added memory shrinker [Dave Chinner]
>  3.) Converted to one workqueue to update map info periodically [Dave Chinner]
>  4.) Cleaned up a lot of other issues [Dave Chinner]
>
> rfcv1:
>  1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
>  2.) The first three patches can probably just be flattened into one.
>                                         [Marco Stornelli , Dave Chinner]
>
>
> Dave Chinner (1):
>   VFS hot tracking, xfs: Add hot tracking support
>
> Zhi Yong Wu (10):
>   VFS hot tracking: Define basic data structures and functions
>   VFS hot tracking: Track IO and record heat information
>   VFS hot tracking: Add a workqueue to move items between hot maps
>   VFS hot tracking: Add shrinker functionality to curtail memory usage
>   VFS hot tracking: Add an ioctl to get hot tracking information
>   VFS hot tracking: Add a /proc interface to make the interval tunable
>   VFS hot tracking: Add a /proc interface to control memory usage
>   VFS hot tracking: Add documentation
>   VFS hot tracking, btrfs: Add hot tracking support
>   MAINTAINERS: add the maintainers for VFS hot tracking
>
>  Documentation/filesystems/00-INDEX         |   2 +
>  Documentation/filesystems/hot_tracking.txt | 207 ++++++++
>  MAINTAINERS                                |  12 +
>  fs/Makefile                                |   2 +-
>  fs/btrfs/ctree.h                           |   1 +
>  fs/btrfs/super.c                           |  22 +-
>  fs/compat_ioctl.c                          |   5 +
>  fs/dcache.c                                |   2 +
>  fs/hot_tracking.c                          | 816 +++++++++++++++++++++++++++++
>  fs/hot_tracking.h                          |  72 +++
>  fs/ioctl.c                                 |  71 +++
>  fs/namei.c                                 |   4 +
>  fs/xfs/xfs_mount.h                         |   1 +
>  fs/xfs/xfs_super.c                         |  18 +
>  include/linux/fs.h                         |   4 +
>  include/linux/hot_tracking.h               | 107 ++++
>  include/uapi/linux/fs.h                    |   1 +
>  include/uapi/linux/hot_tracking.h          |  33 ++
>  kernel/sysctl.c                            |  14 +
>  mm/filemap.c                               |  24 +-
>  mm/readahead.c                             |   6 +
>  21 files changed, 1420 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/filesystems/hot_tracking.txt
>  create mode 100644 fs/hot_tracking.c
>  create mode 100644 fs/hot_tracking.h
>  create mode 100644 include/linux/hot_tracking.h
>  create mode 100644 include/uapi/linux/hot_tracking.h
>
> --
> 1.7.11.7
>



-- 
Regards,

Zhi Yong Wu


* Re: [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-06 13:45 ` [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage Zhi Yong Wu
@ 2013-11-11 22:15   ` Dave Hansen
  2013-11-11 22:45     ` Zhi Yong Wu
  2013-12-11 15:44   ` Zhi Yong Wu
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2013-11-11 22:15 UTC (permalink / raw)
  To: Zhi Yong Wu, viro
  Cc: linux-fsdevel, linux-kernel, Zhi Yong Wu, Chandra Seetharaman

On 11/06/2013 05:45 AM, Zhi Yong Wu wrote:
> Introduce a /proc interface hot-mem-high-thresh and
> to cap the memory which is consumed by hot_inode_item
> and hot_range_item, and they will be in the unit of
> 1M bytes.

You don't seem to have any documentation for this, btw... :(

> +		.procname       = "hot-mem-high-thresh",

*Always* put units on these.  I know you mention it in a code comment,
but please also include it in the proc filename too.

In general, why do you have to control the number of these statically?
Shouldn't you just define a shrinker and let memory pressure determine
how many of these we allow to exist?
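
For readers following the shrinker suggestion above, here is a minimal
userspace sketch of the count/scan callback split that a shrinker-based
design would rely on. It is only a model under stated assumptions: the
names hot_cache, hot_cache_add(), hot_count_objects() and
hot_scan_objects() are hypothetical and are not taken from the posted
patches. A real kernel shrinker would also be registered with the memory
management code and would need locking, both of which are left out here.

```c
/*
 * Userspace model of a count/scan style shrinker for hot-tracking items.
 * All names are hypothetical; this is not the code from the patchset.
 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct hot_item {
	struct hot_item *next;	/* singly linked, coldest item first */
	unsigned long ino;	/* inode number the item describes */
};

struct hot_cache {
	struct hot_item *coldest;	/* head of the list */
	struct hot_item **tail;		/* where the next (hottest) item is linked */
	unsigned long nr_items;
};

static void hot_cache_init(struct hot_cache *c)
{
	c->coldest = NULL;
	c->tail = &c->coldest;
	c->nr_items = 0;
}

/* Append a new item at the hot end; returns 0 on success, -1 on failure. */
static int hot_cache_add(struct hot_cache *c, unsigned long ino)
{
	struct hot_item *it = malloc(sizeof(*it));

	if (!it)
		return -1;
	it->next = NULL;
	it->ino = ino;
	*c->tail = it;
	c->tail = &it->next;
	c->nr_items++;
	return 0;
}

/* "count" callback: how many objects could be freed right now. */
static unsigned long hot_count_objects(const struct hot_cache *c)
{
	return c->nr_items;
}

/* "scan" callback: free up to nr_to_scan coldest items, return number freed. */
static unsigned long hot_scan_objects(struct hot_cache *c,
				      unsigned long nr_to_scan)
{
	unsigned long freed = 0;

	while (freed < nr_to_scan && c->coldest) {
		struct hot_item *victim = c->coldest;

		c->coldest = victim->next;
		if (!c->coldest)
			c->tail = &c->coldest;	/* list is empty again */
		free(victim);
		c->nr_items--;
		freed++;
	}
	assert(freed <= nr_to_scan);
	return freed;
}
```

In this model no static cap is involved: the reclaim side first asks how
many items exist, then requests that some number of the coldest ones be
freed, so the cache size follows memory pressure.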


* Re: [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-11 22:15   ` Dave Hansen
@ 2013-11-11 22:45     ` Zhi Yong Wu
  2013-11-12 17:05       ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-11 22:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Al Viro, linux-fsdevel, linux-kernel mlist, Zhi Yong Wu,
	Chandra Seetharaman

On Tue, Nov 12, 2013 at 6:15 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 11/06/2013 05:45 AM, Zhi Yong Wu wrote:
>> Introduce a /proc interface hot-mem-high-thresh and
>> to cap the memory which is consumed by hot_inode_item
>> and hot_range_item, and they will be in the unit of
>> 1M bytes.
>
> You don't seem to have any documentation for this, btw... :(
>
>> +             .procname       = "hot-mem-high-thresh",
>
> *Always* put units on these.  I know you mention it in a code comment,
> but please also include it in the proc filename too.
If you think it is better, I will add it.
>
> In general, why do you have to control the number of these statically?
It gives the user or admin one optional chance to control the amount
of memory consumed by VFS hot tracking. And you can choose not to use
it.
> Shouldn't you just define a shrinker and let memory pressure determine
> how many of these we allow to exist?
What if the user or admin wants to control the amount of memory
consumed by VFS hot tracking? For example, if the host has several
hundred GB or several TB of memory, but the user or admin wants the
memory consumed by VFS hot tracking to stay under a few GB, the VFS hot
tracking shrinker may never be invoked by the memory management
subsystem, so this interface makes sense.



-- 
Regards,

Zhi Yong Wu


* Re: [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-11 22:45     ` Zhi Yong Wu
@ 2013-11-12 17:05       ` Dave Hansen
  2013-11-12 20:38         ` Zhi Yong Wu
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2013-11-12 17:05 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Al Viro, linux-fsdevel, linux-kernel mlist, Zhi Yong Wu,
	Chandra Seetharaman

On 11/11/2013 02:45 PM, Zhi Yong Wu wrote:
> On Tue, Nov 12, 2013 at 6:15 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> In general, why do you have to control the number of these statically?
> It gives the user or admin one optional chance to control the amount
> of memory consumed by VFS hot tracking. And you can choose not to use
> it.

The on/off knob seems to me to be something better left to a mount
option, not a global tunable.

>> Shouldn't you just define a shrinker and let memory pressure determine
>> how many of these we allow to exist?
> How about if the user and admin hope to control the amount of the
> memory consumed by VFS hot tracking? e.g. If the host has several
> hundred of G or T memory, but the user or admin hope that the memory
> size consumed by VFS hot tracking is under several G, In the case,
> maybe a shrinker of VFS hot tracking will never be invoked by system
> memory module, so this interface will make sense.

If the shrinker is not invoked, that means that there is lots of memory
free.  In the case that there is lots of memory free, are you arguing
that a user would rather see memory go *unused* than be put to use for
this hot tracking data?

If this were true, why don't we have similar knobs for the dentry, inode
and page caches?


* Re: [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-12 17:05       ` Dave Hansen
@ 2013-11-12 20:38         ` Zhi Yong Wu
  2013-11-12 21:02           ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-12 20:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Al Viro, linux-fsdevel, linux-kernel mlist, Zhi Yong Wu,
	Chandra Seetharaman

On Wed, Nov 13, 2013 at 1:05 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 11/11/2013 02:45 PM, Zhi Yong Wu wrote:
>> On Tue, Nov 12, 2013 at 6:15 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> In general, why do you have to control the number of these statically?
>> It gives the user or admin one optional chance to control the amount
>> of memory consumed by VFS hot tracking. And you can choose not to use
>> it.
>
> The on/off knob seems to me to be something better left to a mount
> option, not a global tunable.
If it is left to a mount option, the user or admin can't change it
*dynamically*.

>
>>> Shouldn't you just define a shrinker and let memory pressure determine
>>> how many of these we allow to exist?
>> How about if the user and admin hope to control the amount of the
>> memory consumed by VFS hot tracking? e.g. If the host has several
>> hundred of G or T memory, but the user or admin hope that the memory
>> size consumed by VFS hot tracking is under several G, In the case,
>> maybe a shrinker of VFS hot tracking will never be invoked by system
>> memory module, so this interface will make sense.
>
> If the shrinker is not invoked, that means that there is lots of memory
> free.  In the case that there is lots of memory free, are you arguing
> that a user would rather see memory go *unused* than be put to use for
> this hot tracking data?
First, users and admins have many use cases that you can't anticipate.
What if they want to make sure that the memory consumed by VFS hot
tracking doesn't affect other key applications? This only gives them
fine-grained control over the memory consumed by VFS hot tracking.
>
> If this were true, why don't we have similar knobs for the dentry, inode
> and page caches?
Aren't those controlled by the memory controller (mem_cgroup)?



-- 
Regards,

Zhi Yong Wu


* Re: [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-12 20:38         ` Zhi Yong Wu
@ 2013-11-12 21:02           ` Dave Hansen
  2013-11-12 21:56             ` Zhi Yong Wu
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2013-11-12 21:02 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Al Viro, linux-fsdevel, linux-kernel mlist, Zhi Yong Wu,
	Chandra Seetharaman

On 11/12/2013 12:38 PM, Zhi Yong Wu wrote:
> On Wed, Nov 13, 2013 at 1:05 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> The on/off knob seems to me to be something better left to a mount
>> option, not a global tunable.
> If it is left to a mount option, the user or admin can't change it
> *dynamically*.

Really?

man mount.  Look at "Mount options for tmpfs".  Try this on an existing
tmpfs mount:

	mount -o remount,size=$foo tmpfsmount

How would that be different from your tunable?

>> If this were true, why don't we have similar knobs for the dentry, inode
>> and page caches?
> This is not be controlled by memory controller(mem_cgroup)?

That's a good point.  There is a 'kmem' cgroup controller for
controlling the in-kernel structures (not page cache which is controlled
by a separate one).  I believe the 'kmem' one would (could?) apply to
the hot tracking data structures as well, which would obviate the need
for this tunable.

At least for the dentry and inode caches, they represent kernel-internal
cache structures and are the same as your hot-data-tracking structures.
 We don't have explicit /proc controls for the size of the dentry and
inode caches, so I'm arguing that we should do the same for these new
hot-data-tracking structures.
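
To make the remount point above concrete, here is a minimal userspace
sketch of how a remount handler could toggle a per-mount flag from an
option string at runtime. It is an illustration under stated
assumptions, not code from the patchset: MOUNT_HOTTRACK,
apply_mount_opts() and the "nohot_track" option are hypothetical (the
posted XFS patch only adds "hot_track"), and a real handler would also
have to start or stop the tracking machinery.

```c
/*
 * Userspace sketch of applying mount options on remount.
 * MOUNT_HOTTRACK and "nohot_track" are hypothetical names.
 */
#include <assert.h>
#include <string.h>

#define MOUNT_HOTTRACK	(1u << 0)

/*
 * Apply a comma-separated option string to *flags.
 * Returns 0 on success, or -1 if an unknown option is seen,
 * in which case *flags is left unchanged.
 */
static int apply_mount_opts(unsigned int *flags, const char *opts)
{
	unsigned int new_flags;
	const char *p;

	assert(flags && opts);
	new_flags = *flags;
	p = opts;

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of this token */

		if (len == strlen("hot_track") &&
		    !strncmp(p, "hot_track", len))
			new_flags |= MOUNT_HOTTRACK;
		else if (len == strlen("nohot_track") &&
			 !strncmp(p, "nohot_track", len))
			new_flags &= ~MOUNT_HOTTRACK;
		else if (len != 0)
			return -1;		/* unknown option */

		p += len;
		if (*p == ',')
			p++;			/* skip the separator */
	}

	*flags = new_flags;
	return 0;
}
```

Calling apply_mount_opts() a second time with a different string models
a "mount -o remount,..." invocation: the flag changes without the
filesystem being unmounted.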



* Re: [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-12 21:02           ` Dave Hansen
@ 2013-11-12 21:56             ` Zhi Yong Wu
  0 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-12 21:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Al Viro, linux-fsdevel, linux-kernel mlist, Zhi Yong Wu,
	Chandra Seetharaman

On Wed, Nov 13, 2013 at 5:02 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 11/12/2013 12:38 PM, Zhi Yong Wu wrote:
>> On Wed, Nov 13, 2013 at 1:05 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> The on/off knob seems to me to be something better left to a mount
>>> option, not a global tunable.
>> If it is left to a mount option, the user or admin can't change it
>> *dynamically*.
>
> Really?
>
> man mount.  Look at "Mount options for tmpfs".  Try this on an existing
> tmpfs mount:
>
>         mount -o remount,size=$foo tmpfsmount
>
> How would that be different from your tunable?
Is it lightweight? I thought that a remount would have more overhead and
more effect on the applications running on the filesystem.

>
>>> If this were true, why don't we have similar knobs for the dentry, inode
>>> and page caches?
>> This is not be controlled by memory controller(mem_cgroup)?
>
> That's a good point.  There is a 'kmem' cgroup controller for
> controlling the in-kernel structures (not page cache which is controlled
> by a separate one).  I believe the 'kmem' one would (could?) apply to
> the hot tracking data structures as well, which would obviate the need
> for this tunable.
>
> At least for the dentry and inode caches, they represent kernel-internal
> cache structures and are the same as your hot-data-tracking structures.
>  We don't have explicit /proc controls for the size of the dentry and
> inode caches, so I'm arguing that we should do the same for these new
> hot-data-tracking structures.
If the 'kmem' cgroup controller is applied to VFS hot tracking, do we
need to do some additional coding work in the kernel? If yes, we should
put it on the TODO list. We should first push to get the VFS hot
tracking core merged ASAP. Interfaces like this one can be developed
and improved later.
I don't know what Viro's opinion is. If he also agrees, we can put it
on the TODO list.


>



-- 
Regards,

Zhi Yong Wu


* Re: [PATCH v6 00/11] VFS hot tracking
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (11 preceding siblings ...)
  2013-11-11 15:43 ` [PATCH v6 00/11] " Zhi Yong Wu
@ 2013-11-13 18:33 ` Zhi Yong Wu
  2013-11-21 13:57   ` Zhi Yong Wu
  2013-12-11 15:45 ` Zhi Yong Wu
  13 siblings, 1 reply; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-13 18:33 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds; +Cc: linux-fsdevel, linux-kernel mlist, Zhi Yong Wu

Ping....

On Wed, Nov 6, 2013 at 9:45 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
>   The patchset is trying to introduce hot tracking function in
> VFS layer, which will keep track of real disk I/O in memory.
> By it, you will easily know more details about disk I/O, and
> then detect where disk I/O hot spots are. Also, specific FS
> can take use of it to do accurate defragment, and hot relocation
> support, etc.
>
>   Now it's time to send out its V6 for external review, and
> any comments or ideas are appreciated, thanks.
>
> NOTE:
>
>   The patchset can be obtained via my kernel dev git on github:
> git://github.com/wuzhy/kernel.git hot_tracking
>   If you're interested, you can also review them via
> https://github.com/wuzhy/kernel/commits/hot_tracking
>
>   For how to use and more other info and performance report,
> please check hot_tracking.txt in Documentation and following
> links:
>   1.) http://lwn.net/Articles/525651/
>   2.) https://lkml.org/lkml/2012/12/20/199
>
>   This patchset has been done scalability or performance tests
> by fs_mark, ffsb and compilebench.
>
>   The perf testings were done on Linux 3.12.0-rc7 with Model IBM,8231-E2C
> Big Endian PPC64 with 64 CPUs and 2 NUMA nodes, 250G RAM and 1.50 TiB
> test hard disk where each test file size is 20G or 100G.
> Architecture:          ppc64
> Byte Order:            Big Endian
> CPU(s):                64
> On-line CPU(s) list:   0-63
> Thread(s) per core:    4
> Core(s) per socket:    1
> Socket(s):             16
> NUMA node(s):          2
> Model:                 IBM,8231-E2C
> Hypervisor vendor:     pHyp
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              4096K
> NUMA node0 CPU(s):     0-31
> NUMA node1 CPU(s):     32-63
>
>   Below is the perf testing report:
>
>   Please focus on the two key points:
>   - The overall overhead which is injected by the patchset
>   - The stability of the perf results
>
> 1. fio tests
>
>                             w/o hot tracking                               w/ hot tracking
>
> RAM size                            32G          32G         16G           8G           4G           2G          250G
>
> sequential-8k-1jobs-read         61260KB/s    60918KB/s    60901KB/s    62610KB/s    60992KB/s    60213KB/s    60948KB/s
>
> sequential-8k-1jobs-write         1329KB/s     1329KB/s     1328KB/s     1329KB/s     1328KB/s     1329KB/s     1329KB/s
>
> sequential-8k-8jobs-read         91139KB/s    92614KB/s    90907KB/s    89895KB/s    92022KB/s    90851KB/s    91877KB/s
>
> sequential-8k-8jobs-write         2523KB/s     2522KB/s     2516KB/s     2521KB/s     2516KB/s     2518KB/s     2521KB/s
>
> sequential-256k-1jobs-read      151432KB/s   151403KB/s   151406KB/s   151422KB/s   151344KB/s   151446KB/s   151372KB/s
>
> sequential-256k-1jobs-write      33451KB/s    33470KB/s    33481KB/s    33470KB/s    33459KB/s    33472KB/s    33477KB/s
>
> sequential-256k-8jobs-read      235291KB/s   234555KB/s   234251KB/s   233656KB/s   234927KB/s   236380KB/s   235535KB/s
>
> sequential-256k-8jobs-write      62419KB/s    62402KB/s    62191KB/s    62859KB/s    62629KB/s    62720KB/s    62523KB/s
>
> random-io-mix-8k-1jobs  [READ]    2929KB/s     2942KB/s     2946KB/s     2929KB/s     2934KB/s     2947KB/s     2946KB/s
>                         [WRITE]   1262KB/s     1266KB/s     1257KB/s     1262KB/s     1257KB/s     1257KB/s     1265KB/s
>
> random-io-mix-8k-8jobs  [READ]    2444KB/s     2442KB/s     2436KB/s     2416KB/s     2353KB/s     2441KB/s     2442KB/s
>                         [WRITE]   1047KB/s     1044KB/s     1047KB/s     1028KB/s     1017KB/s     1034KB/s     1049KB/s
>
> random-io-mix-8k-16jobs [READ]    2182KB/s     2184KB/s     2169KB/s     2178KB/s     2190KB/s     2184KB/s     2180KB/s
>                         [WRITE]    932KB/s      930KB/s      943KB/s      936KB/s      937KB/s      929KB/s      931KB/s
>
> The above perf parameter is the aggregate bandwidth of threads in the group;
> If you hope to know how about other perf parameters, or fio raw results, please let me know, thanks.
>
> 2. Locking stat - Contention & Cacheline Bouncing
>
> RAM size         class name         con-bounces  contentions  acq-bounces   acquisitions   cacheline bouncing  locking contention
>                                                                                                  ratio              ratio
>
>               &(&root->t_lock)->rlock:  1508        1592         157834      374639292           0.96%              0.00%
> 250G          &(&root->m_lock)->rlock:  1469        1484         119221       43077842           1.23%              0.00%
>               &(&he->i_lock)->rlock:       0           0         101879      376755218           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  2912        2985         342575      374691186           0.85%              0.00%
> 32G           &(&root->m_lock)->rlock:   188         193         307765        8803163           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         291860      376756084           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  3863        3948         298041      374727038           1.30%              0.00%
> 16G           &(&root->m_lock)->rlock:   220         228         254451        8687057           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         235027      376756830           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  3283        3409         233790      374722064           1.40%              0.00%
> 8G            &(&root->m_lock)->rlock:   136         139         203917        8684313           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         193746      376756438           0.00%              0.00%
>
>               &(&root->t_lock)->rlock: 15090       15705         283460      374889666           5.32%              0.00%
> 4G            &(&root->m_lock)->rlock:   172         173         222480        8555052           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         206431      376759452           0.00%              0.00%
>
>               &(&root->t_lock)->rlock: 25515       27368         305129      375394828           8.36%              0.00%
> 2G            &(&root->m_lock)->rlock:   100         101         216516        6752265           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         214713      376765169           0.00%              0.00%
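The bouncing figures in the t_lock rows above can be reproduced with a short calculation. This is only a minimal sketch: the posting does not define the column, so it assumes the cacheline bouncing ratio is con-bounces divided by acq-bounces. The names T_LOCK_ROWS and bounce_ratios are hypothetical, and the numbers are copied from the table.

```python
# Sketch only. Assumption: cacheline bouncing ratio = con-bounces / acq-bounces.
# Rows are (RAM size, con-bounces, acq-bounces) for &(&root->t_lock)->rlock,
# copied from the lock stat table in the posting.
T_LOCK_ROWS = [
    ("250G", 1508, 157834),
    ("32G", 2912, 342575),
    ("16G", 3863, 298041),
    ("8G", 3283, 233790),
    ("4G", 15090, 283460),
    ("2G", 25515, 305129),
]


def bounce_ratios(rows):
    """Return {RAM size: ratio as a percentage string with two decimals}."""
    ratios = {}
    for ram, con_bounces, acq_bounces in rows:
        # Guard against a zero denominator for locks that were never acquired.
        pct = 100.0 * con_bounces / acq_bounces if acq_bounces else 0.0
        ratios[ram] = f"{pct:.2f}%"
    return ratios


if __name__ == "__main__":
    for ram, ratio in bounce_ratios(T_LOCK_ROWS).items():
        print(f"{ram:>5}  t_lock bouncing ratio: {ratio}")
```

Under that assumption the t_lock ratio grows as RAM shrinks, from under 1% at 250G and 32G to more than 8% at 2G.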
>
> 3. Perf test - Cacheline Ping-pong
>
>                       w/o hot tracking                                                        w/ hot tracking
>
> RAM size                    32G                  32G                 16G                  8G                   4G                    2G                  250G
>
> cache-references    1,264,996,437,581    1,401,504,955,577    1,398,308,614,801    1,396,525,544,527    1,384,793,467,410    1,432,042,560,409    1,571,627,148,771
>
> cache-misses           45,424,567,057       58,432,749,807       59,200,504,032       59,762,030,933       58,104,156,576       57,283,962,840       61,963,839,419
>
> seconds time elapsed  22956.327674298      23035.457069488      23017.232397085      23012.397142967      23008.420970731      23057.245578767      23342.456015188
>
> cache-misses ratio            3.591 %              4.169 %              4.234 %              4.279 %              4.196 %              4.000 %              3.943 %
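The "cache-misses ratio" row and the overall overhead can be checked from the raw counters in the table above. This is a minimal sketch, assuming the ratio is cache-misses divided by cache-references and that overhead is the relative increase in elapsed time over the run without hot tracking. The names RUNS, miss_ratio and elapsed_overhead are hypothetical, and only three of the seven columns are copied here.

```python
# Sketch only. Assumptions: miss ratio = cache-misses / cache-references,
# overhead = relative increase in elapsed seconds versus the baseline run.
# Values are copied from the perf table in the posting.
RUNS = {
    # label: (cache-references, cache-misses, seconds elapsed)
    "32G w/o hot tracking": (1_264_996_437_581, 45_424_567_057, 22956.327674298),
    "32G w/ hot tracking": (1_401_504_955_577, 58_432_749_807, 23035.457069488),
    "250G w/ hot tracking": (1_571_627_148_771, 61_963_839_419, 23342.456015188),
}


def miss_ratio(label):
    """Cache-miss ratio in percent for one run."""
    references, misses, _ = RUNS[label]
    return 100.0 * misses / references


def elapsed_overhead(label, baseline="32G w/o hot tracking"):
    """Relative increase in elapsed time over the baseline, in percent."""
    base_seconds = RUNS[baseline][2]
    return 100.0 * (RUNS[label][2] - base_seconds) / base_seconds


if __name__ == "__main__":
    for label in RUNS:
        print(f"{label:<22} miss ratio {miss_ratio(label):.3f} %  "
              f"elapsed overhead {elapsed_overhead(label):.2f} %")
```

Comparing the two 32G runs keeps the RAM size fixed, so the difference between them is the closest the table gets to isolating the cost of hot tracking itself.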
>
> Changelog from v5:
>  - Also added the hook hot_freqs_update() in the page cache I/O path,
>    not only in real disk I/O path [viro]
>  - Don't export the stuff until it's used by a module [viro]
>  - Split hot_inode_item_lookup() [viro]
>  - Prevented hot items from being re-created after the inode was unlinked. [viro]
>  - Made hot_freqs_update() inline and adopted one private hot flag [viro]
>  - Killed hot_bit_shift() [viro]
>  - Used file_inode() instead of file->f_dentry->d_inode [viro]
>  - Introduced one new file hot_tracking.h in include/uapi/linux/ [viro]
>  - Made the checks for ->i_nlink protected by ->i_mutex [viro]
>
> v5:
>  - Added all kinds of perf testing reports [viro]
>  - Covered mmap() now [viro]
>  - Removed list_sort() in hot_update_worker() to avoid locking contention
>    and cacheline bouncing [viro]
>  - Removed a /proc interface to control low memory usage [Chandra]
>  - Adjusted shrinker support due to the change of public shrinker APIs [zwu]
>  - Fixed the missing locking when hot_inode_item_put() is called
>    in ioctl_heat_info() [viro]
>  - Fixed some locking contention issues [zwu]
>
> v4:
>  - Removed debugfs support, but left it on the TODO list [viro, Chandra]
>  - Killed the HOT_DELETING and HOT_IN_LIST flags [viro]
>  - Fixed unlink issues [viro]
>  - Fixed the leak in lookups (both for inode and range)
>    on race with unlink [viro]
>  - Killed hot_comm_item and split the functions which take it [viro]
>  - Fixed some other issues [zwu, Chandra]
>
> v3:
>  - Added memory capping function for hot items [Zhiyong]
>  - Cleaned up aging function [Zhiyong]
>
> v2:
>  - Refactored to be under RCU [Chandra Seetharaman]
>  - Merged some code changes [Chandra Seetharaman]
>  - Fixed some issues [Chandra Seetharaman]
>
> v1:
>  - Solved 64-bit inode number issue. [David Sterba]
>  - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
>  - Cleaned up some issues [David Sterba]
>  - Use a static hot debugfs root [Greg KH]
>
> rfcv4:
>  - Introduce hot func registering framework [Zhiyong]
>  - Remove global variable for hot tracking [Zhiyong]
>  - Add btrfs hot tracking support [Zhiyong]
>
> rfcv3:
>  1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
>  2.) Refactored workqueue support. [Dave Chinner]
>  3.) Turned some macros into tunables [Zhiyong, Liu Zheng]
>      TIME_TO_KICK, and HEAT_UPDATE_DELAY
>  4.) Cleaned up a lot of other issues [Dave Chinner]
>
>
> rfcv2:
>  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
>  2.) Added memory shrinker [Dave Chinner]
>  3.) Converted to one workqueue to update map info periodically [Dave Chinner]
>  4.) Cleaned up a lot of other issues [Dave Chinner]
>
> rfcv1:
>  1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
>  2.) The first three patches can probably just be flattened into one.
>                                         [Marco Stornelli, Dave Chinner]
>
>
> Dave Chinner (1):
>   VFS hot tracking, xfs: Add hot tracking support
>
> Zhi Yong Wu (10):
>   VFS hot tracking: Define basic data structures and functions
>   VFS hot tracking: Track IO and record heat information
>   VFS hot tracking: Add a workqueue to move items between hot maps
>   VFS hot tracking: Add shrinker functionality to curtail memory usage
>   VFS hot tracking: Add an ioctl to get hot tracking information
>   VFS hot tracking: Add a /proc interface to make the interval tunable
>   VFS hot tracking: Add a /proc interface to control memory usage
>   VFS hot tracking: Add documentation
>   VFS hot tracking, btrfs: Add hot tracking support
>   MAINTAINERS: add the maintainers for VFS hot tracking
>
>  Documentation/filesystems/00-INDEX         |   2 +
>  Documentation/filesystems/hot_tracking.txt | 207 ++++++++
>  MAINTAINERS                                |  12 +
>  fs/Makefile                                |   2 +-
>  fs/btrfs/ctree.h                           |   1 +
>  fs/btrfs/super.c                           |  22 +-
>  fs/compat_ioctl.c                          |   5 +
>  fs/dcache.c                                |   2 +
>  fs/hot_tracking.c                          | 816 +++++++++++++++++++++++++++++
>  fs/hot_tracking.h                          |  72 +++
>  fs/ioctl.c                                 |  71 +++
>  fs/namei.c                                 |   4 +
>  fs/xfs/xfs_mount.h                         |   1 +
>  fs/xfs/xfs_super.c                         |  18 +
>  include/linux/fs.h                         |   4 +
>  include/linux/hot_tracking.h               | 107 ++++
>  include/uapi/linux/fs.h                    |   1 +
>  include/uapi/linux/hot_tracking.h          |  33 ++
>  kernel/sysctl.c                            |  14 +
>  mm/filemap.c                               |  24 +-
>  mm/readahead.c                             |   6 +
>  21 files changed, 1420 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/filesystems/hot_tracking.txt
>  create mode 100644 fs/hot_tracking.c
>  create mode 100644 fs/hot_tracking.h
>  create mode 100644 include/linux/hot_tracking.h
>  create mode 100644 include/uapi/linux/hot_tracking.h
>
> --
> 1.7.11.7
>



-- 
Regards,

Zhi Yong Wu


* Re: [PATCH v6 00/11] VFS hot tracking
  2013-11-13 18:33 ` Zhi Yong Wu
@ 2013-11-21 13:57   ` Zhi Yong Wu
  2013-11-30  9:55     ` Zhi Yong Wu
  0 siblings, 1 reply; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-21 13:57 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds; +Cc: linux-fsdevel, linux-kernel mlist, Zhi Yong Wu

Hi, Maintainers

Ping again....

On Thu, Nov 14, 2013 at 2:33 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> Ping....



-- 
Regards,

Zhi Yong Wu


* Re: [PATCH v6 00/11] VFS hot tracking
  2013-11-21 13:57   ` Zhi Yong Wu
@ 2013-11-30  9:55     ` Zhi Yong Wu
  2013-12-03 20:16       ` Zhi Yong Wu
  0 siblings, 1 reply; 31+ messages in thread
From: Zhi Yong Wu @ 2013-11-30  9:55 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds; +Cc: linux-fsdevel, linux-kernel mlist, Zhi Yong Wu

Hi,

Ping again....

On Thu, Nov 21, 2013 at 9:57 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> Hi, Maintainers
>
> Ping again....
>
> On Thu, Nov 14, 2013 at 2:33 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> Ping....
>>
>> On Wed, Nov 6, 2013 at 9:45 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>
>>>   The patchset is trying to introduce hot tracking function in
>>> VFS layer, which will keep track of real disk I/O in memory.
>>> By it, you will easily know more details about disk I/O, and
>>> then detect where disk I/O hot spots are. Also, specific FS
>>> can take use of it to do accurate defragment, and hot relocation
>>> support, etc.
>>>
>>>   Now it's time to send out its V6 for external review, and
>>> any comments or ideas are appreciated, thanks.
>>>
>>> NOTE:
>>>
>>>   The patchset can be obtained via my kernel dev git on github:
>>> git://github.com/wuzhy/kernel.git hot_tracking
>>>   If you're interested, you can also review them via
>>> https://github.com/wuzhy/kernel/commits/hot_tracking
>>>
>>>   For how to use and more other info and performance report,
>>> please check hot_tracking.txt in Documentation and following
>>> links:
>>>   1.) http://lwn.net/Articles/525651/
>>>   2.) https://lkml.org/lkml/2012/12/20/199
>>>
>>>   This patchset has been done scalability or performance tests
>>> by fs_mark, ffsb and compilebench.
>>>
>>>   The perf testings were done on Linux 3.12.0-rc7 with Model IBM,8231-E2C
>>> Big Endian PPC64 with 64 CPUs and 2 NUMA nodes, 250G RAM and 1.50 TiB
>>> test hard disk where each test file size is 20G or 100G.
>>> Architecture:          ppc64
>>> Byte Order:            Big Endian
>>> CPU(s):                64
>>> On-line CPU(s) list:   0-63
>>> Thread(s) per core:    4
>>> Core(s) per socket:    1
>>> Socket(s):             16
>>> NUMA node(s):          2
>>> Model:                 IBM,8231-E2C
>>> Hypervisor vendor:     pHyp
>>> Virtualization type:   full
>>> L1d cache:             32K
>>> L1i cache:             32K
>>> L2 cache:              256K
>>> L3 cache:              4096K
>>> NUMA node0 CPU(s):     0-31
>>> NUMA node1 CPU(s):     32-63
>>>
>>>   Below is the perf testing report:
>>>
>>>   Please focus on the two key points:
>>>   - The overall overhead which is injected by the patchset
>>>   - The stability of the perf results
>>>
>>> 1. fio tests
>>>
>>>                             w/o hot tracking                               w/ hot tracking
>>>
>>> RAM size                            32G          32G         16G           8G           4G           2G          250G
>>>
>>> sequential-8k-1jobs-read         61260KB/s    60918KB/s    60901KB/s    62610KB/s    60992KB/s    60213KB/s    60948KB/s
>>>
>>> sequential-8k-1jobs-write         1329KB/s     1329KB/s     1328KB/s     1329KB/s     1328KB/s     1329KB/s     1329KB/s
>>>
>>> sequential-8k-8jobs-read         91139KB/s    92614KB/s    90907KB/s    89895KB/s    92022KB/s    90851KB/s    91877KB/s
>>>
>>> sequential-8k-8jobs-write         2523KB/s     2522KB/s     2516KB/s     2521KB/s     2516KB/s     2518KB/s     2521KB/s
>>>
>>> sequential-256k-1jobs-read      151432KB/s   151403KB/s   151406KB/s   151422KB/s   151344KB/s   151446KB/s   151372KB/s
>>>
>>> sequential-256k-1jobs-write      33451KB/s    33470KB/s    33481KB/s    33470KB/s    33459KB/s    33472KB/s    33477KB/s
>>>
>>> sequential-256k-8jobs-read      235291KB/s   234555KB/s   234251KB/s   233656KB/s   234927KB/s   236380KB/s   235535KB/s
>>>
>>> sequential-256k-8jobs-write      62419KB/s    62402KB/s    62191KB/s    62859KB/s    62629KB/s    62720KB/s    62523KB/s
>>>
>>> random-io-mix-8k-1jobs  [READ]    2929KB/s     2942KB/s     2946KB/s     2929KB/s     2934KB/s     2947KB/s     2946KB/s
>>>                         [WRITE]   1262KB/s     1266KB/s     1257KB/s     1262KB/s     1257KB/s     1257KB/s     1265KB/s
>>>
>>> random-io-mix-8k-8jobs  [READ]    2444KB/s     2442KB/s     2436KB/s     2416KB/s     2353KB/s     2441KB/s     2442KB/s
>>>                         [WRITE]   1047KB/s     1044KB/s     1047KB/s     1028KB/s     1017KB/s     1034KB/s     1049KB/s
>>>
>>> random-io-mix-8k-16jobs [READ]    2182KB/s     2184KB/s     2169KB/s     2178KB/s     2190KB/s     2184KB/s     2180KB/s
>>>                         [WRITE]    932KB/s      930KB/s      943KB/s      936KB/s      937KB/s      929KB/s      931KB/s
>>>
>>> The figures above are the aggregate bandwidth of the threads in each group.
>>> If you would like other perf parameters or the raw fio results, please let me know, thanks.
>>>
>>> 2. Locking stat - Contention & Cacheline Bouncing
>>>
>>> RAM size         class name         con-bounces  contentions  acq-bounces   acquisitions   cacheline bouncing  locking contention
>>>                                                                                                  ratio              ratio
>>>
>>>               &(&root->t_lock)->rlock:  1508        1592         157834      374639292           0.96%              0.00%
>>> 250G          &(&root->m_lock)->rlock:  1469        1484         119221       43077842           1.23%              0.00%
>>>               &(&he->i_lock)->rlock:       0           0         101879      376755218           0.00%              0.00%
>>>
>>>               &(&root->t_lock)->rlock:  2912        2985         342575      374691186           0.85%              0.00%
>>> 32G           &(&root->m_lock)->rlock:   188         193         307765        8803163           0.00%              0.00%
>>>               &(&he->i_lock)->rlock:       0           0         291860      376756084           0.00%              0.00%
>>>
>>>               &(&root->t_lock)->rlock:  3863        3948         298041      374727038           1.30%              0.00%
>>> 16G           &(&root->m_lock)->rlock:   220         228         254451        8687057           0.00%              0.00%
>>>               &(&he->i_lock)->rlock:       0           0         235027      376756830           0.00%              0.00%
>>>
>>>               &(&root->t_lock)->rlock:  3283        3409         233790      374722064           1.40%              0.00%
>>> 8G            &(&root->m_lock)->rlock:   136         139         203917        8684313           0.00%              0.00%
>>>               &(&he->i_lock)->rlock:       0           0         193746      376756438           0.00%              0.00%
>>>
>>>               &(&root->t_lock)->rlock: 15090       15705         283460      374889666           5.32%              0.00%
>>> 4G            &(&root->m_lock)->rlock:   172         173         222480        8555052           0.00%              0.00%
>>>               &(&he->i_lock)->rlock:       0           0         206431      376759452           0.00%              0.00%
>>>
>>>               &(&root->t_lock)->rlock: 25515       27368         305129      375394828           8.36%              0.00%
>>> 2G            &(&root->m_lock)->rlock:   100         101         216516        6752265           0.00%              0.00%
>>>               &(&he->i_lock)->rlock:       0           0         214713      376765169           0.00%              0.00%
>>>
>>> 3. Perf test - Cacheline Ping-pong
>>>
>>>                       w/o hot tracking                                                        w/ hot tracking
>>>
>>> RAM size                    32G                  32G                 16G                  8G                   4G                    2G                  250G
>>>
>>> cache-references    1,264,996,437,581    1,401,504,955,577    1,398,308,614,801    1,396,525,544,527    1,384,793,467,410    1,432,042,560,409    1,571,627,148,771
>>>
>>> cache-misses           45,424,567,057       58,432,749,807       59,200,504,032       59,762,030,933       58,104,156,576       57,283,962,840       61,963,839,419
>>>
>>> seconds time elapsed  22956.327674298      23035.457069488      23017.232397085      23012.397142967      23008.420970731      23057.245578767      23342.456015188
>>>
>>> cache-misses ratio            3.591 %              4.169 %              4.234 %              4.279 %              4.196 %              4.000 %              3.943 %
>>>
>>> Changelog from v5:
>>>  - Also added the hook hot_freqs_update() in the page cache I/O path,
>>>    not only in real disk I/O path [viro]
>>>  - Don't export the stuff until it's used by a module [viro]
>>>  - Split hot_inode_item_lookup() [viro]
>>>  - Prevented hot items from being re-created after the inode was unlinked. [viro]
>>>  - Made hot_freqs_update() inline and adopted one private hot flag [viro]
>>>  - Killed hot_bit_shift() [viro]
>>>  - Used file_inode() instead of file->f_dentry->d_inode [viro]
>>>  - Introduced one new file hot_tracking.h in include/uapi/linux/ [viro]
>>>  - Made the checks for ->i_nlink protected by ->i_mutex [viro]
>>>
>>> v5:
>>>  - Added all kinds of perf testing reports [viro]
>>>  - Covered mmap() now [viro]
>>>  - Removed list_sort() in hot_update_worker() to avoid locking contention
>>>    and cacheline bouncing [viro]
>>>  - Removed a /proc interface to control low memory usage [Chandra]
>>>  - Adjusted shrinker support due to the change of public shrinker APIs [zwu]
>>>  - Fixed the missing locking when hot_inode_item_put() is called
>>>    in ioctl_heat_info() [viro]
>>>  - Fixed some locking contention issues [zwu]
>>>
>>> v4:
>>>  - Removed debugfs support, but left it on the TODO list [viro, Chandra]
>>>  - Killed the HOT_DELETING and HOT_IN_LIST flags [viro]
>>>  - Fixed unlink issues [viro]
>>>  - Fixed the leak in lookups (both for inode and range)
>>>    on race with unlink [viro]
>>>  - Killed hot_comm_item and split the functions which take it [viro]
>>>  - Fixed some other issues [zwu, Chandra]
>>>
>>> v3:
>>>  - Added a memory capping function for hot items [Zhiyong]
>>>  - Cleaned up the aging function [Zhiyong]
>>>
>>> v2:
>>>  - Refactored to be under RCU [Chandra Seetharaman]
>>>  - Merged some code changes [Chandra Seetharaman]
>>>  - Fixed some issues [Chandra Seetharaman]
>>>
>>> v1:
>>>  - Solved the 64-bit inode number issue [David Sterba]
>>>  - Embedded struct hot_type in struct file_system_type [Darrick J. Wong]
>>>  - Cleaned up some issues [David Sterba]
>>>  - Use a static hot debugfs root [Greg KH]
>>>
>>> rfcv4:
>>>  - Introduce hot func registering framework [Zhiyong]
>>>  - Remove global variable for hot tracking [Zhiyong]
>>>  - Add btrfs hot tracking support [Zhiyong]
>>>
>>> rfcv3:
>>>  1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
>>>  2.) Refactored workqueue support. [Dave Chinner]
>>>  3.) Made some macros tunable [Zhiyong, Liu Zheng]:
>>>      TIME_TO_KICK and HEAT_UPDATE_DELAY
>>>  4.) Cleaned up a lot of other issues [Dave Chinner]
>>>
>>>
>>> rfcv2:
>>>  1.) Converted to radix trees instead of RB-trees [Zhiyong, Dave Chinner]
>>>  2.) Added memory shrinker [Dave Chinner]
>>>  3.) Converted to one workqueue to update map info periodically [Dave Chinner]
>>>  4.) Cleaned up a lot of other issues [Dave Chinner]
>>>
>>> rfcv1:
>>>  1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
>>>  2.) The first three patches can probably just be flattened into one.
>>>                                         [Marco Stornelli, Dave Chinner]
>>>
>>>
>>> Dave Chinner (1):
>>>   VFS hot tracking, xfs: Add hot tracking support
>>>
>>> Zhi Yong Wu (10):
>>>   VFS hot tracking: Define basic data structures and functions
>>>   VFS hot tracking: Track IO and record heat information
>>>   VFS hot tracking: Add a workqueue to move items between hot maps
>>>   VFS hot tracking: Add shrinker functionality to curtail memory usage
>>>   VFS hot tracking: Add an ioctl to get hot tracking information
>>>   VFS hot tracking: Add a /proc interface to make the interval tunable
>>>   VFS hot tracking: Add a /proc interface to control memory usage
>>>   VFS hot tracking: Add documentation
>>>   VFS hot tracking, btrfs: Add hot tracking support
>>>   MAINTAINERS: add the maintainers for VFS hot tracking
>>>
>>>  Documentation/filesystems/00-INDEX         |   2 +
>>>  Documentation/filesystems/hot_tracking.txt | 207 ++++++++
>>>  MAINTAINERS                                |  12 +
>>>  fs/Makefile                                |   2 +-
>>>  fs/btrfs/ctree.h                           |   1 +
>>>  fs/btrfs/super.c                           |  22 +-
>>>  fs/compat_ioctl.c                          |   5 +
>>>  fs/dcache.c                                |   2 +
>>>  fs/hot_tracking.c                          | 816 +++++++++++++++++++++++++++++
>>>  fs/hot_tracking.h                          |  72 +++
>>>  fs/ioctl.c                                 |  71 +++
>>>  fs/namei.c                                 |   4 +
>>>  fs/xfs/xfs_mount.h                         |   1 +
>>>  fs/xfs/xfs_super.c                         |  18 +
>>>  include/linux/fs.h                         |   4 +
>>>  include/linux/hot_tracking.h               | 107 ++++
>>>  include/uapi/linux/fs.h                    |   1 +
>>>  include/uapi/linux/hot_tracking.h          |  33 ++
>>>  kernel/sysctl.c                            |  14 +
>>>  mm/filemap.c                               |  24 +-
>>>  mm/readahead.c                             |   6 +
>>>  21 files changed, 1420 insertions(+), 4 deletions(-)
>>>  create mode 100644 Documentation/filesystems/hot_tracking.txt
>>>  create mode 100644 fs/hot_tracking.c
>>>  create mode 100644 fs/hot_tracking.h
>>>  create mode 100644 include/linux/hot_tracking.h
>>>  create mode 100644 include/uapi/linux/hot_tracking.h
>>>
>>> --
>>> 1.7.11.7
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>>
>>
>> --
>> Regards,
>>
>> Zhi Yong Wu
>
>
>
> --
> Regards,
>
> Zhi Yong Wu



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 00/11] VFS hot tracking
  2013-11-30  9:55     ` Zhi Yong Wu
@ 2013-12-03 20:16       ` Zhi Yong Wu
  0 siblings, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-12-03 20:16 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds; +Cc: linux-fsdevel, linux-kernel mlist, Zhi Yong Wu

Ping 6,

Is there any reason why this patchset has not been reviewed so far?
If there are no comments, please merge it. Please don't force me to
be impolite, thanks.

On Sat, Nov 30, 2013 at 5:55 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> Hi,
>
> Ping again....
>
> On Thu, Nov 21, 2013 at 9:57 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> Hi, Maintainers
>>
>> Ping again....
>>
>> On Thu, Nov 14, 2013 at 2:33 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>>> Ping....
>>>
>>> On Wed, Nov 6, 2013 at 9:45 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>>>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>>
>>>>   This patchset introduces a hot tracking function in the VFS
>>>> layer, which keeps track of real disk I/O in memory. With it,
>>>> you can easily see more detail about disk I/O and detect where
>>>> the disk I/O hot spots are. Specific filesystems can also make
>>>> use of it for accurate defragmentation, hot relocation
>>>> support, etc.
>>>>
>>>>   V6 is now ready for external review; any comments or ideas
>>>> are appreciated, thanks.
>>>>
>>>> NOTE:
>>>>
>>>>   The patchset can be obtained via my kernel dev git on github:
>>>> git://github.com/wuzhy/kernel.git hot_tracking
>>>>   If you're interested, you can also review them via
>>>> https://github.com/wuzhy/kernel/commits/hot_tracking
>>>>
>>>>   For usage, further information and the performance report,
>>>> please see hot_tracking.txt in Documentation and the following
>>>> links:
>>>>   1.) http://lwn.net/Articles/525651/
>>>>   2.) https://lkml.org/lkml/2012/12/20/199
>>>>
>>>>   Scalability and performance tests were run on this patchset
>>>> with fs_mark, ffsb and compilebench.
>>>>
>>>>   The perf tests were run on Linux 3.12.0-rc7 on a big-endian PPC64
>>>> machine (model IBM,8231-E2C) with 64 CPUs, 2 NUMA nodes, 250G RAM and a
>>>> 1.50 TiB test hard disk; each test file is 20G or 100G in size.
>>>> Architecture:          ppc64
>>>> Byte Order:            Big Endian
>>>> CPU(s):                64
>>>> On-line CPU(s) list:   0-63
>>>> Thread(s) per core:    4
>>>> Core(s) per socket:    1
>>>> Socket(s):             16
>>>> NUMA node(s):          2
>>>> Model:                 IBM,8231-E2C
>>>> Hypervisor vendor:     pHyp
>>>> Virtualization type:   full
>>>> L1d cache:             32K
>>>> L1i cache:             32K
>>>> L2 cache:              256K
>>>> L3 cache:              4096K
>>>> NUMA node0 CPU(s):     0-31
>>>> NUMA node1 CPU(s):     32-63
>>>>
>>>>   Below is the perf testing report:
>>>>
>>>>   Please focus on the two key points:
>>>>   - The overall overhead which is injected by the patchset
>>>>   - The stability of the perf results
>>>>
>>>> 1. fio tests
>>>>
>>>>                             w/o hot tracking                               w/ hot tracking
>>>>
>>>> RAM size                            32G          32G         16G           8G           4G           2G          250G
>>>>
>>>> sequential-8k-1jobs-read         61260KB/s    60918KB/s    60901KB/s    62610KB/s    60992KB/s    60213KB/s    60948KB/s
>>>>
>>>> sequential-8k-1jobs-write         1329KB/s     1329KB/s     1328KB/s     1329KB/s     1328KB/s     1329KB/s     1329KB/s
>>>>
>>>> sequential-8k-8jobs-read         91139KB/s    92614KB/s    90907KB/s    89895KB/s    92022KB/s    90851KB/s    91877KB/s
>>>>
>>>> sequential-8k-8jobs-write         2523KB/s     2522KB/s     2516KB/s     2521KB/s     2516KB/s     2518KB/s     2521KB/s
>>>>
>>>> sequential-256k-1jobs-read      151432KB/s   151403KB/s   151406KB/s   151422KB/s   151344KB/s   151446KB/s   151372KB/s
>>>>
>>>> sequential-256k-1jobs-write      33451KB/s    33470KB/s    33481KB/s    33470KB/s    33459KB/s    33472KB/s    33477KB/s
>>>>
>>>> sequential-256k-8jobs-read      235291KB/s   234555KB/s   234251KB/s   233656KB/s   234927KB/s   236380KB/s   235535KB/s
>>>>
>>>> sequential-256k-8jobs-write      62419KB/s    62402KB/s    62191KB/s    62859KB/s    62629KB/s    62720KB/s    62523KB/s
>>>>
>>>> random-io-mix-8k-1jobs  [READ]    2929KB/s     2942KB/s     2946KB/s     2929KB/s     2934KB/s     2947KB/s     2946KB/s
>>>>                         [WRITE]   1262KB/s     1266KB/s     1257KB/s     1262KB/s     1257KB/s     1257KB/s     1265KB/s
>>>>
>>>> random-io-mix-8k-8jobs  [READ]    2444KB/s     2442KB/s     2436KB/s     2416KB/s     2353KB/s     2441KB/s     2442KB/s
>>>>                         [WRITE]   1047KB/s     1044KB/s     1047KB/s     1028KB/s     1017KB/s     1034KB/s     1049KB/s
>>>>
>>>> random-io-mix-8k-16jobs [READ]    2182KB/s     2184KB/s     2169KB/s     2178KB/s     2190KB/s     2184KB/s     2180KB/s
>>>>                         [WRITE]    932KB/s      930KB/s      943KB/s      936KB/s      937KB/s      929KB/s      931KB/s
>>>>
>>>> The figures above are the aggregate bandwidth of the threads in each group.
>>>> If you would like other perf parameters or the raw fio results, please let me know, thanks.
>>>>
>>>> 2. Locking stat - Contention & Cacheline Bouncing
>>>>
>>>> RAM size         class name         con-bounces  contentions  acq-bounces   acquisitions   cacheline bouncing  locking contention
>>>>                                                                                                  ratio              ratio
>>>>
>>>>               &(&root->t_lock)->rlock:  1508        1592         157834      374639292           0.96%              0.00%
>>>> 250G          &(&root->m_lock)->rlock:  1469        1484         119221       43077842           1.23%              0.00%
>>>>               &(&he->i_lock)->rlock:       0           0         101879      376755218           0.00%              0.00%
>>>>
>>>>               &(&root->t_lock)->rlock:  2912        2985         342575      374691186           0.85%              0.00%
>>>> 32G           &(&root->m_lock)->rlock:   188         193         307765        8803163           0.00%              0.00%
>>>>               &(&he->i_lock)->rlock:       0           0         291860      376756084           0.00%              0.00%
>>>>
>>>>               &(&root->t_lock)->rlock:  3863        3948         298041      374727038           1.30%              0.00%
>>>> 16G           &(&root->m_lock)->rlock:   220         228         254451        8687057           0.00%              0.00%
>>>>               &(&he->i_lock)->rlock:       0           0         235027      376756830           0.00%              0.00%
>>>>
>>>>               &(&root->t_lock)->rlock:  3283        3409         233790      374722064           1.40%              0.00%
>>>> 8G            &(&root->m_lock)->rlock:   136         139         203917        8684313           0.00%              0.00%
>>>>               &(&he->i_lock)->rlock:       0           0         193746      376756438           0.00%              0.00%
>>>>
>>>>               &(&root->t_lock)->rlock: 15090       15705         283460      374889666           5.32%              0.00%
>>>> 4G            &(&root->m_lock)->rlock:   172         173         222480        8555052           0.00%              0.00%
>>>>               &(&he->i_lock)->rlock:       0           0         206431      376759452           0.00%              0.00%
>>>>
>>>>               &(&root->t_lock)->rlock: 25515       27368         305129      375394828           8.36%              0.00%
>>>> 2G            &(&root->m_lock)->rlock:   100         101         216516        6752265           0.00%              0.00%
>>>>               &(&he->i_lock)->rlock:       0           0         214713      376765169           0.00%              0.00%
>>>>
>>>> 3. Perf test - Cacheline Ping-pong
>>>>
>>>>                       w/o hot tracking                                                        w/ hot tracking
>>>>
>>>> RAM size                    32G                  32G                 16G                  8G                   4G                    2G                  250G
>>>>
>>>> cache-references    1,264,996,437,581    1,401,504,955,577    1,398,308,614,801    1,396,525,544,527    1,384,793,467,410    1,432,042,560,409    1,571,627,148,771
>>>>
>>>> cache-misses           45,424,567,057       58,432,749,807       59,200,504,032       59,762,030,933       58,104,156,576       57,283,962,840       61,963,839,419
>>>>
>>>> seconds time elapsed  22956.327674298      23035.457069488      23017.232397085      23012.397142967      23008.420970731      23057.245578767      23342.456015188
>>>>
>>>> cache-misses ratio            3.591 %              4.169 %              4.234 %              4.279 %              4.196 %              4.000 %              3.943 %
>>>>
>>>> Changelog from v5:
>>>>  - Added the hot_freqs_update() hook to the page cache I/O path as well,
>>>>    not only the real disk I/O path [viro]
>>>>  - Stopped exporting symbols until they are used by a module [viro]
>>>>  - Split hot_inode_item_lookup() [viro]
>>>>  - Prevented hot items from being re-created after the inode was unlinked [viro]
>>>>  - Made hot_freqs_update() inline and adopted one private hot flag [viro]
>>>>  - Killed hot_bit_shift() [viro]
>>>>  - Used file_inode() instead of file->f_dentry->d_inode [viro]
>>>>  - Introduced one new file, hot_tracking.h, in include/uapi/linux/ [viro]
>>>>  - Protected the checks for ->i_nlink with ->i_mutex [viro]
>>>>
>>>> v5:
>>>>  - Added a range of perf testing reports [viro]
>>>>  - Covered mmap() as well [viro]
>>>>  - Removed list_sort() in hot_update_worker() to avoid locking contention
>>>>    and cacheline bouncing [viro]
>>>>  - Removed a /proc interface to control low memory usage [Chandra]
>>>>  - Adjusted shrinker support due to the change of public shrinker APIs [zwu]
>>>>  - Fixed the missing locking when hot_inode_item_put() is called
>>>>    in ioctl_heat_info() [viro]
>>>>  - Fixed some locking contention issues [zwu]
>>>>
>>>> v4:
>>>>  - Removed debugfs support and left it on the TODO list [viro, Chandra]
>>>>  - Killed the HOT_DELETING and HOT_IN_LIST flags [viro]
>>>>  - Fixed unlink issues [viro]
>>>>  - Fixed lookups (both for inode and range) leaking
>>>>    on a race with unlink [viro]
>>>>  - Killed hot_comm_item and split the functions which take it [viro]
>>>>  - Fixed some other issues [zwu, Chandra]
>>>>
>>>> v3:
>>>>  - Added a memory capping function for hot items [Zhiyong]
>>>>  - Cleaned up the aging function [Zhiyong]
>>>>
>>>> v2:
>>>>  - Refactored to be under RCU [Chandra Seetharaman]
>>>>  - Merged some code changes [Chandra Seetharaman]
>>>>  - Fixed some issues [Chandra Seetharaman]
>>>>
>>>> v1:
>>>>  - Solved the 64-bit inode number issue [David Sterba]
>>>>  - Embedded struct hot_type in struct file_system_type [Darrick J. Wong]
>>>>  - Cleaned up some issues [David Sterba]
>>>>  - Use a static hot debugfs root [Greg KH]
>>>>
>>>> rfcv4:
>>>>  - Introduce hot func registering framework [Zhiyong]
>>>>  - Remove global variable for hot tracking [Zhiyong]
>>>>  - Add btrfs hot tracking support [Zhiyong]
>>>>
>>>> rfcv3:
>>>>  1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
>>>>  2.) Refactored workqueue support. [Dave Chinner]
>>>>  3.) Made some macros tunable [Zhiyong, Liu Zheng]:
>>>>      TIME_TO_KICK and HEAT_UPDATE_DELAY
>>>>  4.) Cleaned up a lot of other issues [Dave Chinner]
>>>>
>>>>
>>>> rfcv2:
>>>>  1.) Converted to radix trees instead of RB-trees [Zhiyong, Dave Chinner]
>>>>  2.) Added memory shrinker [Dave Chinner]
>>>>  3.) Converted to one workqueue to update map info periodically [Dave Chinner]
>>>>  4.) Cleaned up a lot of other issues [Dave Chinner]
>>>>
>>>> rfcv1:
>>>>  1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
>>>>  2.) The first three patches can probably just be flattened into one.
>>>>                                         [Marco Stornelli, Dave Chinner]
>>>>
>>>>
>>>> Dave Chinner (1):
>>>>   VFS hot tracking, xfs: Add hot tracking support
>>>>
>>>> Zhi Yong Wu (10):
>>>>   VFS hot tracking: Define basic data structures and functions
>>>>   VFS hot tracking: Track IO and record heat information
>>>>   VFS hot tracking: Add a workqueue to move items between hot maps
>>>>   VFS hot tracking: Add shrinker functionality to curtail memory usage
>>>>   VFS hot tracking: Add an ioctl to get hot tracking information
>>>>   VFS hot tracking: Add a /proc interface to make the interval tunable
>>>>   VFS hot tracking: Add a /proc interface to control memory usage
>>>>   VFS hot tracking: Add documentation
>>>>   VFS hot tracking, btrfs: Add hot tracking support
>>>>   MAINTAINERS: add the maintainers for VFS hot tracking
>>>>
>>>>  Documentation/filesystems/00-INDEX         |   2 +
>>>>  Documentation/filesystems/hot_tracking.txt | 207 ++++++++
>>>>  MAINTAINERS                                |  12 +
>>>>  fs/Makefile                                |   2 +-
>>>>  fs/btrfs/ctree.h                           |   1 +
>>>>  fs/btrfs/super.c                           |  22 +-
>>>>  fs/compat_ioctl.c                          |   5 +
>>>>  fs/dcache.c                                |   2 +
>>>>  fs/hot_tracking.c                          | 816 +++++++++++++++++++++++++++++
>>>>  fs/hot_tracking.h                          |  72 +++
>>>>  fs/ioctl.c                                 |  71 +++
>>>>  fs/namei.c                                 |   4 +
>>>>  fs/xfs/xfs_mount.h                         |   1 +
>>>>  fs/xfs/xfs_super.c                         |  18 +
>>>>  include/linux/fs.h                         |   4 +
>>>>  include/linux/hot_tracking.h               | 107 ++++
>>>>  include/uapi/linux/fs.h                    |   1 +
>>>>  include/uapi/linux/hot_tracking.h          |  33 ++
>>>>  kernel/sysctl.c                            |  14 +
>>>>  mm/filemap.c                               |  24 +-
>>>>  mm/readahead.c                             |   6 +
>>>>  21 files changed, 1420 insertions(+), 4 deletions(-)
>>>>  create mode 100644 Documentation/filesystems/hot_tracking.txt
>>>>  create mode 100644 fs/hot_tracking.c
>>>>  create mode 100644 fs/hot_tracking.h
>>>>  create mode 100644 include/linux/hot_tracking.h
>>>>  create mode 100644 include/uapi/linux/hot_tracking.h
>>>>
>>>> --
>>>> 1.7.11.7
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Zhi Yong Wu
>>
>>
>>
>> --
>> Regards,
>>
>> Zhi Yong Wu
>
>
>
> --
> Regards,
>
> Zhi Yong Wu



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 31+ messages in thread
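
The ratio columns in the quoted perf report can be rechecked from the raw
counters. The "cache-misses ratio" row is consistent with cache-misses divided
by cache-references, and the "cacheline bouncing ratio" of the t_lock rows is
consistent with con-bounces divided by acq-bounces. These formulas are inferred
from the numbers and are not stated in the thread. A minimal C sketch follows;
the helper names ratio_pct() and overhead_pct() are hypothetical and are not
part of the patchset:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return num / den as a percentage.
 * Hypothetical helper for rechecking the report; den must be non-zero.
 */
double ratio_pct(uint64_t num, uint64_t den)
{
	assert(den != 0);
	return 100.0 * (double)num / (double)den;
}

/*
 * Return the relative change of a "w/ hot tracking" figure against the
 * "w/o hot tracking" baseline, as a percentage. Negative means slower.
 */
double overhead_pct(uint64_t without, uint64_t with)
{
	assert(without != 0);
	return 100.0 * ((double)with - (double)without) / (double)without;
}
```

For the 32G baseline column, ratio_pct(45424567057, 1264996437581) gives about
3.59, in line with the reported 3.591 %. For the 250G t_lock row,
ratio_pct(1508, 157834) gives about 0.96. For sequential-8k-1jobs-read at 32G,
overhead_pct(61260, 60918) gives about -0.56, i.e. a bandwidth change of roughly
half a percent between the two 32G columns.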

* Re: [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage
  2013-11-06 13:45 ` [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage Zhi Yong Wu
  2013-11-11 22:15   ` Dave Hansen
@ 2013-12-11 15:44   ` Zhi Yong Wu
  1 sibling, 0 replies; 31+ messages in thread
From: Zhi Yong Wu @ 2013-12-11 15:44 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel mlist, Zhi Yong Wu, Ric Wheeler,
	Linus Torvalds, Paul McKenney

Ping ^ 7....

On Wed, Nov 6, 2013 at 9:45 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> Introduce a /proc interface, hot-mem-high-thresh, to cap the
> memory consumed by hot_inode_item and hot_range_item. The
> value is in units of 1M bytes.
>
> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
>  fs/hot_tracking.c            | 29 +++++++++++++++++++++++++++++
>  fs/hot_tracking.h            | 23 +++++++++++++++++++++++
>  include/linux/hot_tracking.h |  3 +++
>  kernel/sysctl.c              |  7 +++++++
>  4 files changed, 62 insertions(+)
>
> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
> index 7a9bd4f..2c5a7fd 100644
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -15,6 +15,7 @@
>  #include <linux/sched.h>
>  #include "hot_tracking.h"
>
> +int sysctl_hot_mem_high_thresh __read_mostly = 0;
>  int sysctl_hot_update_interval __read_mostly = 150;
>
>  /* kmem_cache pointers for slab caches */
> @@ -32,6 +33,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
>         hr->len = 1 << RANGE_BITS;
>         hr->hot_inode = he;
>         atomic_long_inc(&he->hot_root->hot_cnt);
> +       hot_mem_limit_add(he->hot_root, sizeof(struct hot_range_item));
>  }
>
>  static void hot_range_item_free_cb(struct rcu_head *head)
> @@ -55,6 +57,7 @@ static void hot_range_item_free(struct kref *kref)
>         spin_unlock(&root->m_lock);
>
>         atomic_long_dec(&root->hot_cnt);
> +       hot_mem_limit_sub(root, sizeof(struct hot_range_item));
>         call_rcu(&hr->rcu, hot_range_item_free_cb);
>  }
>
> @@ -103,6 +106,8 @@ redo:
>                                  * newly allocated item.
>                                  */
>                                 atomic_long_dec(&he->hot_root->hot_cnt);
> +                               hot_mem_limit_sub(he->hot_root,
> +                                               sizeof(struct hot_range_item));
>                                 kmem_cache_free(hot_range_item_cachep, hr_new);
>                         }
>                         spin_unlock(&he->i_lock);
> @@ -205,6 +210,7 @@ static void hot_inode_item_init(struct hot_inode_item *he,
>         he->hot_root = root;
>         spin_lock_init(&he->i_lock);
>         atomic_long_inc(&root->hot_cnt);
> +       hot_mem_limit_add(root, sizeof(struct hot_inode_item));
>  }
>
>  static void hot_inode_item_free_cb(struct rcu_head *head)
> @@ -226,6 +232,7 @@ static void hot_inode_item_free(struct kref *kref)
>         hot_range_tree_free(he);
>
>         atomic_long_dec(&he->hot_root->hot_cnt);
> +       hot_mem_limit_sub(he->hot_root, sizeof(struct hot_inode_item));
>         call_rcu(&he->rcu, hot_inode_item_free_cb);
>  }
>
> @@ -272,6 +279,8 @@ redo:
>                                  * newly allocated item.
>                                  */
>                                 atomic_long_dec(&root->hot_cnt);
> +                               hot_mem_limit_sub(root,
> +                                               sizeof(struct hot_inode_item));
>                                 kmem_cache_free(hot_inode_item_cachep, he_new);
>                         }
>                         spin_unlock(&root->t_lock);
> @@ -534,6 +543,23 @@ static unsigned long hot_item_evict(struct hot_info *root, unsigned long work,
>         return freed;
>  }
>
> +static void hot_mem_evict(struct hot_info *root)
> +{
> +       unsigned long sum, thresh;
> +
> +       if (sysctl_hot_mem_high_thresh == 0)
> +               return;
> +
> +       sum = hot_mem_limit_sum(root);
> +       /* Note: sysctl_** is in the unit of 1M bytes */
> +       thresh = sysctl_hot_mem_high_thresh;
> +       thresh *= 1024 * 1024;
> +       if (sum <= thresh)
> +               return;
> +
> +       hot_item_evict(root, sum - thresh, hot_mem_limit_sum);
> +}
> +
>  /*
>   * Every sync period we update temperatures for
>   * each hot inode item and hot range item for aging
> @@ -546,6 +572,8 @@ static void hot_update_worker(struct work_struct *work)
>         struct hot_inode_item *he;
>         struct rb_node *node;
>
> +       hot_mem_evict(root);
> +
>         rcu_read_lock();
>         node = root->hot_inode_tree.rb_node;
>         while (node) {
> @@ -753,6 +781,7 @@ int hot_track_init(struct super_block *sb)
>                 goto err;
>         }
>
> +       hot_mem_limit_init(root);
>         sb->s_hot_root = root;
>         sb->s_flags |= MS_HOTTRACK;
>
> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
> index 6a6971e..4ee0b90 100644
> --- a/fs/hot_tracking.h
> +++ b/fs/hot_tracking.h
> @@ -46,4 +46,27 @@ struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root, u64 ino);
>  void hot_inode_item_unlink(struct inode *inode);
>  u32 hot_temp_calc(struct hot_freq *freq);
>
> +/* Memory Tracking Functions. */
> +static inline unsigned long hot_mem_limit_sum(struct hot_info *root)
> +{
> +       return atomic_long_read(&root->mem);
> +}
> +
> +static inline void hot_mem_limit_sub(struct hot_info *root,
> +                               unsigned long count)
> +{
> +       atomic_long_sub(count, &root->mem);
> +}
> +
> +static inline void hot_mem_limit_add(struct hot_info *root,
> +                               unsigned long count)
> +{
> +       atomic_long_add(count, &root->mem);
> +}
> +
> +static inline void hot_mem_limit_init(struct hot_info *root)
> +{
> +       atomic_long_set(&root->mem, 0);
> +}
> +
>  #endif /* __HOT_TRACKING__ */
> diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
> index 43df1b9..5c2c247 100644
> --- a/include/linux/hot_tracking.h
> +++ b/include/linux/hot_tracking.h
> @@ -83,10 +83,13 @@ struct hot_info {
>         struct workqueue_struct *update_wq;
>         struct delayed_work update_work;
>         struct shrinker hot_shrink;
> +       atomic_long_t mem;
>  };
>
>  /* set how often to update temperatures (seconds) */
>  extern int sysctl_hot_update_interval;
> +/* note: sysctl_** is in the unit of 1M bytes */
> +extern int sysctl_hot_mem_high_thresh;
>
>  /*
>   * Hot data tracking ioctls:
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e0b062a..fde8bc2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1632,6 +1632,13 @@ static struct ctl_table fs_table[] = {
>                 .extra1         = &pipe_min_size,
>         },
>         {
> +               .procname       = "hot-mem-high-thresh",
> +               .data           = &sysctl_hot_mem_high_thresh,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec,
> +       },
> +       {
>                 .procname       = "hot-update-interval",
>                 .data           = &sysctl_hot_update_interval,
>                 .maxlen         = sizeof(int),
> --
> 1.7.11.7
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 31+ messages in thread
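
The threshold check in hot_mem_evict() from the quoted patch can be tried out
in user space. The sketch below mirrors only the arithmetic: the sysctl value
is in units of 1M bytes, zero disables the cap, and anything above the cap is
the amount handed to the evictor. The function name hot_mem_excess_bytes() is
hypothetical, and treating negative values as "disabled" is an assumption of
this sketch rather than something the patch spells out:

```c
#include <stdint.h>

/*
 * User-space mirror of the check in hot_mem_evict() (hypothetical name).
 *
 * used_bytes: memory currently accounted to hot items, in bytes
 * thresh_mib: value of hot-mem-high-thresh, in units of 1M bytes
 *
 * Returns how many bytes are above the cap, or 0 if no eviction is needed.
 */
uint64_t hot_mem_excess_bytes(uint64_t used_bytes, int thresh_mib)
{
	uint64_t thresh;

	/* 0 means "no cap" in the patch; negative is treated the same here
	 * (assumption of this sketch). */
	if (thresh_mib <= 0)
		return 0;

	/* Convert the threshold from 1M-byte units to bytes. */
	thresh = (uint64_t)thresh_mib * 1024 * 1024;
	if (used_bytes <= thresh)
		return 0;

	return used_bytes - thresh;
}
```

With a threshold of 2 and 3 MiB of accounted memory, the function returns
1048576, which corresponds to the "sum - thresh" argument passed to
hot_item_evict() in the patch. At or below the threshold it returns 0, matching
the early return.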

* Re: [PATCH v6 00/11] VFS hot tracking
  2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
                   ` (12 preceding siblings ...)
  2013-11-13 18:33 ` Zhi Yong Wu
@ 2013-12-11 15:45 ` Zhi Yong Wu
  2014-07-17 19:35   ` The VFS hot tracking debacle Daniel Poelzleithner
  13 siblings, 1 reply; 31+ messages in thread
From: Zhi Yong Wu @ 2013-12-11 15:45 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel mlist, Zhi Yong Wu, Ric Wheeler,
	Paul McKenney

Ping ^ 7

On Wed, Nov 6, 2013 at 9:45 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
>   This patchset introduces a hot tracking function in the VFS
> layer, which keeps track of real disk I/O in memory. With it,
> you can easily see more detail about disk I/O and detect where
> the disk I/O hot spots are. Specific filesystems can also make
> use of it for accurate defragmentation, hot relocation
> support, etc.
>
>   V6 is now ready for external review; any comments or ideas
> are appreciated, thanks.
>
> NOTE:
>
>   The patchset can be obtained via my kernel dev git on github:
> git://github.com/wuzhy/kernel.git hot_tracking
>   If you're interested, you can also review them via
> https://github.com/wuzhy/kernel/commits/hot_tracking
>
>   For usage, further information and the performance report,
> please see hot_tracking.txt in Documentation and the following
> links:
>   1.) http://lwn.net/Articles/525651/
>   2.) https://lkml.org/lkml/2012/12/20/199
>
>   Scalability and performance tests were run on this patchset
> with fs_mark, ffsb and compilebench.
>
>   The perf tests were run on Linux 3.12.0-rc7 on a big-endian PPC64
> machine (model IBM,8231-E2C) with 64 CPUs, 2 NUMA nodes, 250G RAM and a
> 1.50 TiB test hard disk; each test file is 20G or 100G in size.
> Architecture:          ppc64
> Byte Order:            Big Endian
> CPU(s):                64
> On-line CPU(s) list:   0-63
> Thread(s) per core:    4
> Core(s) per socket:    1
> Socket(s):             16
> NUMA node(s):          2
> Model:                 IBM,8231-E2C
> Hypervisor vendor:     pHyp
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              4096K
> NUMA node0 CPU(s):     0-31
> NUMA node1 CPU(s):     32-63
>
>   Below is the perf testing report:
>
>   Please focus on the two key points:
>   - The overall overhead which is injected by the patchset
>   - The stability of the perf results
>
> 1. fio tests
>
>                             w/o hot tracking                               w/ hot tracking
>
> RAM size                            32G          32G         16G           8G           4G           2G          250G
>
> sequential-8k-1jobs-read         61260KB/s    60918KB/s    60901KB/s    62610KB/s    60992KB/s    60213KB/s    60948KB/s
>
> sequential-8k-1jobs-write         1329KB/s     1329KB/s     1328KB/s     1329KB/s     1328KB/s     1329KB/s     1329KB/s
>
> sequential-8k-8jobs-read         91139KB/s    92614KB/s    90907KB/s    89895KB/s    92022KB/s    90851KB/s    91877KB/s
>
> sequential-8k-8jobs-write         2523KB/s     2522KB/s     2516KB/s     2521KB/s     2516KB/s     2518KB/s     2521KB/s
>
> sequential-256k-1jobs-read      151432KB/s   151403KB/s   151406KB/s   151422KB/s   151344KB/s   151446KB/s   151372KB/s
>
> sequential-256k-1jobs-write      33451KB/s    33470KB/s    33481KB/s    33470KB/s    33459KB/s    33472KB/s    33477KB/s
>
> sequential-256k-8jobs-read      235291KB/s   234555KB/s   234251KB/s   233656KB/s   234927KB/s   236380KB/s   235535KB/s
>
> sequential-256k-8jobs-write      62419KB/s    62402KB/s    62191KB/s    62859KB/s    62629KB/s    62720KB/s    62523KB/s
>
> random-io-mix-8k-1jobs  [READ]    2929KB/s     2942KB/s     2946KB/s     2929KB/s     2934KB/s     2947KB/s     2946KB/s
>                         [WRITE]   1262KB/s     1266KB/s     1257KB/s     1262KB/s     1257KB/s     1257KB/s     1265KB/s
>
> random-io-mix-8k-8jobs  [READ]    2444KB/s     2442KB/s     2436KB/s     2416KB/s     2353KB/s     2441KB/s     2442KB/s
>                         [WRITE]   1047KB/s     1044KB/s     1047KB/s     1028KB/s     1017KB/s     1034KB/s     1049KB/s
>
> random-io-mix-8k-16jobs [READ]    2182KB/s     2184KB/s     2169KB/s     2178KB/s     2190KB/s     2184KB/s     2180KB/s
>                         [WRITE]    932KB/s      930KB/s      943KB/s      936KB/s      937KB/s      929KB/s      931KB/s
>
> The figures above are the aggregate bandwidth of the threads in each group.
> If you would like other perf parameters or the raw fio results, please let me know, thanks.
>
> 2. Locking stat - Contention & Cacheline Bouncing
>
> RAM size         class name         con-bounces  contentions  acq-bounces   acquisitions   cacheline bouncing  locking contention
>                                                                                                  ratio              ratio
>
>               &(&root->t_lock)->rlock:  1508        1592         157834      374639292           0.96%              0.00%
> 250G          &(&root->m_lock)->rlock:  1469        1484         119221       43077842           1.23%              0.00%
>               &(&he->i_lock)->rlock:       0           0         101879      376755218           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  2912        2985         342575      374691186           0.85%              0.00%
> 32G           &(&root->m_lock)->rlock:   188         193         307765        8803163           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         291860      376756084           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  3863        3948         298041      374727038           1.30%              0.00%
> 16G           &(&root->m_lock)->rlock:   220         228         254451        8687057           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         235027      376756830           0.00%              0.00%
>
>               &(&root->t_lock)->rlock:  3283        3409         233790      374722064           1.40%              0.00%
> 8G            &(&root->m_lock)->rlock:   136         139         203917        8684313           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         193746      376756438           0.00%              0.00%
>
>               &(&root->t_lock)->rlock: 15090       15705         283460      374889666           5.32%              0.00%
> 4G            &(&root->m_lock)->rlock:   172         173         222480        8555052           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         206431      376759452           0.00%              0.00%
>
>               &(&root->t_lock)->rlock: 25515       27368         305129      375394828           8.36%              0.00%
> 2G            &(&root->m_lock)->rlock:   100         101         216516        6752265           0.00%              0.00%
>               &(&he->i_lock)->rlock:       0           0         214713      376765169           0.00%              0.00%
>
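As a rough cross-check of the two ratio columns in the locking table, the sketch below assumes that the cacheline bouncing ratio is con-bounces divided by acq-bounces and that the locking contention ratio is contentions divided by acquisitions. The message does not state these formulas; they are inferred from the t_lock rows, which they reproduce. The helper name `pct` and the `rows` layout are made up for illustration.

```python
# Inferred formulas, not taken from the patch set:
#   cacheline bouncing ratio = con-bounces / acq-bounces
#   locking contention ratio = contentions / acquisitions

def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)

# Figures copied from the &(&root->t_lock)->rlock rows of the table.
rows = {
    # RAM size: (con-bounces, contentions, acq-bounces, acquisitions)
    "250G": (1508, 1592, 157834, 374639292),
    "4G": (15090, 15705, 283460, 374889666),
}

for ram, (con_bounces, contentions, acq_bounces, acquisitions) in rows.items():
    bouncing = pct(con_bounces, acq_bounces)
    contention = pct(contentions, acquisitions)
    print(f"{ram}: bouncing {bouncing:.2f}%  contention {contention:.2f}%")
```

Under these assumptions the contention ratio stays near zero because the lock is acquired hundreds of millions of times, while the bouncing ratio grows as RAM shrinks.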
> 3. Perf test - Cacheline Ping-pong
>
>                       w/o hot tracking                                                        w/ hot tracking
>
> RAM size                    32G                  32G                 16G                  8G                   4G                    2G                  250G
>
> cache-references    1,264,996,437,581    1,401,504,955,577    1,398,308,614,801    1,396,525,544,527    1,384,793,467,410    1,432,042,560,409    1,571,627,148,771
>
> cache-misses           45,424,567,057       58,432,749,807       59,200,504,032       59,762,030,933       58,104,156,576       57,283,962,840       61,963,839,419
>
> seconds time elapsed  22956.327674298      23035.457069488      23017.232397085      23012.397142967      23008.420970731      23057.245578767      23342.456015188
>
> cache-misses ratio            3.591 %              4.169 %              4.234 %              4.279 %              4.196 %              4.000 %              3.943 %
>
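The cache-misses ratio row can be recomputed from the two rows above it, assuming it is simply cache-misses divided by cache-references. The sketch below does that for the 32G baseline and the 32G run with hot tracking, and also derives the relative increase in elapsed time. The dictionary layout and function names are illustrative only.

```python
# Figures copied from the 32G columns of the perf table
# (without and with hot tracking).
baseline = {
    "refs": 1_264_996_437_581,
    "misses": 45_424_567_057,
    "secs": 22956.327674298,
}
tracked = {
    "refs": 1_401_504_955_577,
    "misses": 58_432_749_807,
    "secs": 23035.457069488,
}

def miss_ratio(run):
    """Cache misses as a percentage of cache references (assumed formula)."""
    return 100.0 * run["misses"] / run["refs"]

def time_overhead(base, other):
    """Relative increase in elapsed time of `other` over `base`, in percent."""
    return 100.0 * (other["secs"] - base["secs"]) / base["secs"]

print(f"baseline miss ratio: {miss_ratio(baseline):.3f} %")
print(f"tracked miss ratio:  {miss_ratio(tracked):.3f} %")
print(f"elapsed-time overhead: {time_overhead(baseline, tracked):.2f} %")
```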
> Changelog from v5:
>  - Also added the hook hot_freqs_update() in the page cache I/O path,
>    not only in real disk I/O path [viro]
>  - Don't export the stuff until it's used by a module [viro]
>  - Split hot_inode_item_lookup() [viro]
>  - Prevented hot items from being re-created after the inode was unlinked. [viro]
>  - Made hot_freqs_update() inline and adopted one private hot flag [viro]
>  - Killed hot_bit_shift() [viro]
>  - Used file_inode() instead of file->f_dentry->d_inode [viro]
>  - Introduced one new file hot_tracking.h in include/uapi/linux/ [viro]
>  - Made the checks for ->i_nlink protected by ->i_mutex [viro]
>
> v5:
>  - Added all kinds of perf testing report [viro]
>  - Covered mmap() now [viro]
>  - Removed list_sort() in hot_update_worker() to avoid locking contention
>    and cacheline bouncing [viro]
>  - Removed a /proc interface to control low memory usage [Chandra]
>  - Adjusted shrinker support due to the change of public shrinker APIs [zwu]
>  - Fixed the missing locking when hot_inode_item_put() is called
>    in ioctl_heat_info() [viro]
>  - Fixed some locking contention issues [zwu]
>
> v4:
>  - Removed debugfs support, but left it on the TODO list [viro, Chandra]
>  - Killed the HOT_DELETING and HOT_IN_LIST flags [viro]
>  - Fixed unlink issues [viro]
>  - Fixed the leak in lookups (both inode and range)
>    when racing with unlink [viro]
>  - Killed hot_comm_item and split the functions which take it [viro]
>  - Fixed some other issues [zwu, Chandra]
>
> v3:
>  - Added memory capping function for hot items [Zhiyong]
>  - Cleaned up the aging function [Zhiyong]
>
> v2:
>  - Refactored to be under RCU [Chandra Seetharaman]
>  - Merged some code changes [Chandra Seetharaman]
>  - Fixed some issues [Chandra Seetharaman]
>
> v1:
>  - Solved 64 bits inode number issue. [David Sterba]
>  - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
>  - Cleaned up some issues [David Sterba]
>  - Use a static hot debugfs root [Greg KH]
>
> rfcv4:
>  - Introduce hot func registering framework [Zhiyong]
>  - Remove global variable for hot tracking [Zhiyong]
>  - Add btrfs hot tracking support [Zhiyong]
>
> rfcv3:
>  1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
>  2.) Refactored workqueue support. [Dave Chinner]
>  3.) Made some macros tunable [Zhiyong, Liu Zheng]
>      TIME_TO_KICK and HEAT_UPDATE_DELAY
>  4.) Cleaned up a lot of other issues [Dave Chinner]
>
>
> rfcv2:
>  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
>  2.) Added memory shrinker [Dave Chinner]
>  3.) Converted to one workqueue to update map info periodically [Dave Chinner]
>  4.) Cleaned up a lot of other issues [Dave Chinner]
>
> rfcv1:
>  1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
>  2.) The first three patches can probably just be flattened into one.
>                                         [Marco Stornelli, Dave Chinner]
>
>
> Dave Chinner (1):
>   VFS hot tracking, xfs: Add hot tracking support
>
> Zhi Yong Wu (10):
>   VFS hot tracking: Define basic data structures and functions
>   VFS hot tracking: Track IO and record heat information
>   VFS hot tracking: Add a workqueue to move items between hot maps
>   VFS hot tracking: Add shrinker functionality to curtail memory usage
>   VFS hot tracking: Add an ioctl to get hot tracking information
>   VFS hot tracking: Add a /proc interface to make the interval tunable
>   VFS hot tracking: Add a /proc interface to control memory usage
>   VFS hot tracking: Add documentation
>   VFS hot tracking, btrfs: Add hot tracking support
>   MAINTAINERS: add the maintainers for VFS hot tracking
>
>  Documentation/filesystems/00-INDEX         |   2 +
>  Documentation/filesystems/hot_tracking.txt | 207 ++++++++
>  MAINTAINERS                                |  12 +
>  fs/Makefile                                |   2 +-
>  fs/btrfs/ctree.h                           |   1 +
>  fs/btrfs/super.c                           |  22 +-
>  fs/compat_ioctl.c                          |   5 +
>  fs/dcache.c                                |   2 +
>  fs/hot_tracking.c                          | 816 +++++++++++++++++++++++++++++
>  fs/hot_tracking.h                          |  72 +++
>  fs/ioctl.c                                 |  71 +++
>  fs/namei.c                                 |   4 +
>  fs/xfs/xfs_mount.h                         |   1 +
>  fs/xfs/xfs_super.c                         |  18 +
>  include/linux/fs.h                         |   4 +
>  include/linux/hot_tracking.h               | 107 ++++
>  include/uapi/linux/fs.h                    |   1 +
>  include/uapi/linux/hot_tracking.h          |  33 ++
>  kernel/sysctl.c                            |  14 +
>  mm/filemap.c                               |  24 +-
>  mm/readahead.c                             |   6 +
>  21 files changed, 1420 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/filesystems/hot_tracking.txt
>  create mode 100644 fs/hot_tracking.c
>  create mode 100644 fs/hot_tracking.h
>  create mode 100644 include/linux/hot_tracking.h
>  create mode 100644 include/uapi/linux/hot_tracking.h
>
> --
> 1.7.11.7
>



-- 
Regards,

Zhi Yong Wu


* The VFS hot tracking debacle
  2013-12-11 15:45 ` Zhi Yong Wu
@ 2014-07-17 19:35   ` Daniel Poelzleithner
  2014-07-17 21:34     ` Martin Steigerwald
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Poelzleithner @ 2014-07-17 19:35 UTC (permalink / raw)
  To: linux-fsdevel

Zhi Yong Wu <zwu.kernel <at> gmail.com> writes:

> 
> Ping ^ 7
> 


I have been following this patch for quite some time now, and I have to say
that this thread+patch is one of the most disappointing experiences I have
had with kernel development.

I can't understand why such a patch is simply ignored by the maintainers
of the subsystem: no comment, no review, no check-in.

If you would at least give an explanation of why it is a bad idea to have hot
data on fast drives, or say that the general approach is simply bad, or that
there are some quirks in the code that need to be fixed, then all right.
But ignoring a patch like this, which would make many people happy, is not
understandable from my perspective, and the only explanation I can think of
is some bad intention or a conflict with an employer...

This patch story is just disappointing.

Zhi Yong Wu, thanks for the great patch.


kind regards
 Daniel




* Re: The VFS hot tracking debacle
  2014-07-17 19:35   ` The VFS hot tracking debacle Daniel Poelzleithner
@ 2014-07-17 21:34     ` Martin Steigerwald
  2014-07-17 21:52       ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Martin Steigerwald @ 2014-07-17 21:34 UTC (permalink / raw)
  To: Daniel Poelzleithner; +Cc: linux-fsdevel

On Thursday, 17 July 2014, 19:35:53, Daniel Poelzleithner wrote:
> Zhi Yong Wu <zwu.kernel <at> gmail.com> writes:
> > Ping ^ 7
> 
> I have been following this patch for quite some time now, and I have to say
> that this thread+patch is one of the most disappointing experiences I have
> had with kernel development.
> 
> I can't understand why such a patch is simply ignored by the maintainers
> of the subsystem: no comment, no review, no check-in.

I always wanted to try this one out, but never got around to doing it.

I really like its general approach of putting the generic parts into the VFS
and only the filesystem-specific parts into filesystems like BTRFS.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: The VFS hot tracking debacle
  2014-07-17 21:34     ` Martin Steigerwald
@ 2014-07-17 21:52       ` Dave Chinner
  2014-07-18  8:25         ` Martin Steigerwald
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2014-07-17 21:52 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Daniel Poelzleithner, linux-fsdevel

On Thu, Jul 17, 2014 at 11:34:49PM +0200, Martin Steigerwald wrote:
> On Thursday, 17 July 2014, 19:35:53, Daniel Poelzleithner wrote:
> > Zhi Yong Wu <zwu.kernel <at> gmail.com> writes:
> > > Ping ^ 7
> > 
> > I have been following this patch for quite some time now, and I have to say
> > that this thread+patch is one of the most disappointing experiences I have
> > had with kernel development.
> > 
> > I can't understand why such a patch is simply ignored by the maintainers
> > of the subsystem: no comment, no review, no check-in.
> 
> I always wanted to try this one out, but never got around to doing it.
> 
> I really like its general approach of putting the generic parts into the VFS
> and only the filesystem-specific parts into filesystems like BTRFS.

And that's the core issue here: there are no applications that use
the information.  i.e. it's a solution looking for a problem.

I spent a lot of time reviewing and helping on this, and I made
repeated suggestions that applications like xfs_fsr could make use
of the information to do optimised file layout during
defragmentation, but nothing like that has ever been implemented.

So, really, until there is an application that actually demonstrates
the usefulness of the specific information that is tracked and
exported, we can't verify that the code as it stands is actually
useful. We can verify that the code doesn't have problems, but we
can't verify whether it is fit for purpose because it currently has
no purpose....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: The VFS hot tracking debacle
  2014-07-17 21:52       ` Dave Chinner
@ 2014-07-18  8:25         ` Martin Steigerwald
  2014-07-20  0:02           ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Martin Steigerwald @ 2014-07-18  8:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Daniel Poelzleithner, linux-fsdevel

On Friday, 18 July 2014, 07:52:48, Dave Chinner wrote:
> On Thu, Jul 17, 2014 at 11:34:49PM +0200, Martin Steigerwald wrote:
> > On Thursday, 17 July 2014, 19:35:53, Daniel Poelzleithner wrote:
> > > Zhi Yong Wu <zwu.kernel <at> gmail.com> writes:
> > > > Ping ^ 7
> > > 
> > > I have been following this patch for quite some time now, and I have to
> > > say that this thread+patch is one of the most disappointing experiences
> > > I have had with kernel development.
> > > 
> > > I can't understand why such a patch is simply ignored by the maintainers
> > > of the subsystem: no comment, no review, no check-in.
> > 
> > I always wanted to try this one out, but never got around to doing it.
> > 
> > I really like its general approach of putting the generic parts into the
> > VFS and only the filesystem-specific parts into filesystems like BTRFS.
> 
> And that's the core issue here: there are no applications that use
> the information.  i.e. it's a solution looking for a problem.
> 
> I spent a lot of time reviewing and helping on this, and I made
> repeated suggestions that applications like xfs_fsr could make use
> of the information to do optimised file layout during
> defragmentation, but nothing like that has ever been implemented.
> 
> So, really, until there is an application that actually demonstrates
> the usefulness of the specific information that is tracked and
> exported, we can't verify that the code as it stands is actually
> useful. We can verify that the code doesn't have problems, but we
> can't verify whether it is fit for purpose because it currently has
> no purpose....

So this needs an example implementation for one filesystem?

I thought there is one for BTRFS.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: The VFS hot tracking debacle
  2014-07-18  8:25         ` Martin Steigerwald
@ 2014-07-20  0:02           ` Dave Chinner
  2014-07-25  8:43             ` Steven Whitehouse
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2014-07-20  0:02 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Daniel Poelzleithner, linux-fsdevel

On Fri, Jul 18, 2014 at 10:25:23AM +0200, Martin Steigerwald wrote:
> On Friday, 18 July 2014, 07:52:48, Dave Chinner wrote:
> > On Thu, Jul 17, 2014 at 11:34:49PM +0200, Martin Steigerwald wrote:
> > > On Thursday, 17 July 2014, 19:35:53, Daniel Poelzleithner wrote:
> > > > Zhi Yong Wu <zwu.kernel <at> gmail.com> writes:
> > > > > Ping ^ 7
> > > > 
> > > > I have been following this patch for quite some time now, and I have
> > > > to say that this thread+patch is one of the most disappointing
> > > > experiences I have had with kernel development.
> > > > 
> > > > I can't understand why such a patch is simply ignored by the
> > > > maintainers of the subsystem: no comment, no review, no check-in.
> > > 
> > > I always wanted to try this one out, but never got around to doing it.
> > > 
> > > I really like its general approach of putting the generic parts into the
> > > VFS and only the filesystem-specific parts into filesystems like BTRFS.
> > 
> > And that's the core issue here: there are no applications that use
> > the information.  i.e. it's a solution looking for a problem.
> > 
> > I spent a lot of time reviewing and helping on this, and I made
> > repeated suggestions that applications like xfs_fsr could make use
> > of the information to do optimised file layout during
> > defragmentation, but nothing like that has ever been implemented.
> > 
> > So, really, until there is an application that actually demonstrates
> > the usefulness of the specific information that is tracked and
> > exported, we can't verify that the code as it stands is actually
> > useful. We can verify that the code doesn't have problems, but we
> > can't verify whether it is fit for purpose because it currently has
> > no purpose....
> 
> So this needs an example implementation for one filesystem?
> 
> I thought there is one for BTRFS.

The tracking of the "hot data" was implemented for both btrfs and
XFS. There isn't a *consumer* of that data, however, and that's the
missing piece of the picture.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: The VFS hot tracking debacle
  2014-07-20  0:02           ` Dave Chinner
@ 2014-07-25  8:43             ` Steven Whitehouse
  0 siblings, 0 replies; 31+ messages in thread
From: Steven Whitehouse @ 2014-07-25  8:43 UTC (permalink / raw)
  To: Dave Chinner, Martin Steigerwald; +Cc: Daniel Poelzleithner, linux-fsdevel

Hi,

On 20/07/14 01:02, Dave Chinner wrote:
> On Fri, Jul 18, 2014 at 10:25:23AM +0200, Martin Steigerwald wrote:
>> On Friday, 18 July 2014, 07:52:48, Dave Chinner wrote:
>>> On Thu, Jul 17, 2014 at 11:34:49PM +0200, Martin Steigerwald wrote:
>>>> On Thursday, 17 July 2014, 19:35:53, Daniel Poelzleithner wrote:
>>>>> Zhi Yong Wu <zwu.kernel <at> gmail.com> writes:
>>>>>> Ping ^ 7
>>>>> I have been following this patch for quite some time now, and I have
>>>>> to say that this thread+patch is one of the most disappointing
>>>>> experiences I have had with kernel development.
>>>>>
>>>>> I can't understand why such a patch is simply ignored by the
>>>>> maintainers of the subsystem: no comment, no review, no check-in.
>>>> I always wanted to try this one out, but never got around to doing it.
>>>>
>>>> I really like its general approach of putting the generic parts into the
>>>> VFS and only the filesystem-specific parts into filesystems like BTRFS.
>>> And that's the core issue here: there are no applications that use
>>> the information.  i.e. it's a solution looking for a problem.
>>>
>>> I spent a lot of time reviewing and helping on this, and I made
>>> repeated suggestions that applications like xfs_fsr could make use
>>> of the information to do optimised file layout during
>>> defragmentation, but nothing like that has ever been implemented.
>>>
>>> So, really, until there is an application that actually demonstrates
>>> the usefulness of the specific information that is tracked and
>>> exported, we can't verify that the code as it stands is actually
>>> useful. We can verify that the code doesn't have problems, but we
>>> can't verify whether it is fit for purpose because it currently has
>>> no purpose....
>> So this needs an example implementation for one filesystem?
>>
>> I thought there is one for BTRFS.
> The tracking of the "hot data" was implemented for both btrfs and
> XFS. There isn't a *consumer* of that data, however, and that's the
> missing piece of the picture.
>
> Cheers,
>
> Dave.

GFS2 has both a producer (gfs2 tracepoints) and a consumer (PCP pmda) of 
data which is intended to provide a picture of where there are hot spots 
in terms of cluster locking. The hot tracking patch set sounds like it 
should provide something similar on a generic basis, which I think 
would be a useful thing to have. I did look at some earlier versions of 
this patch set, but I rather lost track of it. Is there an up-to-date git 
tree somewhere?

Steve.



end of thread, other threads:[~2014-07-25  8:43 UTC | newest]

Thread overview: 31+ messages
2013-11-06 13:45 [PATCH v6 00/11] VFS hot tracking Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 01/11] VFS hot tracking: Define basic data structures and functions Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 02/11] VFS hot tracking: Track IO and record heat information Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 03/11] VFS hot tracking: Add a workqueue to move items between hot maps Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 04/11] VFS hot tracking: Add shrinker functionality to curtail memory usage Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 05/11] VFS hot tracking: Add an ioctl to get hot tracking information Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 06/11] VFS hot tracking: Add a /proc interface to make the interval tunable Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 07/11] VFS hot tracking: Add a /proc interface to control memory usage Zhi Yong Wu
2013-11-11 22:15   ` Dave Hansen
2013-11-11 22:45     ` Zhi Yong Wu
2013-11-12 17:05       ` Dave Hansen
2013-11-12 20:38         ` Zhi Yong Wu
2013-11-12 21:02           ` Dave Hansen
2013-11-12 21:56             ` Zhi Yong Wu
2013-12-11 15:44   ` Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 08/11] VFS hot tracking: Add documentation Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 09/11] VFS hot tracking, btrfs: Add hot tracking support Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 10/11] VFS hot tracking, xfs: " Zhi Yong Wu
2013-11-06 13:45 ` [PATCH v6 11/11] MAINTAINERS: add the maintainers for VFS hot tracking Zhi Yong Wu
2013-11-11 15:43 ` [PATCH v6 00/11] " Zhi Yong Wu
2013-11-13 18:33 ` Zhi Yong Wu
2013-11-21 13:57   ` Zhi Yong Wu
2013-11-30  9:55     ` Zhi Yong Wu
2013-12-03 20:16       ` Zhi Yong Wu
2013-12-11 15:45 ` Zhi Yong Wu
2014-07-17 19:35   ` The VFS hot tracking debacle Daniel Poelzleithner
2014-07-17 21:34     ` Martin Steigerwald
2014-07-17 21:52       ` Dave Chinner
2014-07-18  8:25         ` Martin Steigerwald
2014-07-20  0:02           ` Dave Chinner
2014-07-25  8:43             ` Steven Whitehouse
