* [PATCH V4 00/12] blktrace: output cgroup info
@ 2017-06-28 16:29 Shaohua Li
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Hi,

Currently blktrace isn't cgroup aware. blktrace prints out the task name of
the current context, but the task of the current context isn't always in the
cgroup where the BIO comes from, so the task name can't be used to find the
IO cgroup. For example, writeback BIOs always come from the flusher thread,
but the BIOs belong to different blk cgroups. A request could be requeued and
dispatched from a completely different task. MD/DM are other examples. This
makes it hard to use blktrace for performance tuning with cgroups enabled.

This patchset tries to close the gap. We print cgroup fhandle info in
blktrace. Userspace can use the open_by_handle_at() syscall to find the
cgroup by fhandle, or it can use the name_to_handle_at() syscall to get the
fhandle for a cgroup and use a BPF program to filter blktrace output to that
specific cgroup.
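
For illustration only (not part of this series), a minimal userspace sketch
that grabs a cgroup's fhandle with name_to_handle_at(); the cgroup path below
is just an example, and the resulting handle could then be fed to a BPF
filter:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct file_handle *fh;
	int mnt_id;

	/* MAX_HANDLE_SZ is plenty; the cgroup handle itself is 8 bytes */
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return 1;
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, "/sys/fs/cgroup/blkio/test",
			      fh, &mnt_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}
	printf("type %d, %u handle bytes\n", fh->handle_type, fh->handle_bytes);
	return 0;
}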

The first 6 patches add export operation handlers to kernfs, so userspace
can use open_by_handle_at()/name_to_handle_at() on a kernfs file. The later
patches make blktrace output cgroup info.

Thanks,
Shaohua

V3 -> V4:
- Address some issues pointed out by Tejun
- Fix build errors with latest -next tree

V2 -> V3:
- Use a 32-bit inode number
- Refresh to latest -next tree

V1 -> V2:
- Fix a bug in cgroup association
- Fix build errors reported by 0day
- Address some issues pointed out by Tejun


Shaohua Li (12):
  kernfs: use idr instead of ida to manage inode number
  kernfs: implement i_generation
  kernfs: add an API to get kernfs node from inode number
  kernfs: don't set dentry->d_fsdata
  kernfs: introduce kernfs_node_id
  kernfs: add exportfs operations
  cgroup: export fhandle info for a cgroup
  blktrace: export cgroup info in trace
  block: always attach cgroup info into bio
  block: call __bio_free in bio_endio
  blktrace: add an option to allow displaying cgroup path
  block: use standard blktrace API to output cgroup info for debug notes

 block/bfq-iosched.h               |  13 +-
 block/bio-integrity.c             |   1 +
 block/bio.c                       |   2 +
 block/blk-throttle.c              |  13 +-
 block/cfq-iosched.c               |  15 +--
 fs/kernfs/dir.c                   | 111 ++++++++++++----
 fs/kernfs/file.c                  |  10 +-
 fs/kernfs/inode.c                 |   9 +-
 fs/kernfs/kernfs-internal.h       |   9 ++
 fs/kernfs/mount.c                 |  94 ++++++++++++--
 fs/kernfs/symlink.c               |   6 +-
 include/linux/blk-cgroup.h        |   3 +
 include/linux/blktrace_api.h      |  13 +-
 include/linux/cgroup.h            |  16 ++-
 include/linux/kernfs.h            |  28 ++++-
 include/trace/events/writeback.h  |   2 +-
 include/uapi/linux/blktrace_api.h |   3 +
 kernel/cgroup/cgroup.c            |  15 ++-
 kernel/trace/blktrace.c           | 259 ++++++++++++++++++++++++++------------
 19 files changed, 470 insertions(+), 152 deletions(-)

-- 
2.9.3

* [PATCH V4 01/12] kernfs: use idr instead of ida to manage inode number
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

kernfs uses an ida to manage inode numbers. The problem is that we can't get
a kernfs_node back from an inode number with an ida. Switch to an idr; the
next patch will add an API to get a kernfs_node from an inode number.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c        | 17 ++++++++++++-----
 include/linux/kernfs.h |  2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index db5900aaa..8ad7a17 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -21,6 +21,7 @@
 DEFINE_MUTEX(kernfs_mutex);
 static DEFINE_SPINLOCK(kernfs_rename_lock);	/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];	/* protected by rename_lock */
+static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 #define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)
 
@@ -533,7 +534,9 @@ void kernfs_put(struct kernfs_node *kn)
 		simple_xattrs_free(&kn->iattr->xattrs);
 	}
 	kfree(kn->iattr);
-	ida_simple_remove(&root->ino_ida, kn->ino);
+	spin_lock(&kernfs_idr_lock);
+	idr_remove(&root->ino_idr, kn->ino);
+	spin_unlock(&kernfs_idr_lock);
 	kmem_cache_free(kernfs_node_cache, kn);
 
 	kn = parent;
@@ -542,7 +545,7 @@ void kernfs_put(struct kernfs_node *kn)
 			goto repeat;
 	} else {
 		/* just released the root kn, free @root too */
-		ida_destroy(&root->ino_ida);
+		idr_destroy(&root->ino_idr);
 		kfree(root);
 	}
 }
@@ -630,7 +633,11 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	if (!kn)
 		goto err_out1;
 
-	ret = ida_simple_get(&root->ino_ida, 1, 0, GFP_KERNEL);
+	idr_preload(GFP_KERNEL);
+	spin_lock(&kernfs_idr_lock);
+	ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+	spin_unlock(&kernfs_idr_lock);
+	idr_preload_end();
 	if (ret < 0)
 		goto err_out2;
 	kn->ino = ret;
@@ -875,13 +882,13 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 	if (!root)
 		return ERR_PTR(-ENOMEM);
 
-	ida_init(&root->ino_ida);
+	idr_init(&root->ino_idr);
 	INIT_LIST_HEAD(&root->supers);
 
 	kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       KERNFS_DIR);
 	if (!kn) {
-		ida_destroy(&root->ino_ida);
+		idr_destroy(&root->ino_idr);
 		kfree(root);
 		return ERR_PTR(-ENOMEM);
 	}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index a9b11b8..5f5d602 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -163,7 +163,7 @@ struct kernfs_root {
 	unsigned int		flags;	/* KERNFS_ROOT_* flags */
 
 	/* private fields, do not use outside kernfs proper */
-	struct ida		ino_ida;
+	struct idr		ino_idr;
 	struct kernfs_syscall_ops *syscall_ops;
 
 	/* list of kernfs_super_info of this root, protected by kernfs_mutex */
-- 
2.9.3

* [PATCH V4 02/12] kernfs: implement i_generation
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Set i_generation for kernfs inodes. This is required to implement exportfs
operations. The generation is 32-bit, so it's possible the generation wraps
around and we find stale files. To reduce the possibility, we don't reuse
inode numbers immediately; instead, when the inode number allocation wraps,
we increase the generation number. This way the generation/inode number pair
forms a 64-bit identifier which is unlikely to be duplicated. It does make
the idr tree more sparse and wastes some memory. Since the idr manages
32-bit keys with a 6-level radix tree, each level covering 6 bits of the
key, a kernfs instance with 100k inodes will have around 300k radix tree
nodes in the worst case. Each node is 576 bytes, so the tree will use
roughly 150MB of memory. That doesn't sound too bad; if it really becomes a
problem, we should find a better data structure.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c        | 10 +++++++++-
 fs/kernfs/inode.c      |  1 +
 include/linux/kernfs.h |  2 ++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8ad7a17..33f711f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -623,6 +623,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     unsigned flags)
 {
 	struct kernfs_node *kn;
+	u32 gen;
+	int cursor;
 	int ret;
 
 	name = kstrdup_const(name, GFP_KERNEL);
@@ -635,12 +637,17 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 
 	idr_preload(GFP_KERNEL);
 	spin_lock(&kernfs_idr_lock);
-	ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+	cursor = idr_get_cursor(&root->ino_idr);
+	ret = idr_alloc_cyclic(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+	if (ret >= 0 && ret < cursor)
+		root->next_generation++;
+	gen = root->next_generation;
 	spin_unlock(&kernfs_idr_lock);
 	idr_preload_end();
 	if (ret < 0)
 		goto err_out2;
 	kn->ino = ret;
+	kn->generation = gen;
 
 	atomic_set(&kn->count, 1);
 	atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
@@ -884,6 +891,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	idr_init(&root->ino_idr);
 	INIT_LIST_HEAD(&root->supers);
+	root->next_generation = 1;
 
 	kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       KERNFS_DIR);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fb4b4a7..79cdae4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,6 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	inode->i_private = kn;
 	inode->i_mapping->a_ops = &kernfs_aops;
 	inode->i_op = &kernfs_iops;
+	inode->i_generation = kn->generation;
 
 	set_default_inode_attr(inode, kn->mode);
 	kernfs_refresh_inode(kn, inode);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5f5d602..8c00d28 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -135,6 +135,7 @@ struct kernfs_node {
 	umode_t			mode;
 	unsigned int		ino;
 	struct kernfs_iattrs	*iattr;
+	u32			generation;
 };
 
 /*
@@ -164,6 +165,7 @@ struct kernfs_root {
 
 	/* private fields, do not use outside kernfs proper */
 	struct idr		ino_idr;
+	u32			next_generation;
 	struct kernfs_syscall_ops *syscall_ops;
 
 	/* list of kernfs_super_info of this root, protected by kernfs_mutex */
-- 
2.9.3

* [PATCH V4 03/12] kernfs: add an API to get kernfs node from inode number
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Add an API to get a kernfs node from an inode number. We will need this to
implement exportfs operations.

The API will also be used by blktrace later, so it should be as fast as
possible. To make the API lock free, kernfs nodes are freed in RCU context,
and we rely on the kernfs_node count and inode number to filter out stale
kernfs nodes.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c             | 57 +++++++++++++++++++++++++++++++++++++++++++++
 fs/kernfs/kernfs-internal.h |  2 ++
 fs/kernfs/mount.c           | 11 ++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 33f711f..7be37c8 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -508,6 +508,10 @@ void kernfs_put(struct kernfs_node *kn)
 	struct kernfs_node *parent;
 	struct kernfs_root *root;
 
+	/*
+	 * kernfs_node is freed with ->count 0, kernfs_find_and_get_node_by_ino
+	 * depends on this to filter reused stale node
+	 */
 	if (!kn || !atomic_dec_and_test(&kn->count))
 		return;
 	root = kernfs_root(kn);
@@ -649,6 +653,11 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	kn->ino = ret;
 	kn->generation = gen;
 
+	/*
+	 * set ino first. This barrier is paired with atomic_inc_not_zero in
+	 * kernfs_find_and_get_node_by_ino
+	 */
+	smp_mb__before_atomic();
 	atomic_set(&kn->count, 1);
 	atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
 	RB_CLEAR_NODE(&kn->rb);
@@ -680,6 +689,54 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 	return kn;
 }
 
+/*
+ * kernfs_find_and_get_node_by_ino - get kernfs_node from inode number
+ * @root: the kernfs root
+ * @ino: inode number
+ *
+ * RETURNS:
+ * NULL on failure. Return a kernfs node with reference counter incremented
+ */
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+						    unsigned int ino)
+{
+	struct kernfs_node *kn;
+
+	rcu_read_lock();
+	kn = idr_find(&root->ino_idr, ino);
+	if (!kn)
+		goto out;
+
+	/*
+	 * Since kernfs_node is freed in RCU, it's possible an old node for ino
+	 * is freed, but reused before RCU grace period. But a freed node (see
+	 * kernfs_put) or an incompletely initialized node (see
+	 * __kernfs_new_node) should have 'count' 0. We can use this fact to
+	 * filter out such node.
+	 */
+	if (!atomic_inc_not_zero(&kn->count)) {
+		kn = NULL;
+		goto out;
+	}
+
+	/*
+	 * The node could be a new node or a reused node. If it's a new node,
+	 * we are ok. If it's reused because of RCU (because of
+	 * SLAB_TYPESAFE_BY_RCU), the __kernfs_new_node always sets its 'ino'
+	 * before 'count'. So if 'count' is uptodate, 'ino' should be uptodate,
+	 * hence we can use 'ino' to filter stale node.
+	 */
+	if (kn->ino != ino)
+		goto out;
+	rcu_read_unlock();
+
+	return kn;
+out:
+	rcu_read_unlock();
+	kernfs_put(kn);
+	return NULL;
+}
+
 /**
  *	kernfs_add_one - add kernfs_node to parent without warning
  *	@kn: kernfs_node to be added
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 2d5144a..e9c226f 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -98,6 +98,8 @@ int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    unsigned flags);
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+						    unsigned int ino);
 
 /*
  * file.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d5b149a..69c48be 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -330,7 +330,16 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns)
 
 void __init kernfs_init(void)
 {
+
+	/*
+	 * the slab is freed in RCU context, so kernfs_find_and_get_node_by_ino
+	 * can access the slab lock free. This could introduce stale nodes,
+	 * please see how kernfs_find_and_get_node_by_ino filters out stale
+	 * nodes.
+	 */
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0,
+					      SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
+					      NULL);
 }
-- 
2.9.3

* [PATCH V4 04/12] kernfs: don't set dentry->d_fsdata
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

While adding exportfs operations to kernfs, I found it's hard to initialize
dentry->d_fsdata in the exportfs operations; there appears to be no way to
do it without a race condition. Looking at the kernfs code closely, there is
no point in setting dentry->d_fsdata: inode->i_private already points to the
kernfs_node, and we can get the inode from a dentry. So this patch simply
deletes the d_fsdata usage.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c             | 25 +++++++++----------------
 fs/kernfs/file.c            |  6 +++---
 fs/kernfs/inode.c           |  6 +++---
 fs/kernfs/kernfs-internal.h |  7 +++++++
 fs/kernfs/mount.c           |  8 ++------
 fs/kernfs/symlink.c         |  6 +++---
 6 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7be37c8..b61a7ef 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -566,7 +566,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	if (d_really_is_negative(dentry))
 		goto out_bad_unlocked;
 
-	kn = dentry->d_fsdata;
+	kn = kernfs_dentry_node(dentry);
 	mutex_lock(&kernfs_mutex);
 
 	/* The kernfs node has been deactivated */
@@ -574,7 +574,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		goto out_bad;
 
 	/* The kernfs node has been moved? */
-	if (dentry->d_parent->d_fsdata != kn->parent)
+	if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
 		goto out_bad;
 
 	/* The kernfs node has been renamed */
@@ -594,14 +594,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	return 0;
 }
 
-static void kernfs_dop_release(struct dentry *dentry)
-{
-	kernfs_put(dentry->d_fsdata);
-}
-
 const struct dentry_operations kernfs_dops = {
 	.d_revalidate	= kernfs_dop_revalidate,
-	.d_release	= kernfs_dop_release,
 };
 
 /**
@@ -617,8 +611,9 @@ const struct dentry_operations kernfs_dops = {
  */
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 {
-	if (dentry->d_sb->s_op == &kernfs_sops)
-		return dentry->d_fsdata;
+	if (dentry->d_sb->s_op == &kernfs_sops &&
+	    !d_really_is_negative(dentry))
+		return kernfs_dentry_node(dentry);
 	return NULL;
 }
 
@@ -1056,7 +1051,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 					unsigned int flags)
 {
 	struct dentry *ret;
-	struct kernfs_node *parent = dentry->d_parent->d_fsdata;
+	struct kernfs_node *parent = dir->i_private;
 	struct kernfs_node *kn;
 	struct inode *inode;
 	const void *ns = NULL;
@@ -1073,8 +1068,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 		ret = NULL;
 		goto out_unlock;
 	}
-	kernfs_get(kn);
-	dentry->d_fsdata = kn;
 
 	/* attach dentry and inode */
 	inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1111,7 +1104,7 @@ static int kernfs_iop_mkdir(struct inode *dir, struct dentry *dentry,
 
 static int kernfs_iop_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	struct kernfs_node *kn  = dentry->d_fsdata;
+	struct kernfs_node *kn  = kernfs_dentry_node(dentry);
 	struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
 	int ret;
 
@@ -1131,7 +1124,7 @@ static int kernfs_iop_rename(struct inode *old_dir, struct dentry *old_dentry,
 			     struct inode *new_dir, struct dentry *new_dentry,
 			     unsigned int flags)
 {
-	struct kernfs_node *kn  = old_dentry->d_fsdata;
+	struct kernfs_node *kn = kernfs_dentry_node(old_dentry);
 	struct kernfs_node *new_parent = new_dir->i_private;
 	struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
 	int ret;
@@ -1644,7 +1637,7 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dentry = file->f_path.dentry;
-	struct kernfs_node *parent = dentry->d_fsdata;
+	struct kernfs_node *parent = kernfs_dentry_node(dentry);
 	struct kernfs_node *pos = file->private_data;
 	const void *ns = NULL;
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ac2dfe0..7f90d4d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -616,7 +616,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
 {
-	struct kernfs_node *kn = file->f_path.dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_root *root = kernfs_root(kn);
 	const struct kernfs_ops *ops;
 	struct kernfs_open_file *of;
@@ -768,7 +768,7 @@ static void kernfs_release_file(struct kernfs_node *kn,
 
 static int kernfs_fop_release(struct inode *inode, struct file *filp)
 {
-	struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_open_file *of = kernfs_of(filp);
 
 	if (kn->flags & KERNFS_HAS_RELEASE) {
@@ -835,7 +835,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 static unsigned int kernfs_fop_poll(struct file *filp, poll_table *wait)
 {
 	struct kernfs_open_file *of = kernfs_of(filp);
-	struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+	struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry);
 	struct kernfs_open_node *on = kn->attr.open;
 
 	if (!kernfs_get_active(kn))
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 79cdae4..4c8b510 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -112,7 +112,7 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr)
 {
 	struct inode *inode = d_inode(dentry);
-	struct kernfs_node *kn = dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	int error;
 
 	if (!kn)
@@ -154,7 +154,7 @@ static int kernfs_node_setsecdata(struct kernfs_iattrs *attrs, void **secdata,
 
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size)
 {
-	struct kernfs_node *kn = dentry->d_fsdata;
+	struct kernfs_node *kn = kernfs_dentry_node(dentry);
 	struct kernfs_iattrs *attrs;
 
 	attrs = kernfs_iattrs(kn);
@@ -203,8 +203,8 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
 		       u32 request_mask, unsigned int query_flags)
 {
-	struct kernfs_node *kn = path->dentry->d_fsdata;
 	struct inode *inode = d_inode(path->dentry);
+	struct kernfs_node *kn = inode->i_private;
 
 	mutex_lock(&kernfs_mutex);
 	kernfs_refresh_inode(kn, inode);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index e9c226f..0f260dc 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -70,6 +70,13 @@ struct kernfs_super_info {
 };
 #define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
 
+static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
+{
+	if (d_really_is_negative(dentry))
+		return NULL;
+	return d_inode(dentry)->i_private;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache;
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 69c48be..acd5426 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -33,7 +33,7 @@ static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data)
 
 static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 {
-	struct kernfs_root *root = kernfs_root(dentry->d_fsdata);
+	struct kernfs_root *root = kernfs_root(kernfs_dentry_node(dentry));
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
 
 	if (scops && scops->show_options)
@@ -43,7 +43,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 
 static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
 {
-	struct kernfs_node *node = dentry->d_fsdata;
+	struct kernfs_node *node = kernfs_dentry_node(dentry);
 	struct kernfs_root *root = kernfs_root(node);
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
 
@@ -176,8 +176,6 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 		pr_debug("%s: could not get root dentry!\n", __func__);
 		return -ENOMEM;
 	}
-	kernfs_get(info->root->kn);
-	root->d_fsdata = info->root->kn;
 	sb->s_root = root;
 	sb->s_d_op = &kernfs_dops;
 	return 0;
@@ -283,7 +281,6 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 void kernfs_kill_sb(struct super_block *sb)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
-	struct kernfs_node *root_kn = sb->s_root->d_fsdata;
 
 	mutex_lock(&kernfs_mutex);
 	list_del(&info->node);
@@ -295,7 +292,6 @@ void kernfs_kill_sb(struct super_block *sb)
 	 */
 	kill_anon_super(sb);
 	kfree(info);
-	kernfs_put(root_kn);
 }
 
 /**
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 1684af4..08ccabd 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -98,9 +98,9 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
 	return 0;
 }
 
-static int kernfs_getlink(struct dentry *dentry, char *path)
+static int kernfs_getlink(struct inode *inode, char *path)
 {
-	struct kernfs_node *kn = dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_node *parent = kn->parent;
 	struct kernfs_node *target = kn->symlink.target_kn;
 	int error;
@@ -124,7 +124,7 @@ static const char *kernfs_iop_get_link(struct dentry *dentry,
 	body = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!body)
 		return ERR_PTR(-ENOMEM);
-	error = kernfs_getlink(dentry, body);
+	error = kernfs_getlink(inode, body);
 	if (unlikely(error < 0)) {
 		kfree(body);
 		return ERR_PTR(error);
-- 
2.9.3

* [PATCH V4 05/12] kernfs: introduce kernfs_node_id
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

The inode number and generation together identify a kernfs node. We are
going to export this identification via exportfs operations, so put ino and
generation into a separate structure. That makes it convenient for later
patches to use the identification.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c                  | 10 +++++-----
 fs/kernfs/file.c                 |  4 ++--
 fs/kernfs/inode.c                |  4 ++--
 include/linux/cgroup.h           |  2 +-
 include/linux/kernfs.h           | 12 ++++++++++--
 include/trace/events/writeback.h |  2 +-
 6 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index b61a7ef..89d1dc1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -539,7 +539,7 @@ void kernfs_put(struct kernfs_node *kn)
 	}
 	kfree(kn->iattr);
 	spin_lock(&kernfs_idr_lock);
-	idr_remove(&root->ino_idr, kn->ino);
+	idr_remove(&root->ino_idr, kn->id.ino);
 	spin_unlock(&kernfs_idr_lock);
 	kmem_cache_free(kernfs_node_cache, kn);
 
@@ -645,8 +645,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	idr_preload_end();
 	if (ret < 0)
 		goto err_out2;
-	kn->ino = ret;
-	kn->generation = gen;
+	kn->id.ino = ret;
+	kn->id.generation = gen;
 
 	/*
 	 * set ino first. This barrier is paired with atomic_inc_not_zero in
@@ -721,7 +721,7 @@ struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
 	 * before 'count'. So if 'count' is uptodate, 'ino' should be uptodate,
 	 * hence we can use 'ino' to filter stale node.
 	 */
-	if (kn->ino != ino)
+	if (kn->id.ino != ino)
 		goto out;
 	rcu_read_unlock();
 
@@ -1654,7 +1654,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		const char *name = pos->name;
 		unsigned int type = dt_type(pos);
 		int len = strlen(name);
-		ino_t ino = pos->ino;
+		ino_t ino = pos->id.ino;
 
 		ctx->pos = pos->hash;
 		file->private_data = pos;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 7f90d4d..7441925 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -895,7 +895,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 		 * have the matching @file available.  Look up the inodes
 		 * and generate the events manually.
 		 */
-		inode = ilookup(info->sb, kn->ino);
+		inode = ilookup(info->sb, kn->id.ino);
 		if (!inode)
 			continue;
 
@@ -903,7 +903,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 		if (parent) {
 			struct inode *p_inode;
 
-			p_inode = ilookup(info->sb, parent->ino);
+			p_inode = ilookup(info->sb, parent->id.ino);
 			if (p_inode) {
 				fsnotify(p_inode, FS_MODIFY | FS_EVENT_ON_CHILD,
 					 inode, FSNOTIFY_EVENT_INODE, kn->name, 0);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 4c8b510..a343039 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,7 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	inode->i_private = kn;
 	inode->i_mapping->a_ops = &kernfs_aops;
 	inode->i_op = &kernfs_iops;
-	inode->i_generation = kn->generation;
+	inode->i_generation = kn->id.generation;
 
 	set_default_inode_attr(inode, kn->mode);
 	kernfs_refresh_inode(kn, inode);
@@ -266,7 +266,7 @@ struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn)
 {
 	struct inode *inode;
 
-	inode = iget_locked(sb, kn->ino);
+	inode = iget_locked(sb, kn->id.ino);
 	if (inode && (inode->i_state & I_NEW))
 		kernfs_init_inode(kn, inode);
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 710a005..30c6877 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -543,7 +543,7 @@ static inline bool cgroup_is_populated(struct cgroup *cgrp)
 /* returns ino associated with a cgroup */
 static inline ino_t cgroup_ino(struct cgroup *cgrp)
 {
-	return cgrp->kn->ino;
+	return cgrp->kn->id.ino;
 }
 
 /* cft/css accessors for cftype->write() operation */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 8c00d28..06a0c59 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -95,6 +95,15 @@ struct kernfs_elem_attr {
 	struct kernfs_node	*notify_next;	/* for kernfs_notify() */
 };
 
+/* represent a kernfs node */
+union kernfs_node_id {
+	struct {
+		u32		ino;
+		u32		generation;
+	};
+	u64			id;
+};
+
 /*
  * kernfs_node - the building block of kernfs hierarchy.  Each and every
  * kernfs node is represented by single kernfs_node.  Most fields are
@@ -131,11 +140,10 @@ struct kernfs_node {
 
 	void			*priv;
 
+	union kernfs_node_id	id;
 	unsigned short		flags;
 	umode_t			mode;
-	unsigned int		ino;
 	struct kernfs_iattrs	*iattr;
-	u32			generation;
 };
 
 /*
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 7bd8783..9b57f01 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -136,7 +136,7 @@ DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode,
 
 static inline unsigned int __trace_wb_assign_cgroup(struct bdi_writeback *wb)
 {
-	return wb->memcg_css->cgroup->kn->ino;
+	return wb->memcg_css->cgroup->kn->id.ino;
 }
 
 static inline unsigned int __trace_wbc_assign_cgroup(struct writeback_control *wbc)
-- 
2.9.3

* [PATCH V4 06/12] kernfs: add exportfs operations
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Now we have the facilities to implement exportfs operations. The idea is
that a cgroup can export its fhandle info to userspace, and userspace then
uses the fhandle to find the cgroup name. Another example: userspace can get
the fhandle for a cgroup and a BPF program can use the fhandle to filter
info for that cgroup.
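
As a rough sketch (not part of this patch), once these export operations
are in place userspace could map an fhandle back to a cgroup path roughly
like this; it assumes an open fd on the cgroup filesystem mount and
CAP_DAC_READ_SEARCH:

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Resolve a cgroup fhandle to its path via /proc/self/fd. */
static void print_cgroup_path(int mnt_fd, struct file_handle *fh)
{
	char proc[64], path[PATH_MAX];
	ssize_t n;
	int fd;

	fd = open_by_handle_at(mnt_fd, fh, O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open_by_handle_at");
		return;
	}
	snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
	n = readlink(proc, path, sizeof(path) - 1);
	if (n > 0) {
		path[n] = '\0';
		printf("cgroup: %s\n", path);
	}
	close(fd);
}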

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/mount.c      | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h | 12 +++++++++++
 kernel/cgroup/cgroup.c |  3 ++-
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index acd5426..fa32358 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/namei.h>
 #include <linux/seq_file.h>
+#include <linux/exportfs.h>
 
 #include "kernfs-internal.h"
 
@@ -64,6 +65,59 @@ const struct super_operations kernfs_sops = {
 	.show_path	= kernfs_sop_show_path,
 };
 
+static struct inode *kernfs_fh_get_inode(struct super_block *sb,
+		u64 ino, u32 generation)
+{
+	struct kernfs_super_info *info = kernfs_info(sb);
+	struct inode *inode;
+	struct kernfs_node *kn;
+
+	if (ino == 0)
+		return ERR_PTR(-ESTALE);
+
+	kn = kernfs_find_and_get_node_by_ino(info->root, ino);
+	if (!kn)
+		return ERR_PTR(-ESTALE);
+	inode = kernfs_get_inode(sb, kn);
+	kernfs_put(kn);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	if (generation && inode->i_generation != generation) {
+		/* we didn't find the right inode.. */
+		iput(inode);
+		return ERR_PTR(-ESTALE);
+	}
+	return inode;
+}
+
+static struct dentry *kernfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_get_parent_dentry(struct dentry *child)
+{
+	struct kernfs_node *kn = kernfs_dentry_node(child);
+
+	return d_obtain_alias(kernfs_get_inode(child->d_sb, kn->parent));
+}
+
+static const struct export_operations kernfs_export_ops = {
+	.fh_to_dentry	= kernfs_fh_to_dentry,
+	.fh_to_parent	= kernfs_fh_to_parent,
+	.get_parent	= kernfs_get_parent_dentry,
+};
+
 /**
  * kernfs_root_from_sb - determine kernfs_root associated with a super_block
  * @sb: the super_block in question
@@ -159,6 +213,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 	sb->s_magic = magic;
 	sb->s_op = &kernfs_sops;
 	sb->s_xattr = kernfs_xattr_handlers;
+	if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
+		sb->s_export_op = &kernfs_export_ops;
 	sb->s_time_gran = 1;
 
 	/* get root inode, initialize and unlock it */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 06a0c59..d149361 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -69,6 +69,12 @@ enum kernfs_root_flag {
 	 * following flag enables that behavior.
 	 */
 	KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK	= 0x0002,
+
+	/*
+	 * The filesystem supports exportfs operation, so userspace can use
+	 * fhandle to access nodes of the fs.
+	 */
+	KERNFS_ROOT_SUPPORT_EXPORTOP		= 0x0004,
 };
 
 /* type-specific structures for kernfs_node union members */
@@ -98,6 +104,12 @@ struct kernfs_elem_attr {
 /* represent a kernfs node */
 union kernfs_node_id {
 	struct {
+		/*
+		 * blktrace will export this struct as a simplified 'struct
+		 * fid' (which is a big data structure), so userspace can use
+		 * it to find kernfs node. The layout must match the first two
+		 * fields of 'struct fid' exactly.
+		 */
 		u32		ino;
 		u32		generation;
 	};
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index dbfd702..4a5cbe9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1691,7 +1691,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
 		&cgroup_kf_syscall_ops : &cgroup1_kf_syscall_ops;
 
 	root->kf_root = kernfs_create_root(kf_sops,
-					   KERNFS_ROOT_CREATE_DEACTIVATED,
+					   KERNFS_ROOT_CREATE_DEACTIVATED |
+					   KERNFS_ROOT_SUPPORT_EXPORTOP,
 					   root_cgrp);
 	if (IS_ERR(root->kf_root)) {
 		ret = PTR_ERR(root->kf_root);
-- 
2.9.3

* [PATCH V4 07/12] cgroup: export fhandle info for a cgroup
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Add an API to export cgroup fhandle info. We don't export a full 'struct
file_handle', because it carries info we don't need. Specifically, a cgroup
is always a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type
fhandle; we only need to export the inode number and generation number,
just like generic_fh_to_dentry does. We can also avoid the overhead of
getting an inode, since kernfs_node_id (ino and generation) already has all
the required info.
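
For illustration, a hedged userspace sketch of rebuilding a 'struct
file_handle' from the exported ino/generation pair; it assumes the
FILEID_INO32_GEN handle type (value 1), which is what generic_fh_to_dentry
accepts:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Build a handle from the exported id; usable with open_by_handle_at(). */
static struct file_handle *cgroup_fhandle(uint32_t ino, uint32_t gen)
{
	uint32_t fid[2] = { ino, gen };
	struct file_handle *fh;

	fh = malloc(sizeof(*fh) + sizeof(fid));
	if (!fh)
		return NULL;
	fh->handle_bytes = sizeof(fid);
	fh->handle_type = 1;	/* FILEID_INO32_GEN, assumed */
	memcpy(fh->f_handle, fid, sizeof(fid));
	return fh;
}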

Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/cgroup.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 30c6877..52ef9a6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -609,6 +609,10 @@ static inline void cgroup_kthread_ready(void)
 	current->no_cgroup_migration = 0;
 }
 
+static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
+{
+	return &cgrp->kn->id;
+}
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -631,6 +635,10 @@ static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_init_kthreadd(void) {}
 static inline void cgroup_kthread_ready(void) {}
+static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
+{
+	return NULL;
+}
 
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
-- 
2.9.3

* [PATCH V4 08/12] blktrace: export cgroup info in trace
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Currently blktrace isn't cgroup aware. blktrace prints out the task name of
the current context, but the task of the current context isn't always in
the cgroup where the BIO comes from, so the task name can't be used to find
the IO cgroup. For example, writeback BIOs always come from the flusher
thread, but the BIOs belong to different blk cgroups. A request could be
requeued and dispatched from a completely different task. MD/DM are other
examples.

This patch tries to close the gap. We print cgroup fhandle info in
blktrace. Userspace can use the open_by_handle_at() syscall to find the
cgroup by fhandle, or it can use the name_to_handle_at() syscall to get the
fhandle for a cgroup and use a BPF program to filter blktrace output to
that specific cgroup.

We add a new 'blk_cgroup' trace option for the blk tracer. It is off by
default, so applications that don't know about the new option aren't
affected. When it's on, we output the fhandle info right after the
blk_io_trace record, with an extra bit set in the event action. So from the
application's point of view, blktrace with the option enabled outputs new
actions.
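
For illustration only, a sketch of how a userspace consumer might parse the
extended record; it assumes a complete 'struct blk_io_trace' read from the
relay channel and mirrors the layout described above (the 8-byte id is
counted in pdu_len):

#include <stdio.h>
#include <linux/types.h>
#include <linux/blktrace_api.h>

/* userspace mirror of union kernfs_node_id */
struct cgid {
	__u32 ino;
	__u32 generation;
};

static void handle_record(const struct blk_io_trace *t)
{
	const void *pdu = t + 1;
	__u16 pdu_len = t->pdu_len;

	if (t->action & (1U << 8)) {	/* __BLK_TA_CGROUP / __BLK_TN_CGROUP */
		const struct cgid *id = pdu;

		printf("cgroup ino %u gen %u\n", id->ino, id->generation);
		pdu = id + 1;		/* real payload follows the id */
		pdu_len -= sizeof(*id);
	}
	printf("%u payload byte(s) follow\n", pdu_len);
}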

I didn't change the blk trace events yet, since I'm not sure whether
changing the trace event output is an ABI issue. If it isn't, I'll do it
later.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/uapi/linux/blktrace_api.h |   3 +
 kernel/trace/blktrace.c           | 231 ++++++++++++++++++++++++++------------
 2 files changed, 161 insertions(+), 73 deletions(-)

diff --git a/include/uapi/linux/blktrace_api.h b/include/uapi/linux/blktrace_api.h
index c590ca6..9cdaede 100644
--- a/include/uapi/linux/blktrace_api.h
+++ b/include/uapi/linux/blktrace_api.h
@@ -52,6 +52,7 @@ enum blktrace_act {
 	__BLK_TA_REMAP,			/* bio was remapped */
 	__BLK_TA_ABORT,			/* request aborted */
 	__BLK_TA_DRV_DATA,		/* driver-specific binary data */
+	__BLK_TA_CGROUP = 1 << 8,	/* from a cgroup*/
 };
 
 /*
@@ -61,6 +62,7 @@ enum blktrace_notify {
 	__BLK_TN_PROCESS = 0,		/* establish pid/name mapping */
 	__BLK_TN_TIMESTAMP,		/* include system clock */
 	__BLK_TN_MESSAGE,		/* Character string message */
+	__BLK_TN_CGROUP = __BLK_TA_CGROUP, /* from a cgroup */
 };
 
 
@@ -107,6 +109,7 @@ struct blk_io_trace {
 	__u32 cpu;		/* on what cpu did it happen */
 	__u16 error;		/* completion error */
 	__u16 pdu_len;		/* length of data after this trace */
+	/* cgroup id will be stored here if exists */
 };
 
 /*
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index bc364f8..f393d7a 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -27,6 +27,7 @@
 #include <linux/time.h>
 #include <linux/uaccess.h>
 #include <linux/list.h>
+#include <linux/blk-cgroup.h>
 
 #include "../../block/blk.h"
 
@@ -46,10 +47,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
 
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC	0x1
+#define TRACE_BLK_OPT_CGROUP	0x2
 
 static struct tracer_opt blk_tracer_opts[] = {
 	/* Default disable the minimalistic output */
 	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
+#ifdef CONFIG_BLK_CGROUP
+	{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+#endif
 	{ }
 };
 
@@ -68,7 +73,8 @@ static void blk_unregister_tracepoints(void);
  * Send out a notify message.
  */
 static void trace_note(struct blk_trace *bt, pid_t pid, int action,
-		       const void *data, size_t len)
+		       const void *data, size_t len,
+		       union kernfs_node_id *cgid)
 {
 	struct blk_io_trace *t;
 	struct ring_buffer_event *event = NULL;
@@ -76,12 +82,13 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 	int pc = 0;
 	int cpu = smp_processor_id();
 	bool blk_tracer = blk_tracer_enabled;
+	ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;
 
 	if (blk_tracer) {
 		buffer = blk_tr->trace_buffer.buffer;
 		pc = preempt_count();
 		event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
-						  sizeof(*t) + len,
+						  sizeof(*t) + len + cgid_len,
 						  0, pc);
 		if (!event)
 			return;
@@ -92,17 +99,19 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 	if (!bt->rchan)
 		return;
 
-	t = relay_reserve(bt->rchan, sizeof(*t) + len);
+	t = relay_reserve(bt->rchan, sizeof(*t) + len + cgid_len);
 	if (t) {
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
 		t->time = ktime_to_ns(ktime_get());
 record_it:
 		t->device = bt->dev;
-		t->action = action;
+		t->action = action | (cgid ? __BLK_TN_CGROUP : 0);
 		t->pid = pid;
 		t->cpu = cpu;
-		t->pdu_len = len;
-		memcpy((void *) t + sizeof(*t), data, len);
+		t->pdu_len = len + cgid_len;
+		if (cgid)
+			memcpy((void *)t + sizeof(*t), cgid, cgid_len);
+		memcpy((void *) t + sizeof(*t) + cgid_len, data, len);
 
 		if (blk_tracer)
 			trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -122,7 +131,7 @@ static void trace_note_tsk(struct task_struct *tsk)
 	spin_lock_irqsave(&running_trace_lock, flags);
 	list_for_each_entry(bt, &running_trace_list, running_list) {
 		trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm,
-			   sizeof(tsk->comm));
+			   sizeof(tsk->comm), NULL);
 	}
 	spin_unlock_irqrestore(&running_trace_lock, flags);
 }
@@ -139,7 +148,7 @@ static void trace_note_time(struct blk_trace *bt)
 	words[1] = now.tv_nsec;
 
 	local_irq_save(flags);
-	trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words));
+	trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words), NULL);
 	local_irq_restore(flags);
 }
 
@@ -167,7 +176,7 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
 	n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
 	va_end(args);
 
-	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n);
+	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -204,7 +213,7 @@ static const u32 ddir_act[2] = { BLK_TC_ACT(BLK_TC_READ),
  */
 static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		     int op, int op_flags, u32 what, int error, int pdu_len,
-		     void *pdu_data)
+		     void *pdu_data, union kernfs_node_id *cgid)
 {
 	struct task_struct *tsk = current;
 	struct ring_buffer_event *event = NULL;
@@ -215,6 +224,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	pid_t pid;
 	int cpu, pc = 0;
 	bool blk_tracer = blk_tracer_enabled;
+	ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;
 
 	if (unlikely(bt->trace_state != Blktrace_running && !blk_tracer))
 		return;
@@ -229,6 +239,8 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		what |= BLK_TC_ACT(BLK_TC_DISCARD);
 	if (op == REQ_OP_FLUSH)
 		what |= BLK_TC_ACT(BLK_TC_FLUSH);
+	if (cgid)
+		what |= __BLK_TA_CGROUP;
 
 	pid = tsk->pid;
 	if (act_log_check(bt, what, sector, pid))
@@ -241,7 +253,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		buffer = blk_tr->trace_buffer.buffer;
 		pc = preempt_count();
 		event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
-						  sizeof(*t) + pdu_len,
+						  sizeof(*t) + pdu_len + cgid_len,
 						  0, pc);
 		if (!event)
 			return;
@@ -258,7 +270,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	 * from coming in and stepping on our toes.
 	 */
 	local_irq_save(flags);
-	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
+	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len + cgid_len);
 	if (t) {
 		sequence = per_cpu_ptr(bt->sequence, cpu);
 
@@ -280,10 +292,12 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		t->action = what;
 		t->device = bt->dev;
 		t->error = error;
-		t->pdu_len = pdu_len;
+		t->pdu_len = pdu_len + cgid_len;
 
+		if (cgid_len)
+			memcpy((void *)t + sizeof(*t), cgid, cgid_len);
 		if (pdu_len)
-			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
+			memcpy((void *)t + sizeof(*t) + cgid_len, pdu_data, pdu_len);
 
 		if (blk_tracer) {
 			trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -684,6 +698,36 @@ void blk_trace_shutdown(struct request_queue *q)
 	}
 }
 
+#ifdef CONFIG_BLK_CGROUP
+static union kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+	struct blk_trace *bt = q->blk_trace;
+
+	if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+		return NULL;
+
+	if (!bio->bi_css)
+		return NULL;
+	return cgroup_get_kernfs_id(bio->bi_css->cgroup);
+}
+#else
+static union kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+	return NULL;
+}
+#endif
+
+static union kernfs_node_id *
+blk_trace_request_get_cgid(struct request_queue *q, struct request *rq)
+{
+	if (!rq->bio)
+		return NULL;
+	/* Use the first bio */
+	return blk_trace_bio_get_cgid(q, rq->bio);
+}
+
 /*
  * blktrace probes
  */
@@ -694,13 +738,15 @@ void blk_trace_shutdown(struct request_queue *q)
  * @error:	return status to log
  * @nr_bytes:	number of completed bytes
  * @what:	the action
+ * @cgid:	the cgroup info
  *
  * Description:
  *     Records an action against a request. Will log the bio offset + size.
  *
  **/
 static void blk_add_trace_rq(struct request *rq, int error,
-			     unsigned int nr_bytes, u32 what)
+			     unsigned int nr_bytes, u32 what,
+			     union kernfs_node_id *cgid)
 {
 	struct blk_trace *bt = rq->q->blk_trace;
 
@@ -713,32 +759,36 @@ static void blk_add_trace_rq(struct request *rq, int error,
 		what |= BLK_TC_ACT(BLK_TC_FS);
 
 	__blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, req_op(rq),
-			rq->cmd_flags, what, error, 0, NULL);
+			rq->cmd_flags, what, error, 0, NULL, cgid);
 }
 
 static void blk_add_trace_rq_insert(void *ignore,
 				    struct request_queue *q, struct request *rq)
 {
-	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT);
+	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT,
+			 blk_trace_request_get_cgid(q, rq));
 }
 
 static void blk_add_trace_rq_issue(void *ignore,
 				   struct request_queue *q, struct request *rq)
 {
-	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE);
+	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE,
+			 blk_trace_request_get_cgid(q, rq));
 }
 
 static void blk_add_trace_rq_requeue(void *ignore,
 				     struct request_queue *q,
 				     struct request *rq)
 {
-	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE);
+	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE,
+			 blk_trace_request_get_cgid(q, rq));
 }
 
 static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
 			int error, unsigned int nr_bytes)
 {
-	blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE);
+	blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE,
+			 blk_trace_request_get_cgid(rq->q, rq));
 }
 
 /**
@@ -753,7 +803,7 @@ static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
  *
  **/
 static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
-			      u32 what, int error)
+			      u32 what, int error, union kernfs_node_id *cgid)
 {
 	struct blk_trace *bt = q->blk_trace;
 
@@ -761,20 +811,22 @@ static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
 		return;
 
 	__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
-			bio_op(bio), bio->bi_opf, what, error, 0, NULL);
+			bio_op(bio), bio->bi_opf, what, error, 0, NULL, cgid);
 }
 
 static void blk_add_trace_bio_bounce(void *ignore,
 				     struct request_queue *q, struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_complete(void *ignore,
 				       struct request_queue *q, struct bio *bio,
 				       int error)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error);
+	blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_backmerge(void *ignore,
@@ -782,7 +834,8 @@ static void blk_add_trace_bio_backmerge(void *ignore,
 					struct request *rq,
 					struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0,
+			 blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_frontmerge(void *ignore,
@@ -790,13 +843,15 @@ static void blk_add_trace_bio_frontmerge(void *ignore,
 					 struct request *rq,
 					 struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_queue(void *ignore,
 				    struct request_queue *q, struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_getrq(void *ignore,
@@ -804,13 +859,14 @@ static void blk_add_trace_getrq(void *ignore,
 				struct bio *bio, int rw)
 {
 	if (bio)
-		blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0);
+		blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0,
+				  blk_trace_bio_get_cgid(q, bio));
 	else {
 		struct blk_trace *bt = q->blk_trace;
 
 		if (bt)
 			__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_GETRQ, 0, 0,
-					NULL);
+					NULL, NULL);
 	}
 }
 
@@ -820,13 +876,14 @@ static void blk_add_trace_sleeprq(void *ignore,
 				  struct bio *bio, int rw)
 {
 	if (bio)
-		blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0);
+		blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0,
+				  blk_trace_bio_get_cgid(q, bio));
 	else {
 		struct blk_trace *bt = q->blk_trace;
 
 		if (bt)
 			__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_SLEEPRQ,
-					0, 0, NULL);
+					0, 0, NULL, NULL);
 	}
 }
 
@@ -835,7 +892,7 @@ static void blk_add_trace_plug(void *ignore, struct request_queue *q)
 	struct blk_trace *bt = q->blk_trace;
 
 	if (bt)
-		__blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL);
+		__blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, NULL);
 }
 
 static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
@@ -852,7 +909,7 @@ static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
 		else
 			what = BLK_TA_UNPLUG_TIMER;
 
-		__blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+		__blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, NULL);
 	}
 }
 
@@ -868,7 +925,7 @@ static void blk_add_trace_split(void *ignore,
 		__blk_add_trace(bt, bio->bi_iter.bi_sector,
 				bio->bi_iter.bi_size, bio_op(bio), bio->bi_opf,
 				BLK_TA_SPLIT, bio->bi_status, sizeof(rpdu),
-				&rpdu);
+				&rpdu, blk_trace_bio_get_cgid(q, bio));
 	}
 }
 
@@ -901,7 +958,7 @@ static void blk_add_trace_bio_remap(void *ignore,
 
 	__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
 			bio_op(bio), bio->bi_opf, BLK_TA_REMAP, bio->bi_status,
-			sizeof(r), &r);
+			sizeof(r), &r, blk_trace_bio_get_cgid(q, bio));
 }
 
 /**
@@ -934,7 +991,7 @@ static void blk_add_trace_rq_remap(void *ignore,
 
 	__blk_add_trace(bt, blk_rq_pos(rq), blk_rq_bytes(rq),
 			rq_data_dir(rq), 0, BLK_TA_REMAP, 0,
-			sizeof(r), &r);
+			sizeof(r), &r, blk_trace_request_get_cgid(q, rq));
 }
 
 /**
@@ -958,7 +1015,8 @@ void blk_add_driver_data(struct request_queue *q,
 		return;
 
 	__blk_add_trace(bt, blk_rq_trace_sector(rq), blk_rq_bytes(rq), 0, 0,
-				BLK_TA_DRV_DATA, 0, len, data);
+				BLK_TA_DRV_DATA, 0, len, data,
+				blk_trace_request_get_cgid(q, rq));
 }
 EXPORT_SYMBOL_GPL(blk_add_driver_data);
 
@@ -1031,7 +1089,7 @@ static void fill_rwbs(char *rwbs, const struct blk_io_trace *t)
 	int i = 0;
 	int tc = t->action >> BLK_TC_SHIFT;
 
-	if (t->action == BLK_TN_MESSAGE) {
+	if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
 		rwbs[i++] = 'N';
 		goto out;
 	}
@@ -1066,9 +1124,21 @@ const struct blk_io_trace *te_blk_io_trace(const struct trace_entry *ent)
 	return (const struct blk_io_trace *)ent;
 }
 
-static inline const void *pdu_start(const struct trace_entry *ent)
+static inline const void *pdu_start(const struct trace_entry *ent, bool has_cg)
 {
-	return te_blk_io_trace(ent) + 1;
+	return (void *)(te_blk_io_trace(ent) + 1) +
+		(has_cg ? sizeof(union kernfs_node_id) : 0);
+}
+
+static inline const void *cgid_start(const struct trace_entry *ent)
+{
+	return (void *)(te_blk_io_trace(ent) + 1);
+}
+
+static inline int pdu_real_len(const struct trace_entry *ent, bool has_cg)
+{
+	return te_blk_io_trace(ent)->pdu_len -
+			(has_cg ? sizeof(union kernfs_node_id) : 0);
 }
 
 static inline u32 t_action(const struct trace_entry *ent)
@@ -1096,16 +1166,16 @@ static inline __u16 t_error(const struct trace_entry *ent)
 	return te_blk_io_trace(ent)->error;
 }
 
-static __u64 get_pdu_int(const struct trace_entry *ent)
+static __u64 get_pdu_int(const struct trace_entry *ent, bool has_cg)
 {
-	const __u64 *val = pdu_start(ent);
+	const __u64 *val = pdu_start(ent, has_cg);
 	return be64_to_cpu(*val);
 }
 
 static void get_pdu_remap(const struct trace_entry *ent,
-			  struct blk_io_trace_remap *r)
+			  struct blk_io_trace_remap *r, bool has_cg)
 {
-	const struct blk_io_trace_remap *__r = pdu_start(ent);
+	const struct blk_io_trace_remap *__r = pdu_start(ent, has_cg);
 	__u64 sector_from = __r->sector_from;
 
 	r->device_from = be32_to_cpu(__r->device_from);
@@ -1113,9 +1183,11 @@ static void get_pdu_remap(const struct trace_entry *ent,
 	r->sector_from = be64_to_cpu(sector_from);
 }
 
-typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act);
+typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act,
+	bool has_cg);
 
-static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
+static void blk_log_action_classic(struct trace_iterator *iter, const char *act,
+	bool has_cg)
 {
 	char rwbs[RWBS_LEN];
 	unsigned long long ts  = iter->ts;
@@ -1131,24 +1203,33 @@ static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
 			 secs, nsec_rem, iter->ent->pid, act, rwbs);
 }
 
-static void blk_log_action(struct trace_iterator *iter, const char *act)
+static void blk_log_action(struct trace_iterator *iter, const char *act,
+	bool has_cg)
 {
 	char rwbs[RWBS_LEN];
 	const struct blk_io_trace *t = te_blk_io_trace(iter->ent);
 
 	fill_rwbs(rwbs, t);
-	trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
-			 MAJOR(t->device), MINOR(t->device), act, rwbs);
+	if (has_cg) {
+		const union kernfs_node_id *id = cgid_start(iter->ent);
+
+		trace_seq_printf(&iter->seq, "%3d,%-3d %x,%-x %2s %3s ",
+				 MAJOR(t->device), MINOR(t->device),
+				 id->ino, id->generation, act, rwbs);
+	} else
+		trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
+				 MAJOR(t->device), MINOR(t->device), act, rwbs);
 }
 
-static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_dump_pdu(struct trace_seq *s,
+	const struct trace_entry *ent, bool has_cg)
 {
 	const unsigned char *pdu_buf;
 	int pdu_len;
 	int i, end;
 
-	pdu_buf = pdu_start(ent);
-	pdu_len = te_blk_io_trace(ent)->pdu_len;
+	pdu_buf = pdu_start(ent, has_cg);
+	pdu_len = pdu_real_len(ent, has_cg);
 
 	if (!pdu_len)
 		return;
@@ -1179,7 +1260,7 @@ static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
 	trace_seq_puts(s, ") ");
 }
 
-static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
@@ -1187,7 +1268,7 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
 
 	if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
 		trace_seq_printf(s, "%u ", t_bytes(ent));
-		blk_log_dump_pdu(s, ent);
+		blk_log_dump_pdu(s, ent, has_cg);
 		trace_seq_printf(s, "[%s]\n", cmd);
 	} else {
 		if (t_sec(ent))
@@ -1199,10 +1280,10 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
 }
 
 static void blk_log_with_error(struct trace_seq *s,
-			      const struct trace_entry *ent)
+			      const struct trace_entry *ent, bool has_cg)
 {
 	if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
-		blk_log_dump_pdu(s, ent);
+		blk_log_dump_pdu(s, ent, has_cg);
 		trace_seq_printf(s, "[%d]\n", t_error(ent));
 	} else {
 		if (t_sec(ent))
@@ -1215,18 +1296,18 @@ static void blk_log_with_error(struct trace_seq *s,
 	}
 }
 
-static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	struct blk_io_trace_remap r = { .device_from = 0, };
 
-	get_pdu_remap(ent, &r);
+	get_pdu_remap(ent, &r, has_cg);
 	trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n",
 			 t_sector(ent), t_sec(ent),
 			 MAJOR(r.device_from), MINOR(r.device_from),
 			 (unsigned long long)r.sector_from);
 }
 
-static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
@@ -1235,30 +1316,31 @@ static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
 	trace_seq_printf(s, "[%s]\n", cmd);
 }
 
-static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
 	trace_find_cmdline(ent->pid, cmd);
 
-	trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent));
+	trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg));
 }
 
-static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
 	trace_find_cmdline(ent->pid, cmd);
 
 	trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
-			 get_pdu_int(ent), cmd);
+			 get_pdu_int(ent, has_cg), cmd);
 }
 
-static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent,
+			bool has_cg)
 {
-	const struct blk_io_trace *t = te_blk_io_trace(ent);
 
-	trace_seq_putmem(s, t + 1, t->pdu_len);
+	trace_seq_putmem(s, pdu_start(ent, has_cg),
+		pdu_real_len(ent, has_cg));
 	trace_seq_putc(s, '\n');
 }
 
@@ -1298,7 +1380,8 @@ static void blk_tracer_reset(struct trace_array *tr)
 
 static const struct {
 	const char *act[2];
-	void	   (*print)(struct trace_seq *s, const struct trace_entry *ent);
+	void	   (*print)(struct trace_seq *s, const struct trace_entry *ent,
+			    bool has_cg);
 } what2act[] = {
 	[__BLK_TA_QUEUE]	= {{  "Q", "queue" },	   blk_log_generic },
 	[__BLK_TA_BACKMERGE]	= {{  "M", "backmerge" },  blk_log_generic },
@@ -1326,23 +1409,25 @@ static enum print_line_t print_one_line(struct trace_iterator *iter,
 	u16 what;
 	bool long_act;
 	blk_log_action_t *log_action;
+	bool has_cg;
 
 	t	   = te_blk_io_trace(iter->ent);
-	what	   = t->action & ((1 << BLK_TC_SHIFT) - 1);
+	what	   = (t->action & ((1 << BLK_TC_SHIFT) - 1)) & ~__BLK_TA_CGROUP;
 	long_act   = !!(tr->trace_flags & TRACE_ITER_VERBOSE);
 	log_action = classic ? &blk_log_action_classic : &blk_log_action;
+	has_cg	   = t->action & __BLK_TA_CGROUP;
 
-	if (t->action == BLK_TN_MESSAGE) {
-		log_action(iter, long_act ? "message" : "m");
-		blk_log_msg(s, iter->ent);
+	if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
+		log_action(iter, long_act ? "message" : "m", has_cg);
+		blk_log_msg(s, iter->ent, has_cg);
 		return trace_handle_return(s);
 	}
 
 	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
 		trace_seq_printf(s, "Unknown action %x\n", what);
 	else {
-		log_action(iter, what2act[what].act[long_act]);
-		what2act[what].print(s, iter->ent);
+		log_action(iter, what2act[what].act[long_act], has_cg);
+		what2act[what].print(s, iter->ent, has_cg);
 	}
 
 	return trace_handle_return(s);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V4 09/12] block: always attach cgroup info into bio
  2017-06-28 16:29 [PATCH V4 00/12] blktrace: output cgroup info Shaohua Li
                   ` (7 preceding siblings ...)
  2017-06-28 16:29 ` [PATCH V4 08/12] blktrace: export cgroup info in trace Shaohua Li
@ 2017-06-28 16:29 ` Shaohua Li
  2017-06-28 16:30 ` [PATCH V4 10/12] block: call __bio_free in bio_endio Shaohua Li
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Shaohua Li @ 2017-06-28 16:29 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

blkcg_bio_issue_check() already looks up the blkcg for a BIO.
bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
operation. There is no reason not to attach the cgroup info to the bio in
blkcg_bio_issue_check(). This also makes blktrace output the correct
cgroup info.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c       | 7 +------
 include/linux/blk-cgroup.h | 3 +++
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7285bf..a6ebd2b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2104,14 +2104,9 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td)
 static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
 {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
-	int ret;
-
-	ret = bio_associate_current(bio);
-	if (ret == 0 || ret == -EBUSY)
+	if (bio->bi_css)
 		bio->bi_cg_private = tg;
 	blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
-#else
-	bio_associate_current(bio);
 #endif
 }
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 7104bea..9d92153 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -691,6 +691,9 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 	rcu_read_lock();
 	blkcg = bio_blkcg(bio);
 
+	/* associate blkcg if bio hasn't attached one */
+	bio_associate_blkcg(bio, &blkcg->css);
+
 	blkg = blkg_lookup(blkcg, q);
 	if (unlikely(!blkg)) {
 		spin_lock_irq(q->queue_lock);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V4 10/12] block: call __bio_free in bio_endio
  2017-06-28 16:29 [PATCH V4 00/12] blktrace: output cgroup info Shaohua Li
                   ` (8 preceding siblings ...)
  2017-06-28 16:29 ` [PATCH V4 09/12] block: always attach cgroup info into bio Shaohua Li
@ 2017-06-28 16:30 ` Shaohua Li
  2017-06-28 21:29   ` Christoph Hellwig
  2017-06-28 16:30 ` [PATCH V4 11/12] blktrace: add an option to allow displying cgroup path Shaohua Li
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Shaohua Li @ 2017-06-28 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

bio_free isn't a good place to free cgroup/integrity info. In a lot of
cases a bio is allocated in a special way (for example, on the stack) and
bio_put, hence bio_free, is never called on it, so we leak memory. This
patch moves the freeing to bio_endio, which should be called anyway. The
__bio_free call in bio_free is kept, in case bio_endio is never called on
the bio.

This assumes ->bi_end_io() doesn't access cgroup/integrity info, which
seems true in my audit. Otherwise, we probably need to add a flag to
distinguish whether the bio will go through bio_put.
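
To illustrate the kind of caller the first paragraph has in mind, here is
a hedged sketch (hypothetical code, assuming roughly the 4.12-era bio
API) of a bio that never goes through bio_put(), so anything attached to
it would never be freed from bio_free():

	struct bio bio;
	struct bio_vec bvec;

	/* on-stack bio: bio_put()/bio_free() are never called on it */
	bio_init(&bio, &bvec, 1);
	bio.bi_bdev = bdev;			/* assumed caller-provided */
	bio.bi_iter.bi_sector = sector;
	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
	bio_add_page(&bio, page, PAGE_SIZE, 0);

	submit_bio_wait(&bio);		/* completes via bio_endio() */
	/*
	 * No bio_put() here, so cgroup/integrity info attached to the bio
	 * is only released if bio_endio() does it.
	 */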

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio-integrity.c | 1 +
 block/bio.c           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b8a3a65..e7fa7d0 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -120,6 +120,7 @@ void bio_integrity_free(struct bio *bio)
 	}
 
 	bio->bi_integrity = NULL;
+	bio->bi_opf &= ~REQ_INTEGRITY;
 }
 EXPORT_SYMBOL(bio_integrity_free);
 
diff --git a/block/bio.c b/block/bio.c
index 9cf98b2..4bf3a29 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1828,6 +1828,8 @@ void bio_endio(struct bio *bio)
 	}
 
 	blk_throtl_bio_endio(bio);
+	/* release cgroup/integrity info */
+	__bio_free(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V4 11/12] blktrace: add an option to allow displying cgroup path
  2017-06-28 16:29 [PATCH V4 00/12] blktrace: output cgroup info Shaohua Li
                   ` (9 preceding siblings ...)
  2017-06-28 16:30 ` [PATCH V4 10/12] block: call __bio_free in bio_endio Shaohua Li
@ 2017-06-28 16:30 ` Shaohua Li
  2017-06-28 16:41   ` Jens Axboe
  2017-06-28 16:30 ` [PATCH V4 12/12] block: use standard blktrace API to output cgroup info for debug notes Shaohua Li
  2017-06-28 16:43 ` [PATCH V4 00/12] blktrace: output cgroup info Jens Axboe
  12 siblings, 1 reply; 36+ messages in thread
From: Shaohua Li @ 2017-06-28 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

By default we output the cgroup id in blktrace. This adds an option to
display the cgroup path instead. Since getting the cgroup path is a
relatively heavy operation, it is not enabled by default.

With the option enabled, blktrace will output something like this:
dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]
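
In the default (id) mode, the field shown as "/test/level" above is an
"ino,generation" pair instead. As an editorial illustration (not part of
this patch), userspace can resolve that pair back to a cgroup path
roughly as follows, assuming a cgroup2 mount at /sys/fs/cgroup, the
FILEID_INO32_GEN handle layout (32-bit inode number followed by 32-bit
generation) and CAP_DAC_READ_SEARCH:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct file_handle *fh;
	uint32_t *data;
	char proc[64], path[4096];
	int mnt_fd, fd;
	ssize_t n;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <ino-hex> <gen-hex>\n", argv[0]);
		return 1;
	}

	fh = malloc(sizeof(*fh) + 2 * sizeof(uint32_t));
	if (!fh)
		return 1;
	fh->handle_bytes = 2 * sizeof(uint32_t);
	fh->handle_type = 1;			/* FILEID_INO32_GEN (assumed) */
	data = (uint32_t *)fh->f_handle;
	data[0] = strtoul(argv[1], NULL, 16);	/* ino as printed by blktrace */
	data[1] = strtoul(argv[2], NULL, 16);	/* generation */

	mnt_fd = open("/sys/fs/cgroup", O_RDONLY | O_DIRECTORY);
	if (mnt_fd < 0) {
		perror("open /sys/fs/cgroup");
		return 1;
	}

	fd = open_by_handle_at(mnt_fd, fh, O_RDONLY);
	if (fd < 0) {
		perror("open_by_handle_at");
		return 1;
	}

	/* the fd refers to the cgroup directory; read its path back */
	snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
	n = readlink(proc, path, sizeof(path) - 1);
	if (n < 0)
		return 1;
	path[n] = '\0';
	printf("%s\n", path);
	return 0;
}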

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/mount.c       | 19 +++++++++++++++++++
 include/linux/cgroup.h  |  6 ++++++
 include/linux/kernfs.h  |  2 ++
 kernel/cgroup/cgroup.c  | 12 ++++++++++++
 kernel/trace/blktrace.c | 14 +++++++++++++-
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index fa32358..7c452f4 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -65,6 +65,25 @@ const struct super_operations kernfs_sops = {
 	.show_path	= kernfs_sop_show_path,
 };
 
+/*
+ * Similar to kernfs_fh_get_inode, this one gets kernfs node from inode
+ * number and generation
+ */
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+	const union kernfs_node_id *id)
+{
+	struct kernfs_node *kn;
+
+	kn = kernfs_find_and_get_node_by_ino(root, id->ino);
+	if (!kn)
+		return NULL;
+	if (kn->id.generation != id->generation) {
+		kernfs_put(kn);
+		return NULL;
+	}
+	return kn;
+}
+
 static struct inode *kernfs_fh_get_inode(struct super_block *sb,
 		u64 ino, u32 generation)
 {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 52ef9a6..6144fe9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -613,6 +613,9 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
 {
 	return &cgrp->kn->id;
 }
+
+void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+					char *buf, size_t buflen);
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -645,6 +648,9 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 {
 	return true;
 }
+
+static inline void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+	char *buf, size_t buflen) {}
 #endif /* !CONFIG_CGROUPS */
 
 /*
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d149361..ab25c8b 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -358,6 +358,8 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
 
 void kernfs_init(void);
 
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+	const union kernfs_node_id *id);
 #else	/* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 4a5cbe9..7fc1eba 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4614,6 +4614,18 @@ static int __init cgroup_wq_init(void)
 }
 core_initcall(cgroup_wq_init);
 
+void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+					char *buf, size_t buflen)
+{
+	struct kernfs_node *kn;
+
+	kn = kernfs_get_node_by_id(cgrp_dfl_root.kf_root, id);
+	if (!kn)
+		return;
+	kernfs_path(kn, buf, buflen);
+	kernfs_put(kn);
+}
+
 /*
  * proc_cgroup_show()
  *  - Print task's cgroup paths into seq_file, one line for each hierarchy
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index f393d7a..e90974e 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -48,12 +48,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC	0x1
 #define TRACE_BLK_OPT_CGROUP	0x2
+#define TRACE_BLK_OPT_CGNAME	0x4
 
 static struct tracer_opt blk_tracer_opts[] = {
 	/* Default disable the minimalistic output */
 	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
 #ifdef CONFIG_BLK_CGROUP
 	{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+	{ TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) },
 #endif
 	{ }
 };
@@ -1213,7 +1215,17 @@ static void blk_log_action(struct trace_iterator *iter, const char *act,
 	if (has_cg) {
 		const union kernfs_node_id *id = cgid_start(iter->ent);
 
-		trace_seq_printf(&iter->seq, "%3d,%-3d %x,%-x %2s %3s ",
+		if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) {
+			char blkcg_name_buf[NAME_MAX + 1] = "<...>";
+
+			cgroup_path_from_kernfs_id(id, blkcg_name_buf,
+				sizeof(blkcg_name_buf));
+			trace_seq_printf(&iter->seq, "%3d,%-3d %s %2s %3s ",
+				 MAJOR(t->device), MINOR(t->device),
+				 blkcg_name_buf, act, rwbs);
+		} else
+			trace_seq_printf(&iter->seq,
+				 "%3d,%-3d %x,%-x %2s %3s ",
 				 MAJOR(t->device), MINOR(t->device),
 				 id->ino, id->generation, act, rwbs);
 	} else
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V4 12/12] block: use standard blktrace API to output cgroup info for debug notes
  2017-06-28 16:29 [PATCH V4 00/12] blktrace: output cgroup info Shaohua Li
                   ` (10 preceding siblings ...)
  2017-06-28 16:30 ` [PATCH V4 11/12] blktrace: add an option to allow displying cgroup path Shaohua Li
@ 2017-06-28 16:30 ` Shaohua Li
  2017-06-28 16:43 ` [PATCH V4 00/12] blktrace: output cgroup info Jens Axboe
  12 siblings, 0 replies; 36+ messages in thread
From: Shaohua Li @ 2017-06-28 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Currently cfq/bfq/blk-throttle output cgroup info in their trace messages
in their own way. Now that we have a standard blktrace API for this,
convert them to use it.

Note, this changes the behavior a little bit. cgroup info isn't output by
default; we only do this with the 'blk_cgroup' option enabled. cgroup
info isn't output as a string by default either; we only do this with the
'blk_cgname' option enabled. Also, cgroup info is output in a different
position of the note string. I think these behavior changes aren't a big
issue (actually we make the trace data shorter, which is good), since the
blktrace note is solely for debugging.
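
As a hedged illustration of the converted call sites (hypothetical
caller; 'q', 'tg', 'nr_queued' and the message are made up), a message
that used to embed a manually rendered blkg path now just passes the
blkcg and lets blktrace decide whether and how to print it:

	/* before: render the path by hand and embed it in the message */
	char pbuf[128];

	blkg_path(tg_to_blkg(tg), pbuf, sizeof(pbuf));
	blk_add_trace_msg(q, "throtl %s dispatch queued=%u", pbuf, nr_queued);

	/* after: pass the blkcg; printing is controlled by the blk_cgroup
	 * and blk_cgname tracer options */
	blk_add_cgroup_trace_msg(q, tg_to_blkg(tg)->blkcg,
				 "throtl dispatch queued=%u", nr_queued);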

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bfq-iosched.h          | 13 ++++++++-----
 block/blk-throttle.c         |  6 ++----
 block/cfq-iosched.c          | 15 ++++++---------
 include/linux/blktrace_api.h | 13 +++++++++----
 kernel/trace/blktrace.c      | 12 ++++++++++--
 5 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 5c3bf98..ea6a6eb 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -916,13 +916,16 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	do {			\
-	blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid,\
-			bfq_bfqq_sync((bfqq)) ? 'S' : 'A',		\
-			bfqq_group(bfqq)->blkg_path, ##args);		\
+	blk_add_cgroup_trace_msg((bfqd)->queue,				\
+			bfqg_to_blkg(bfqq_group(bfqq))->blkcg,		\
+			"bfq%d%c " fmt, (bfqq)->pid,			\
+			bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args);	\
 } while (0)
 
-#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)	\
-	blk_add_trace_msg((bfqd)->queue, "%s " fmt, (bfqg)->blkg_path, ##args)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)	do {			\
+	blk_add_cgroup_trace_msg((bfqd)->queue,				\
+		bfqg_to_blkg(bfqg)->blkcg, fmt, ##args);		\
+} while (0)
 
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6ebd2b..6a4c4c4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -373,10 +373,8 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	if (likely(!blk_trace_note_message_enabled(__td->queue)))	\
 		break;							\
 	if ((__tg)) {							\
-		char __pbuf[128];					\
-									\
-		blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf));	\
-		blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, ##args); \
+		blk_add_cgroup_trace_msg(__td->queue,			\
+			tg_to_blkg(__tg)->blkcg, "throtl " fmt, ##args);\
 	} else {							\
 		blk_add_trace_msg(__td->queue, "throtl " fmt, ##args);	\
 	}								\
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3d5c289..0fb78fb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -656,20 +656,17 @@ static inline void cfqg_put(struct cfq_group *cfqg)
 }
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	do {			\
-	char __pbuf[128];						\
-									\
-	blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf));	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+	blk_add_cgroup_trace_msg((cfqd)->queue,				\
+			cfqg_to_blkg((cfqq)->cfqg)->blkcg,		\
+			"cfq%d%c%c " fmt, (cfqq)->pid,			\
 			cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
 			cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
-			  __pbuf, ##args);				\
+			  ##args);					\
 } while (0)
 
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...)	do {			\
-	char __pbuf[128];						\
-									\
-	blkg_path(cfqg_to_blkg(cfqg), __pbuf, sizeof(__pbuf));		\
-	blk_add_trace_msg((cfqd)->queue, "%s " fmt, __pbuf, ##args);	\
+	blk_add_cgroup_trace_msg((cfqd)->queue,				\
+			cfqg_to_blkg(cfqg)->blkcg, fmt, ##args);	\
 } while (0)
 
 static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index d2e9085..67b4d4d 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -28,10 +28,12 @@ struct blk_trace {
 	atomic_t dropped;
 };
 
+struct blkcg;
+
 extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
 extern void blk_trace_shutdown(struct request_queue *);
-extern __printf(2, 3)
-void __trace_note_message(struct blk_trace *, const char *fmt, ...);
+extern __printf(3, 4)
+void __trace_note_message(struct blk_trace *, struct blkcg *blkcg, const char *fmt, ...);
 
 /**
  * blk_add_trace_msg - Add a (simple) message to the blktrace stream
@@ -46,12 +48,14 @@ void __trace_note_message(struct blk_trace *, const char *fmt, ...);
  *     NOTE: Can not use 'static inline' due to presence of var args...
  *
  **/
-#define blk_add_trace_msg(q, fmt, ...)					\
+#define blk_add_cgroup_trace_msg(q, cg, fmt, ...)			\
 	do {								\
 		struct blk_trace *bt = (q)->blk_trace;			\
 		if (unlikely(bt))					\
-			__trace_note_message(bt, fmt, ##__VA_ARGS__);	\
+			__trace_note_message(bt, cg, fmt, ##__VA_ARGS__);\
 	} while (0)
+#define blk_add_trace_msg(q, fmt, ...)					\
+	blk_add_cgroup_trace_msg(q, NULL, fmt, ##__VA_ARGS__)
 #define BLK_TN_MAX_MSG		128
 
 static inline bool blk_trace_note_message_enabled(struct request_queue *q)
@@ -82,6 +86,7 @@ extern struct attribute_group blk_trace_attr_group;
 # define blk_trace_startstop(q, start)			(-ENOTTY)
 # define blk_trace_remove(q)				(-ENOTTY)
 # define blk_add_trace_msg(q, fmt, ...)			do { } while (0)
+# define blk_add_cgroup_trace_msg(q, cg, fmt, ...)	do { } while (0)
 # define blk_trace_remove_sysfs(dev)			do { } while (0)
 # define blk_trace_note_message_enabled(q)		(false)
 static inline int blk_trace_init_sysfs(struct device *dev)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index e90974e..7724de1 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -154,7 +154,8 @@ static void trace_note_time(struct blk_trace *bt)
 	local_irq_restore(flags);
 }
 
-void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
+void __trace_note_message(struct blk_trace *bt, struct blkcg *blkcg,
+	const char *fmt, ...)
 {
 	int n;
 	va_list args;
@@ -178,7 +179,14 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
 	n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
 	va_end(args);
 
+	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+		blkcg = NULL;
+#ifdef CONFIG_BLK_CGROUP
+	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
+		blkcg ? cgroup_get_kernfs_id(blkcg->css.cgroup) : NULL);
+#else
 	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
+#endif
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -375,7 +383,7 @@ static ssize_t blk_msg_write(struct file *filp, const char __user *buffer,
 		return PTR_ERR(msg);
 
 	bt = filp->private_data;
-	__trace_note_message(bt, "%s", msg);
+	__trace_note_message(bt, NULL, "%s", msg);
 	kfree(msg);
 
 	return count;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 11/12] blktrace: add an option to allow displying cgroup path
  2017-06-28 16:30 ` [PATCH V4 11/12] blktrace: add an option to allow displying cgroup path Shaohua Li
@ 2017-06-28 16:41   ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2017-06-28 16:41 UTC (permalink / raw)
  To: Shaohua Li, linux-kernel, linux-block
  Cc: tj, gregkh, hch, rostedt, lizefan, Kernel-team, Shaohua Li

On 06/28/2017 10:30 AM, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> By default we output cgroup id in blktrace. This adds an option to
> display cgroup path. Since get cgroup path is a relativly heavy
> operation, we don't enable it by default.
> 
> with the option enabled, blktrace will output something like this:
> dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]

Typo in your subject, "displaying".

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 16:29 [PATCH V4 00/12] blktrace: output cgroup info Shaohua Li
                   ` (11 preceding siblings ...)
  2017-06-28 16:30 ` [PATCH V4 12/12] block: use standard blktrace API to output cgroup info for debug notes Shaohua Li
@ 2017-06-28 16:43 ` Jens Axboe
  2017-06-28 16:53   ` Shaohua Li
  12 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2017-06-28 16:43 UTC (permalink / raw)
  To: Shaohua Li, linux-kernel, linux-block
  Cc: tj, gregkh, hch, rostedt, lizefan, Kernel-team, Shaohua Li

On 06/28/2017 10:29 AM, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Hi,
> 
> Currently blktrace isn't cgroup aware. blktrace prints out task name of current
> context, but the task of current context isn't always in the cgroup where the
> BIO comes from. We can't use task name to find out IO cgroup. For example,
> Writeback BIOs always comes from flusher thread but the BIOs are for different
> blk cgroups. Request could be requeued and dispatched from completely different
> tasks. MD/DM are another examples. This brings challenges if we want to use
> blktrace for performance tunning with cgroup enabled.
> 
> This patchset try to fix the gap. We print out cgroup fhandle info in blktrace.
> Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or
> userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and
> use a BPF program to filter out blktrace for a specific cgroup.
> 
> The first 6 patches adds export operation handlers for kernfs, so userspace can
> use open_by_handle_at/name_to_handle_at to a kernfs file. Later patches make
> blktrace output cgroup info.

Series looks fine to me. I don't know how you want to split or funnel it,
since it touches multiple different parts. Would it make sense to split this
series into two - one for the kernfs changes, and then a subsequent block
series that depend on that?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 16:43 ` [PATCH V4 00/12] blktrace: output cgroup info Jens Axboe
@ 2017-06-28 16:53   ` Shaohua Li
  2017-06-28 16:54     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Shaohua Li @ 2017-06-28 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-block, tj, gregkh, hch, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 10:43:48AM -0600, Jens Axboe wrote:
> On 06/28/2017 10:29 AM, Shaohua Li wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > Hi,
> > 
> > Currently blktrace isn't cgroup aware. blktrace prints out task name of current
> > context, but the task of current context isn't always in the cgroup where the
> > BIO comes from. We can't use task name to find out IO cgroup. For example,
> > Writeback BIOs always comes from flusher thread but the BIOs are for different
> > blk cgroups. Request could be requeued and dispatched from completely different
> > tasks. MD/DM are another examples. This brings challenges if we want to use
> > blktrace for performance tunning with cgroup enabled.
> > 
> > This patchset try to fix the gap. We print out cgroup fhandle info in blktrace.
> > Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or
> > userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and
> > use a BPF program to filter out blktrace for a specific cgroup.
> > 
> > The first 6 patches adds export operation handlers for kernfs, so userspace can
> > use open_by_handle_at/name_to_handle_at to a kernfs file. Later patches make
> > blktrace output cgroup info.
> 
> Series looks fine to me. I don't know how you want to split or funnel it,
> since it touches multiple different parts. Would it make sense to split this
> series into two - one for the kernfs changes, and then a subsequent block
> series that depend on that?

What's the best practice to do this without building errors? Ask Tejun
to merge the first 7 patches first?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 16:53   ` Shaohua Li
@ 2017-06-28 16:54     ` Jens Axboe
  2017-06-28 18:11       ` Tejun Heo
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2017-06-28 16:54 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, gregkh, hch, rostedt, lizefan,
	Kernel-team, Shaohua Li

On 06/28/2017 10:53 AM, Shaohua Li wrote:
> On Wed, Jun 28, 2017 at 10:43:48AM -0600, Jens Axboe wrote:
>> On 06/28/2017 10:29 AM, Shaohua Li wrote:
>>> From: Shaohua Li <shli@fb.com>
>>>
>>> Hi,
>>>
>>> Currently blktrace isn't cgroup aware. blktrace prints out task name of current
>>> context, but the task of current context isn't always in the cgroup where the
>>> BIO comes from. We can't use task name to find out IO cgroup. For example,
>>> Writeback BIOs always comes from flusher thread but the BIOs are for different
>>> blk cgroups. Request could be requeued and dispatched from completely different
>>> tasks. MD/DM are another examples. This brings challenges if we want to use
>>> blktrace for performance tunning with cgroup enabled.
>>>
>>> This patchset try to fix the gap. We print out cgroup fhandle info in blktrace.
>>> Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or
>>> userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and
>>> use a BPF program to filter out blktrace for a specific cgroup.
>>>
>>> The first 6 patches adds export operation handlers for kernfs, so userspace can
>>> use open_by_handle_at/name_to_handle_at to a kernfs file. Later patches make
>>> blktrace output cgroup info.
>>
>> Series looks fine to me. I don't know how you want to split or funnel it,
>> since it touches multiple different parts. Would it make sense to split this
>> series into two - one for the kernfs changes, and then a subsequent block
>> series that depend on that?
> 
> What's the best practice to do this without building errors? Ask Tejun
> to merge the first 7 patches first?

Yes, and then resend the block patches, just noting that dependency. Then
we can funnel them in like that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 08/12] blktrace: export cgroup info in trace
  2017-06-28 16:29 ` [PATCH V4 08/12] blktrace: export cgroup info in trace Shaohua Li
@ 2017-06-28 16:56   ` Steven Rostedt
  0 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2017-06-28 16:56 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, gregkh, hch, axboe, lizefan,
	Kernel-team, Shaohua Li

On Wed, 28 Jun 2017 09:29:58 -0700
Shaohua Li <shli@kernel.org> wrote:

> From: Shaohua Li <shli@fb.com>
> 
> Currently blktrace isn't cgroup aware. blktrace prints out task name of
> current context, but the task of current context isn't always in the
> cgroup where the BIO comes from. We can't use task name to find out IO
> cgroup. For example, Writeback BIOs always comes from flusher thread but
> the BIOs are for different blk cgroups. Request could be requeued and
> dispatched from completely different tasks. MD/DM are another examples.
> 
> This patch tries to fix the gap. We print out cgroup fhandle info in
> blktrace. Userspace can use open_by_handle_at() syscall to find the
> cgroup by fhandle. Or userspace can use name_to_handle_at() syscall to
> find fhandle for a cgroup and use a BPF program to filter out blktrace
> for a specific cgroup.
> 
> We add a new 'blk_cgroup' trace option for blk tracer. It's default off.
> Application which doesn't know the new option isn't affected.  When it's
> on, we output fhandle info right after blk_io_trace with an extra bit
> set in event action. So from application point of view, blktrace with
> the option will output new actions.
> 
> I didn't change blk trace event yet, since I'm not sure if changing the
> trace event output is an ABI issue. If not, I'll do it later.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  include/uapi/linux/blktrace_api.h |   3 +
>  kernel/trace/blktrace.c           | 231 ++++++++++++++++++++++++++------------
>  2 files changed, 161 insertions(+), 73 deletions(-)

Doing a quick scan of the patch, nothing sticks out as an issue to me.

Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 03/12] kernfs: add an API to get kernfs node from inode number
  2017-06-28 16:29 ` [PATCH V4 03/12] kernfs: add an API to get kernfs node from inode number Shaohua Li
@ 2017-06-28 18:07   ` Tejun Heo
  2017-06-29 12:50   ` Greg KH
  1 sibling, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2017-06-28 18:07 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, gregkh, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:53AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add an API to get kernfs node from inode number. We will need this to
> implement exportfs operations.
> 
> This API will be used in blktrace too later, so it should be as fast as
> possible. To make the API lock free, kernfs node is freed in RCU
> context. And we depend on kernfs_node count/ino number to filter out
> stale kernfs nodes.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 16:54     ` Jens Axboe
@ 2017-06-28 18:11       ` Tejun Heo
  2017-06-28 20:57         ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2017-06-28 18:11 UTC (permalink / raw)
  To: Jens Axboe, gregkh
  Cc: Shaohua Li, linux-kernel, linux-block, hch, rostedt, lizefan,
	Kernel-team, Shaohua Li

Hello,

On Wed, Jun 28, 2017 at 10:54:28AM -0600, Jens Axboe wrote:
> >> Series looks fine to me. I don't know how you want to split or funnel it,
> >> since it touches multiple different parts. Would it make sense to split this
> >> series into two - one for the kernfs changes, and then a subsequent block
> >> series that depend on that?
> > 
> > What's the best practice to do this without building errors? Ask Tejun
> > to merge the first 7 patches first?
> 
> Yes, and then resend the block patches, just noting that dependency. Then
> we can funnel them in like that.

I wonder whether it'd be a lot easier to route the whole series
through one tree, most likely block.  Greg, would that be okay with
you?  Alternatively, we can route the whole thing through driver tree
if Jens is okay with that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 07/12] cgroup: export fhandle info for a cgroup
  2017-06-28 16:29 ` [PATCH V4 07/12] cgroup: export fhandle info for a cgroup Shaohua Li
@ 2017-06-28 18:12   ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2017-06-28 18:12 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, gregkh, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:57AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add an API to export cgroup fhandle info. We don't export a full 'struct
> file_handle', there are unrequired info. Sepcifically, cgroup is always
> a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle,
> we only need export the inode number and generation number just like
> what generic_fh_to_dentry does. And we can avoid the overhead of getting
> an inode too, since kernfs_node_id (ino and generation) has all the info
> required.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 18:11       ` Tejun Heo
@ 2017-06-28 20:57         ` Jens Axboe
  2017-06-28 21:25           ` Tejun Heo
  2017-06-29 18:39           ` Shaohua Li
  0 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2017-06-28 20:57 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe, gregkh
  Cc: Shaohua Li, linux-kernel, linux-block, hch, rostedt, lizefan,
	Kernel-team, Shaohua Li

On 06/28/2017 12:11 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jun 28, 2017 at 10:54:28AM -0600, Jens Axboe wrote:
>>>> Series looks fine to me. I don't know how you want to split or funnel it,
>>>> since it touches multiple different parts. Would it make sense to split this
>>>> series into two - one for the kernfs changes, and then a subsequent block
>>>> series that depend on that?
>>>
>>> What's the best practice to do this without building errors? Ask Tejun
>>> to merge the first 7 patches first?
>>
>> Yes, and then resend the block patches, just noting that dependency. Then
>> we can funnel them in like that.
> 
> I wonder whether it'd be a lot easier to route the whole series
> through one tree, most likely block.  Greg, would that be okay with
> you?  Alternatively, we can route the whole thing through driver tree
> if Jens is okay with that.

Personally I don't care that much, but the risk of conflicts is much
higher on the block side, than on the kernfs side. So might be the
path of less resistance to pull it through the block tree. And I'd be
happy to do that, if the sign offs on the kernfs side are sufficient.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 20:57         ` Jens Axboe
@ 2017-06-28 21:25           ` Tejun Heo
  2017-06-29 12:50             ` Greg KH
  2017-06-29 18:39           ` Shaohua Li
  1 sibling, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2017-06-28 21:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: gregkh, Shaohua Li, linux-kernel, linux-block, hch, rostedt,
	lizefan, Kernel-team, Shaohua Li

Hello,

On Wed, Jun 28, 2017 at 02:57:38PM -0600, Jens Axboe wrote:
> Personally I don't care that much, but the risk of conflicts is much
> higher on the block side, than on the kernfs side. So might be the
> path of less resistance to pull it through the block tree. And I'd be
> happy to do that, if the sign offs on the kernfs side are sufficient.

Block tree it is then.  Let's wait for Greg to respond.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 10/12] block: call __bio_free in bio_endio
  2017-06-28 16:30 ` [PATCH V4 10/12] block: call __bio_free in bio_endio Shaohua Li
@ 2017-06-28 21:29   ` Christoph Hellwig
  2017-06-28 21:42     ` Shaohua Li
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2017-06-28 21:29 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, gregkh, hch, axboe, rostedt,
	lizefan, Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:30:00AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> bio_free isn't a good place to free cgroup/integrity info. There are a
> lot of cases bio is allocated in special way (for example, in stack) and
> never gets called by bio_put hence bio_free, we are leaking memory. This
> patch moves the free to bio endio, which should be called anyway. The
> __bio_free call in bio_free is kept, in case the bio never gets called
> bio endio.
> 
> This assumes ->bi_end_io() doesn't access cgroup/integrity info, which
> seems true in my audit. Otherwise, we probably must add a flag to
> distinguish if bio will be called by bio_put.

bio_integrity_endio -> bio_integrity_verify_fn -> bio_integrity_process
access the integrity data, so I don't think this works as-is.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 10/12] block: call __bio_free in bio_endio
  2017-06-28 21:29   ` Christoph Hellwig
@ 2017-06-28 21:42     ` Shaohua Li
  2017-06-29 17:15       ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Shaohua Li @ 2017-06-28 21:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-block, tj, gregkh, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 11:29:08PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 28, 2017 at 09:30:00AM -0700, Shaohua Li wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > bio_free isn't a good place to free cgroup/integrity info. There are a
> > lot of cases bio is allocated in special way (for example, in stack) and
> > never gets called by bio_put hence bio_free, we are leaking memory. This
> > patch moves the free to bio endio, which should be called anyway. The
> > __bio_free call in bio_free is kept, in case the bio never gets called
> > bio endio.
> > 
> > This assumes ->bi_end_io() doesn't access cgroup/integrity info, which
> > seems true in my audit. Otherwise, we probably must add a flag to
> > distinguish if bio will be called by bio_put.
> 
> bio_integrity_endio -> bio_integrity_verify_fn -> bio_integrity_process
> access the integrity data, so I don't think this works as-is.

oh, I probably missed the integrity endio. could we let bio_integrity_verify_fn
free integrity info and bio_endio free cgroup info?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 21:25           ` Tejun Heo
@ 2017-06-29 12:50             ` Greg KH
  0 siblings, 0 replies; 36+ messages in thread
From: Greg KH @ 2017-06-29 12:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Shaohua Li, linux-kernel, linux-block, hch, rostedt,
	lizefan, Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 05:25:25PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jun 28, 2017 at 02:57:38PM -0600, Jens Axboe wrote:
> > Personally I don't care that much, but the risk of conflicts is much
> > higher on the block side, than on the kernfs side. So might be the
> > path of less resistance to pull it through the block tree. And I'd be
> > happy to do that, if the sign offs on the kernfs side are sufficient.
> 
> Block tree it is then.  Let's wait for Greg to respond.

block tree is fine with me, I'll go ack the kernfs patches now.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 03/12] kernfs: add an API to get kernfs node from inode number
  2017-06-28 16:29 ` [PATCH V4 03/12] kernfs: add an API to get kernfs node from inode number Shaohua Li
  2017-06-28 18:07   ` Tejun Heo
@ 2017-06-29 12:50   ` Greg KH
  1 sibling, 0 replies; 36+ messages in thread
From: Greg KH @ 2017-06-29 12:50 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:53AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add an API to get kernfs node from inode number. We will need this to
> implement exportfs operations.
> 
> This API will be used in blktrace too later, so it should be as fast as
> possible. To make the API lock free, kernfs node is freed in RCU
> context. And we depend on kernfs_node count/ino number to filter out
> stale kernfs nodes.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 06/12] kernfs: add exportfs operations
  2017-06-28 16:29 ` [PATCH V4 06/12] kernfs: add exportfs operations Shaohua Li
@ 2017-06-29 12:50   ` Greg KH
  0 siblings, 0 replies; 36+ messages in thread
From: Greg KH @ 2017-06-29 12:50 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:56AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Now we have the facilities to implement exportfs operations. The idea is
> cgroup can export the fhandle info to userspace, then userspace uses
> fhandle to find the cgroup name. Another example is userspace can get
> fhandle for a cgroup and BPF uses the fhandle to filter info for the
> cgroup.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 04/12] kernfs: don't set dentry->d_fsdata
  2017-06-28 16:29 ` [PATCH V4 04/12] kernfs: don't set dentry->d_fsdata Shaohua Li
@ 2017-06-29 12:51   ` Greg KH
  0 siblings, 0 replies; 36+ messages in thread
From: Greg KH @ 2017-06-29 12:51 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:54AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> When working on adding exportfs operations in kernfs, I found it's hard
> to initialize dentry->d_fsdata in the exportfs operations. Looks there
> is no way to do it without race condition. Look at the kernfs code
> closely, there is no point to set dentry->d_fsdata. inode->i_private
> already points to kernfs_node, and we can get inode from a dentry. So
> this patch just delete the d_fsdata usage.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 05/12] kernfs: introduce kernfs_node_id
  2017-06-28 16:29 ` [PATCH V4 05/12] kernfs: introduce kernfs_node_id Shaohua Li
@ 2017-06-29 12:51   ` Greg KH
  0 siblings, 0 replies; 36+ messages in thread
From: Greg KH @ 2017-06-29 12:51 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:55AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> inode number and generation can identify a kernfs node. We are going to
> export the identification by exportfs operations, so put ino and
> generation into a separate structure. It's convenient when later patches
> use the identification.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 01/12] kernfs: use idr instead of ida to manage inode number
  2017-06-28 16:29 ` [PATCH V4 01/12] kernfs: use idr instead of ida to manage inode number Shaohua Li
@ 2017-06-29 12:51   ` Greg KH
  0 siblings, 0 replies; 36+ messages in thread
From: Greg KH @ 2017-06-29 12:51 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:51AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> kernfs uses ida to manage inode number. The problem is we can't get
> kernfs_node from inode number with ida. Switching to use idr, next patch
> will add an API to get kernfs_node from inode number.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 02/12] kernfs: implement i_generation
  2017-06-28 16:29 ` [PATCH V4 02/12] kernfs: implement i_generation Shaohua Li
@ 2017-06-29 12:51   ` Greg KH
  0 siblings, 0 replies; 36+ messages in thread
From: Greg KH @ 2017-06-29 12:51 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, tj, hch, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 09:29:52AM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Set i_generation for kernfs inode. This is required to implement
> exportfs operations. The generation is 32-bit, so it's possible the
> generation wraps up and we find stale files. To reduce the posssibility,
> we don't reuse inode numer immediately. When the inode number allocation
> wraps, we increase generation number. In this way generation/inode
> number consist of a 64-bit number which is unlikely duplicated. This
> does make the idr tree more sparse and waste some memory. Since idr
> manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6
> bits of the key. In a 100k inode kernfs, the worst case will have around
> 300k radix tree node. Each node is 576bytes, so the tree will use about
> ~150M memory. Sounds not too bad, if this really is a problem, we should
> find better data structure.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 10/12] block: call __bio_free in bio_endio
  2017-06-28 21:42     ` Shaohua Li
@ 2017-06-29 17:15       ` Christoph Hellwig
  2017-06-29 18:35         ` Shaohua Li
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2017-06-29 17:15 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Christoph Hellwig, linux-kernel, linux-block, tj, gregkh, axboe,
	rostedt, lizefan, Kernel-team, Shaohua Li, Martin K. Petersen

On Wed, Jun 28, 2017 at 02:42:49PM -0700, Shaohua Li wrote:
> > bio_integrity_endio -> bio_integrity_verify_fn -> bio_integrity_process
> > access the integrity data, so I don't think this works as-is.
> 
> oh, I probably missed the integrity endio. could we let bio_integrity_verify_fn
> free integrity info and and bio_endio free cgroup info?

something like this (will need the cgroup fixes from you still) should
do the trick, although it's completely untested:

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b8a3a65f7364..b66eb92b5a00 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(bio_integrity_alloc);
  * Description: Used to free the integrity portion of a bio. Usually
  * called from bio_free().
  */
-void bio_integrity_free(struct bio *bio)
+static void bio_integrity_free(struct bio *bio)
 {
 	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct bio_set *bs = bio->bi_pool;
@@ -120,8 +120,8 @@ void bio_integrity_free(struct bio *bio)
 	}
 
 	bio->bi_integrity = NULL;
+	bio->bi_opf &= ~REQ_INTEGRITY;
 }
-EXPORT_SYMBOL(bio_integrity_free);
 
 /**
  * bio_integrity_add_page - Attach integrity metadata
@@ -340,12 +340,6 @@ int bio_integrity_prep(struct bio *bio)
 		offset = 0;
 	}
 
-	/* Install custom I/O completion handler if read verify is enabled */
-	if (bio_data_dir(bio) == READ) {
-		bip->bip_end_io = bio->bi_end_io;
-		bio->bi_end_io = bio_integrity_endio;
-	}
-
 	/* Auto-generate integrity metadata if this is a write */
 	if (bio_data_dir(bio) == WRITE)
 		bio_integrity_process(bio, bi->profile->generate_fn);
@@ -370,14 +364,12 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 
 	bio->bi_status = bio_integrity_process(bio, bi->profile->verify_fn);
-
-	/* Restore original bio completion handler */
-	bio->bi_end_io = bip->bip_end_io;
+	bio_integrity_free(bio);
 	bio_endio(bio);
 }
 
 /**
- * bio_integrity_endio - Integrity I/O completion function
+ * __bio_integrity_endio - Integrity I/O completion function
  * @bio:	Protected bio
  * @error:	Pointer to errno
  *
@@ -388,27 +380,19 @@ static void bio_integrity_verify_fn(struct work_struct *work)
  * in process context.	This function postpones completion
  * accordingly.
  */
-void bio_integrity_endio(struct bio *bio)
+bool __bio_integrity_endio(struct bio *bio)
 {
-	struct bio_integrity_payload *bip = bio_integrity(bio);
+	if (bio_op(bio) == REQ_OP_READ && !bio->bi_status) {
+		struct bio_integrity_payload *bip = bio_integrity(bio);
 
-	BUG_ON(bip->bip_bio != bio);
-
-	/* In case of an I/O error there is no point in verifying the
-	 * integrity metadata.  Restore original bio end_io handler
-	 * and run it.
-	 */
-	if (bio->bi_status) {
-		bio->bi_end_io = bip->bip_end_io;
-		bio_endio(bio);
-
-		return;
+		INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
+		queue_work(kintegrityd_wq, &bip->bip_work);
+		return false;
 	}
 
-	INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
-	queue_work(kintegrityd_wq, &bip->bip_work);
+	bio_integrity_free(bio);
+	return true;
 }
-EXPORT_SYMBOL(bio_integrity_endio);
 
 /**
  * bio_integrity_advance - Advance integrity vector
diff --git a/block/bio.c b/block/bio.c
index 9cf98b29588a..1ac81de06e74 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,9 +243,6 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
 static void __bio_free(struct bio *bio)
 {
 	bio_disassociate_task(bio);
-
-	if (bio_integrity(bio))
-		bio_integrity_free(bio);
 }
 
 static void bio_free(struct bio *bio)
@@ -1807,6 +1804,8 @@ void bio_endio(struct bio *bio)
 again:
 	if (!bio_remaining_done(bio))
 		return;
+	if (!bio_integrity_endio(bio))
+		return;
 
 	/*
 	 * Need to have a real endio function for chained bios, otherwise
diff --git a/block/blk.h b/block/blk.h
index 01ebb8185f6b..921106babba8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -81,10 +81,21 @@ static inline void blk_queue_enter_live(struct request_queue *q)
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
+bool __bio_integrity_endio(struct bio *);
+static inline bool bio_integrity_endio(struct bio *bio)
+{
+	if (bio_integrity(bio))
+		return __bio_integrity_endio(bio);
+	return true;
+}
 #else
 static inline void blk_flush_integrity(void)
 {
 }
+static inline bool bio_integrity_endio(struct bio *)
+{
+	return true;
+}
 #endif
 
 void blk_timeout_work(struct work_struct *work);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 4907bea03908..b3a05cd49c06 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -303,8 +303,6 @@ struct bio_integrity_payload {
 
 	struct bvec_iter	bip_iter;
 
-	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
-
 	unsigned short		bip_slab;	/* slab the bip came from */
 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
@@ -721,11 +719,9 @@ struct biovec_slab {
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
-extern void bio_integrity_free(struct bio *);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_prep(struct bio *);
-extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
@@ -760,11 +756,6 @@ static inline int bio_integrity_prep(struct bio *bio)
 	return 0;
 }
 
-static inline void bio_integrity_free(struct bio *bio)
-{
-	return;
-}
-
 static inline int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 				      gfp_t gfp_mask)
 {

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 10/12] block: call __bio_free in bio_endio
  2017-06-29 17:15       ` Christoph Hellwig
@ 2017-06-29 18:35         ` Shaohua Li
  2017-06-29 20:22           ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Shaohua Li @ 2017-06-29 18:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-block, tj, gregkh, axboe, rostedt, lizefan,
	Kernel-team, Shaohua Li, Martin K. Petersen

On Thu, Jun 29, 2017 at 07:15:52PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 28, 2017 at 02:42:49PM -0700, Shaohua Li wrote:
> > > bio_integrity_endio -> bio_integrity_verify_fn -> bio_integrity_process
> > > access the integrity data, so I don't think this works as-is.
> > 
> > Oh, I probably missed the integrity endio. Could we let bio_integrity_verify_fn
> > free the integrity info and bio_endio free the cgroup info?
> 
> something like this (will need the cgroup fixes from you still) should
> do the trick, although it's completely untested:
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index b8a3a65f7364..b66eb92b5a00 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -102,7 +102,7 @@ EXPORT_SYMBOL(bio_integrity_alloc);
>   * Description: Used to free the integrity portion of a bio. Usually
>   * called from bio_free().
>   */
> -void bio_integrity_free(struct bio *bio)
> +static void bio_integrity_free(struct bio *bio)
>  {
>  	struct bio_integrity_payload *bip = bio_integrity(bio);
>  	struct bio_set *bs = bio->bi_pool;
> @@ -120,8 +120,8 @@ void bio_integrity_free(struct bio *bio)
>  	}
>  
>  	bio->bi_integrity = NULL;
> +	bio->bi_opf &= ~REQ_INTEGRITY;
>  }
> -EXPORT_SYMBOL(bio_integrity_free);
>  
>  /**
>   * bio_integrity_add_page - Attach integrity metadata
> @@ -340,12 +340,6 @@ int bio_integrity_prep(struct bio *bio)
>  		offset = 0;
>  	}
>  
> -	/* Install custom I/O completion handler if read verify is enabled */
> -	if (bio_data_dir(bio) == READ) {
> -		bip->bip_end_io = bio->bi_end_io;
> -		bio->bi_end_io = bio_integrity_endio;
> -	}
> -
>  	/* Auto-generate integrity metadata if this is a write */
>  	if (bio_data_dir(bio) == WRITE)
>  		bio_integrity_process(bio, bi->profile->generate_fn);
> @@ -370,14 +364,12 @@ static void bio_integrity_verify_fn(struct work_struct *work)
>  	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
>  
>  	bio->bi_status = bio_integrity_process(bio, bi->profile->verify_fn);
> -
> -	/* Restore original bio completion handler */
> -	bio->bi_end_io = bip->bip_end_io;
> +	bio_integrity_free(bio);
>  	bio_endio(bio);

Should we call bi_end_io directly here? Otherwise this looks reasonable to me.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 00/12] blktrace: output cgroup info
  2017-06-28 20:57         ` Jens Axboe
  2017-06-28 21:25           ` Tejun Heo
@ 2017-06-29 18:39           ` Shaohua Li
  1 sibling, 0 replies; 36+ messages in thread
From: Shaohua Li @ 2017-06-29 18:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, gregkh, linux-kernel, linux-block, hch, rostedt,
	lizefan, Kernel-team, Shaohua Li

On Wed, Jun 28, 2017 at 02:57:38PM -0600, Jens Axboe wrote:
> On 06/28/2017 12:11 PM, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Jun 28, 2017 at 10:54:28AM -0600, Jens Axboe wrote:
> >>>> Series looks fine to me. I don't know how you want to split or funnel it,
> >>>> since it touches multiple different parts. Would it make sense to split this
> >>>> series into two - one for the kernfs changes, and then a subsequent block
> >>>> series that depends on it?
> >>>
> >>> What's the best practice to do this without introducing build errors? Ask Tejun
> >>> to merge the first 7 patches first?
> >>
> >> Yes, and then resend the block patches, just noting that dependency. Then
> >> we can funnel them in like that.
> > 
> > I wonder whether it'd be a lot easier to route the whole series
> > through one tree, most likely block.  Greg, would that be okay with
> > you?  Alternatively, we can route the whole thing through the driver tree
> > if Jens is okay with that.
> 
> Personally I don't care that much, but the risk of conflicts is much
> higher on the block side than on the kernfs side. So it might be the
> path of least resistance to pull it through the block tree. And I'd be
> happy to do that, if the sign offs on the kernfs side are sufficient.

Jens,
Now that we have the stamps, could you please queue the patches in your tree?

Please drop patch 10 for now. With Christoph's integrity patch, we only need
to free the cgroup info in bio_endio. I'll resend a patch.

Thanks,
Shaohua
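
For reference, one possible shape of that follow-up, purely illustrative and
not the actual resend: keep the integrity handling from Christoph's patch and
only move the cgroup disassociation into bio_endio(), e.g.:

void bio_endio(struct bio *bio)
{
again:
	if (!bio_remaining_done(bio))
		return;
	if (!bio_integrity_endio(bio))
		return;

	/* ... chain handling as before ... */

	/* release the css/blkcg association taken at submission time */
	bio_disassociate_task(bio);

	if (bio->bi_end_io)
		bio->bi_end_io(bio);
}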

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V4 10/12] block: call __bio_free in bio_endio
  2017-06-29 18:35         ` Shaohua Li
@ 2017-06-29 20:22           ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2017-06-29 20:22 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Christoph Hellwig, linux-kernel, linux-block, tj, gregkh, axboe,
	rostedt, lizefan, Kernel-team, Shaohua Li, Martin K. Petersen

On Thu, Jun 29, 2017 at 11:35:44AM -0700, Shaohua Li wrote:
> > -
> >  	/* Auto-generate integrity metadata if this is a write */
> >  	if (bio_data_dir(bio) == WRITE)
> >  		bio_integrity_process(bio, bi->profile->generate_fn);
> > @@ -370,14 +364,12 @@ static void bio_integrity_verify_fn(struct work_struct *work)
> >  	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> >  
> >  	bio->bi_status = bio_integrity_process(bio, bi->profile->verify_fn);
> > -
> > -	/* Restore original bio completion handler */
> > -	bio->bi_end_io = bip->bip_end_io;
> > +	bio_integrity_free(bio);
> >  	bio_endio(bio);
> 
> should we directly call bi_end_io here? Otherwise, looks reasonable to me.

We should call bio_endio to get the proper chaining behavior.
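
The chaining behaviour in question: bio_endio() walks up chained parents
rather than invoking ->bi_end_io() directly, roughly as in the simplified
excerpt below (paraphrased from block/bio.c of this series, details trimmed).
Calling bi_end_io from the verify worker would bypass this.

void bio_endio(struct bio *bio)
{
again:
	if (!bio_remaining_done(bio))
		return;
	if (!bio_integrity_endio(bio))
		return;

	/*
	 * A chained bio's completion is really part of its parent's
	 * completion: step up to the parent and re-run the checks for it
	 * instead of calling ->bi_end_io() on the child.
	 */
	if (bio->bi_end_io == bio_chain_endio) {
		bio = __bio_chain_endio(bio);
		goto again;
	}

	if (bio->bi_end_io)
		bio->bi_end_io(bio);
}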

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread

Thread overview: 36+ messages
2017-06-28 16:29 [PATCH V4 00/12] blktrace: output cgroup info Shaohua Li
2017-06-28 16:29 ` [PATCH V4 01/12] kernfs: use idr instead of ida to manage inode number Shaohua Li
2017-06-29 12:51   ` Greg KH
2017-06-28 16:29 ` [PATCH V4 02/12] kernfs: implement i_generation Shaohua Li
2017-06-29 12:51   ` Greg KH
2017-06-28 16:29 ` [PATCH V4 03/12] kernfs: add an API to get kernfs node from inode number Shaohua Li
2017-06-28 18:07   ` Tejun Heo
2017-06-29 12:50   ` Greg KH
2017-06-28 16:29 ` [PATCH V4 04/12] kernfs: don't set dentry->d_fsdata Shaohua Li
2017-06-29 12:51   ` Greg KH
2017-06-28 16:29 ` [PATCH V4 05/12] kernfs: introduce kernfs_node_id Shaohua Li
2017-06-29 12:51   ` Greg KH
2017-06-28 16:29 ` [PATCH V4 06/12] kernfs: add exportfs operations Shaohua Li
2017-06-29 12:50   ` Greg KH
2017-06-28 16:29 ` [PATCH V4 07/12] cgroup: export fhandle info for a cgroup Shaohua Li
2017-06-28 18:12   ` Tejun Heo
2017-06-28 16:29 ` [PATCH V4 08/12] blktrace: export cgroup info in trace Shaohua Li
2017-06-28 16:56   ` Steven Rostedt
2017-06-28 16:29 ` [PATCH V4 09/12] block: always attach cgroup info into bio Shaohua Li
2017-06-28 16:30 ` [PATCH V4 10/12] block: call __bio_free in bio_endio Shaohua Li
2017-06-28 21:29   ` Christoph Hellwig
2017-06-28 21:42     ` Shaohua Li
2017-06-29 17:15       ` Christoph Hellwig
2017-06-29 18:35         ` Shaohua Li
2017-06-29 20:22           ` Christoph Hellwig
2017-06-28 16:30 ` [PATCH V4 11/12] blktrace: add an option to allow displying cgroup path Shaohua Li
2017-06-28 16:41   ` Jens Axboe
2017-06-28 16:30 ` [PATCH V4 12/12] block: use standard blktrace API to output cgroup info for debug notes Shaohua Li
2017-06-28 16:43 ` [PATCH V4 00/12] blktrace: output cgroup info Jens Axboe
2017-06-28 16:53   ` Shaohua Li
2017-06-28 16:54     ` Jens Axboe
2017-06-28 18:11       ` Tejun Heo
2017-06-28 20:57         ` Jens Axboe
2017-06-28 21:25           ` Tejun Heo
2017-06-29 12:50             ` Greg KH
2017-06-29 18:39           ` Shaohua Li
