linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V2 00/12]blktrace: output cgroup info
@ 2017-06-14 16:11 Shaohua Li
  2017-06-14 16:11 ` [PATCH V2 01/12] kernfs: implement i_generation Shaohua Li
                   ` (11 more replies)
  0 siblings, 12 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:11 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Hi,

Currently blktrace isn't cgroup aware. blktrace prints out task name of current
context, but the task of current context isn't always in the cgroup where the
BIO comes from. We can't use task name to find out IO cgroup. For example,
Writeback BIOs always comes from flusher thread but the BIOs are for different
blk cgroups. Request could be requeued and dispatched from completely different
tasks. MD/DM are another examples. This brings challenges if we want to use
blktrace for performance tunning with cgroup enabled.

This patchset try to fix the gap. We print out cgroup fhandle info in blktrace.
Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or
userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and
use a BPF program to filter out blktrace for a specific cgroup.

The first 6 patches adds export operation handlers for kernfs, so userspace can
use open_by_handle_at/name_to_handle_at to a kernfs file. Later patches make
blktrace output cgroup info.

Note, we export 64-bit inode number and 32-bit generation number for fhandle.
Currently kernfs only supports 32-bit inode number actually because idr only
supports 32-bit allocation. We had plan to support 64-bit inode number soon, as
Tejun has concerns the 32-bit inode/generation could wrap easily. This patchset
hasn't converted inode number to 64-bit yet.

Thanks,
Shaohua

V1 -> V2:
- Fix a bug in cgroup association
- Fix build errors reported by 0day
- Address some issues pointed out by Tejun

Shaohua Li (12):
  kernfs: implement i_generation
  kernfs: use idr instead of ida to manage inode number
  kernfs: add an API to get kernfs node from inode number
  kernfs: don't set dentry->d_fsdata
  kernfs: introduce kernfs_node_id
  kernfs: add exportfs operations
  cgroup: export fhandle info for a cgroup
  blktrace: export cgroup info in trace
  block: always attach cgroup info into bio
  block: call __bio_free in bio_endio
  blktrace: add an option to allow displying cgroup path
  block: use standard blktrace API to output cgroup info for debug notes

 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |   2 +-
 block/bfq-iosched.h                      |  13 +-
 block/bio-integrity.c                    |   1 +
 block/bio.c                              |   2 +
 block/blk-throttle.c                     |  13 +-
 block/cfq-iosched.c                      |  15 +-
 fs/kernfs/dir.c                          | 101 +++++++++---
 fs/kernfs/file.c                         |  10 +-
 fs/kernfs/inode.c                        |   9 +-
 fs/kernfs/kernfs-internal.h              |   9 ++
 fs/kernfs/mount.c                        | 144 +++++++++++++++--
 fs/kernfs/symlink.c                      |   6 +-
 fs/sysfs/mount.c                         |   2 +-
 include/linux/blk-cgroup.h               |  17 +-
 include/linux/blktrace_api.h             |  13 +-
 include/linux/cgroup.h                   |  16 +-
 include/linux/exportfs.h                 |  11 ++
 include/linux/kernfs.h                   |  36 ++++-
 include/uapi/linux/blktrace_api.h        |   3 +
 kernel/cgroup/cgroup.c                   |  15 +-
 kernel/trace/blktrace.c                  | 259 ++++++++++++++++++++++---------
 21 files changed, 523 insertions(+), 174 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH V2 01/12] kernfs: implement i_generation
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
@ 2017-06-14 16:11 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 02/12] kernfs: use idr instead of ida to manage inode number Shaohua Li
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:11 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Set i_generation for kernfs inode. This is required to implement exportfs
operations.

Note, the generation is 32-bit, so it's possible the generation wraps up
and we find stale files. The possiblity is low, since fhandle matches
both inode number and generation. In most fs, the generation is 32-bit.
fhandle only export 32-bit generation for most fs. So unless we have
solid reason, we'd live with the possible conflict.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c        | 2 ++
 fs/kernfs/inode.c      | 1 +
 include/linux/kernfs.h | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index db5900aaa..09d093e 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -634,6 +634,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	if (ret < 0)
 		goto err_out2;
 	kn->ino = ret;
+	kn->generation = atomic_inc_return(&root->next_generation);
 
 	atomic_set(&kn->count, 1);
 	atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
@@ -877,6 +878,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	ida_init(&root->ino_ida);
 	INIT_LIST_HEAD(&root->supers);
+	atomic_set(&root->next_generation, 0);
 
 	kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       KERNFS_DIR);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fb4b4a7..79cdae4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,6 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	inode->i_private = kn;
 	inode->i_mapping->a_ops = &kernfs_aops;
 	inode->i_op = &kernfs_iops;
+	inode->i_generation = kn->generation;
 
 	set_default_inode_attr(inode, kn->mode);
 	kernfs_refresh_inode(kn, inode);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index a9b11b8..c5f0fa7 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -135,6 +135,7 @@ struct kernfs_node {
 	umode_t			mode;
 	unsigned int		ino;
 	struct kernfs_iattrs	*iattr;
+	u32			generation;
 };
 
 /*
@@ -170,6 +171,7 @@ struct kernfs_root {
 	struct list_head	supers;
 
 	wait_queue_head_t	deactivate_waitq;
+	atomic_t		next_generation;
 };
 
 struct kernfs_open_file {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 02/12] kernfs: use idr instead of ida to manage inode number
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
  2017-06-14 16:11 ` [PATCH V2 01/12] kernfs: implement i_generation Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 03/12] kernfs: add an API to get kernfs node from " Shaohua Li
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

kernfs uses ida to manage inode number. The problem is we can't get
kernfs_node from inode number with ida. Switching to use idr, next patch
will add an API to get kernfs_node from inode number.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c        | 17 ++++++++++++-----
 include/linux/kernfs.h |  2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 09d093e..8e8545a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -21,6 +21,7 @@
 DEFINE_MUTEX(kernfs_mutex);
 static DEFINE_SPINLOCK(kernfs_rename_lock);	/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];	/* protected by rename_lock */
+static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 #define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)
 
@@ -533,7 +534,9 @@ void kernfs_put(struct kernfs_node *kn)
 		simple_xattrs_free(&kn->iattr->xattrs);
 	}
 	kfree(kn->iattr);
-	ida_simple_remove(&root->ino_ida, kn->ino);
+	spin_lock(&kernfs_idr_lock);
+	idr_remove(&root->ino_idr, kn->ino);
+	spin_unlock(&kernfs_idr_lock);
 	kmem_cache_free(kernfs_node_cache, kn);
 
 	kn = parent;
@@ -542,7 +545,7 @@ void kernfs_put(struct kernfs_node *kn)
 			goto repeat;
 	} else {
 		/* just released the root kn, free @root too */
-		ida_destroy(&root->ino_ida);
+		idr_destroy(&root->ino_idr);
 		kfree(root);
 	}
 }
@@ -630,7 +633,11 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	if (!kn)
 		goto err_out1;
 
-	ret = ida_simple_get(&root->ino_ida, 1, 0, GFP_KERNEL);
+	idr_preload(GFP_KERNEL);
+	spin_lock(&kernfs_idr_lock);
+	ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+	spin_unlock(&kernfs_idr_lock);
+	idr_preload_end();
 	if (ret < 0)
 		goto err_out2;
 	kn->ino = ret;
@@ -876,14 +883,14 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 	if (!root)
 		return ERR_PTR(-ENOMEM);
 
-	ida_init(&root->ino_ida);
+	idr_init(&root->ino_idr);
 	INIT_LIST_HEAD(&root->supers);
 	atomic_set(&root->next_generation, 0);
 
 	kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       KERNFS_DIR);
 	if (!kn) {
-		ida_destroy(&root->ino_ida);
+		idr_destroy(&root->ino_idr);
 		kfree(root);
 		return ERR_PTR(-ENOMEM);
 	}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index c5f0fa7..61668d1 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -164,7 +164,7 @@ struct kernfs_root {
 	unsigned int		flags;	/* KERNFS_ROOT_* flags */
 
 	/* private fields, do not use outside kernfs proper */
-	struct ida		ino_ida;
+	struct idr		ino_idr;
 	struct kernfs_syscall_ops *syscall_ops;
 
 	/* list of kernfs_super_info of this root, protected by kernfs_mutex */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 03/12] kernfs: add an API to get kernfs node from inode number
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
  2017-06-14 16:11 ` [PATCH V2 01/12] kernfs: implement i_generation Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 02/12] kernfs: use idr instead of ida to manage inode number Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 04/12] kernfs: don't set dentry->d_fsdata Shaohua Li
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Add an API to get kernfs node from inode number. We will need this to
implement exportfs operations.

To make the API lock free, kernfs node is freed in RCU context. And we
depend on kernfs_node count/ino number to filter stale kernfs nodes.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c             | 53 +++++++++++++++++++++++++++++++++++++++++++++
 fs/kernfs/kernfs-internal.h |  2 ++
 fs/kernfs/mount.c           |  4 +++-
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8e8545a..646b56b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -508,6 +508,10 @@ void kernfs_put(struct kernfs_node *kn)
 	struct kernfs_node *parent;
 	struct kernfs_root *root;
 
+	/*
+	 * kernfs_node is freed with ->count 0, kernfs_find_and_get_node_by_ino
+	 * depends on this to filter reused stale node
+	 */
 	if (!kn || !atomic_dec_and_test(&kn->count))
 		return;
 	root = kernfs_root(kn);
@@ -643,6 +647,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	kn->ino = ret;
 	kn->generation = atomic_inc_return(&root->next_generation);
 
+	/* set ino first. Above atomic_inc_return has a barrier */
 	atomic_set(&kn->count, 1);
 	atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
 	RB_CLEAR_NODE(&kn->rb);
@@ -674,6 +679,54 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 	return kn;
 }
 
+/*
+ * kernfs_find_and_get_node_by_ino - get kernfs_node from inode number
+ * @root: the kernfs root
+ * @ino: inode number
+ *
+ * RETURNS:
+ * NULL on failure. Return a kernfs node with reference counter incremented
+ */
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+						    unsigned int ino)
+{
+	struct kernfs_node *kn;
+
+	rcu_read_lock();
+	kn = idr_find(&root->ino_idr, ino);
+	if (!kn)
+		goto out;
+
+	/*
+	 * Since kernfs_node is freed in RCU, it's possible an old node for ino
+	 * is freed, but reused before RCU grace period. But a freed node (see
+	 * kernfs_put) or an incompletedly initialized node (see
+	 * __kernfs_new_node) should have 'count' 0. We can use this fact to
+	 * filter out such node.
+	 */
+	if (!atomic_inc_not_zero(&kn->count)) {
+		kn = NULL;
+		goto out;
+	}
+
+	/*
+	 * The node could be a new node or a reused node. If it's a new node,
+	 * we are ok. If it's reused because of RCU, the __kernfs_new_node
+	 * always sets its 'ino' before 'count'. So if 'count' is uptodate,
+	 * 'ino' should be uptodate, hence we can use 'ino' to filter stale
+	 * node.
+	 */
+	if (kn->ino != ino)
+		goto out;
+	rcu_read_unlock();
+
+	return kn;
+out:
+	rcu_read_unlock();
+	kernfs_put(kn);
+	return NULL;
+}
+
 /**
  *	kernfs_add_one - add kernfs_node to parent without warning
  *	@kn: kernfs_node to be added
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 2d5144a..e9c226f 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -98,6 +98,8 @@ int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    unsigned flags);
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+						    unsigned int ino);
 
 /*
  * file.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d5b149a..343dfeb 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -332,5 +332,7 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0,
+					      SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
+					      NULL);
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 04/12] kernfs: don't set dentry->d_fsdata
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (2 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 03/12] kernfs: add an API to get kernfs node from " Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 05/12] kernfs: introduce kernfs_node_id Shaohua Li
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

When working on adding exportfs operations in kernfs, I found it's hard
to initialize dentry->d_fsdata in the exportfs operations. Looks there
is no way to do it without race condition. Look at the kernfs code
closely, there is no point to set dentry->d_fsdata. inode->i_private
already points to kernfs_node, and we can get inode from a dentry. So
this patch just delete the d_fsdata usage.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c             | 25 +++++++++----------------
 fs/kernfs/file.c            |  6 +++---
 fs/kernfs/inode.c           |  6 +++---
 fs/kernfs/kernfs-internal.h |  7 +++++++
 fs/kernfs/mount.c           |  8 ++------
 fs/kernfs/symlink.c         |  6 +++---
 6 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 646b56b..e54fdd9 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -566,7 +566,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	if (d_really_is_negative(dentry))
 		goto out_bad_unlocked;
 
-	kn = dentry->d_fsdata;
+	kn = kernfs_dentry_node(dentry);
 	mutex_lock(&kernfs_mutex);
 
 	/* The kernfs node has been deactivated */
@@ -574,7 +574,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		goto out_bad;
 
 	/* The kernfs node has been moved? */
-	if (dentry->d_parent->d_fsdata != kn->parent)
+	if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
 		goto out_bad;
 
 	/* The kernfs node has been renamed */
@@ -594,14 +594,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	return 0;
 }
 
-static void kernfs_dop_release(struct dentry *dentry)
-{
-	kernfs_put(dentry->d_fsdata);
-}
-
 const struct dentry_operations kernfs_dops = {
 	.d_revalidate	= kernfs_dop_revalidate,
-	.d_release	= kernfs_dop_release,
 };
 
 /**
@@ -617,8 +611,9 @@ const struct dentry_operations kernfs_dops = {
  */
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 {
-	if (dentry->d_sb->s_op == &kernfs_sops)
-		return dentry->d_fsdata;
+	if (dentry->d_sb->s_op == &kernfs_sops &&
+	    !d_really_is_negative(dentry))
+		return kernfs_dentry_node(dentry);
 	return NULL;
 }
 
@@ -1046,7 +1041,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 					unsigned int flags)
 {
 	struct dentry *ret;
-	struct kernfs_node *parent = dentry->d_parent->d_fsdata;
+	struct kernfs_node *parent = dir->i_private;
 	struct kernfs_node *kn;
 	struct inode *inode;
 	const void *ns = NULL;
@@ -1063,8 +1058,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 		ret = NULL;
 		goto out_unlock;
 	}
-	kernfs_get(kn);
-	dentry->d_fsdata = kn;
 
 	/* attach dentry and inode */
 	inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1101,7 +1094,7 @@ static int kernfs_iop_mkdir(struct inode *dir, struct dentry *dentry,
 
 static int kernfs_iop_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	struct kernfs_node *kn  = dentry->d_fsdata;
+	struct kernfs_node *kn  = kernfs_dentry_node(dentry);
 	struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
 	int ret;
 
@@ -1121,7 +1114,7 @@ static int kernfs_iop_rename(struct inode *old_dir, struct dentry *old_dentry,
 			     struct inode *new_dir, struct dentry *new_dentry,
 			     unsigned int flags)
 {
-	struct kernfs_node *kn  = old_dentry->d_fsdata;
+	struct kernfs_node *kn = kernfs_dentry_node(old_dentry);
 	struct kernfs_node *new_parent = new_dir->i_private;
 	struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
 	int ret;
@@ -1634,7 +1627,7 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dentry = file->f_path.dentry;
-	struct kernfs_node *parent = dentry->d_fsdata;
+	struct kernfs_node *parent = kernfs_dentry_node(dentry);
 	struct kernfs_node *pos = file->private_data;
 	const void *ns = NULL;
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ac2dfe0..7f90d4d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -616,7 +616,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
 {
-	struct kernfs_node *kn = file->f_path.dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_root *root = kernfs_root(kn);
 	const struct kernfs_ops *ops;
 	struct kernfs_open_file *of;
@@ -768,7 +768,7 @@ static void kernfs_release_file(struct kernfs_node *kn,
 
 static int kernfs_fop_release(struct inode *inode, struct file *filp)
 {
-	struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_open_file *of = kernfs_of(filp);
 
 	if (kn->flags & KERNFS_HAS_RELEASE) {
@@ -835,7 +835,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 static unsigned int kernfs_fop_poll(struct file *filp, poll_table *wait)
 {
 	struct kernfs_open_file *of = kernfs_of(filp);
-	struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+	struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry);
 	struct kernfs_open_node *on = kn->attr.open;
 
 	if (!kernfs_get_active(kn))
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 79cdae4..4c8b510 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -112,7 +112,7 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr)
 {
 	struct inode *inode = d_inode(dentry);
-	struct kernfs_node *kn = dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	int error;
 
 	if (!kn)
@@ -154,7 +154,7 @@ static int kernfs_node_setsecdata(struct kernfs_iattrs *attrs, void **secdata,
 
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size)
 {
-	struct kernfs_node *kn = dentry->d_fsdata;
+	struct kernfs_node *kn = kernfs_dentry_node(dentry);
 	struct kernfs_iattrs *attrs;
 
 	attrs = kernfs_iattrs(kn);
@@ -203,8 +203,8 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
 		       u32 request_mask, unsigned int query_flags)
 {
-	struct kernfs_node *kn = path->dentry->d_fsdata;
 	struct inode *inode = d_inode(path->dentry);
+	struct kernfs_node *kn = inode->i_private;
 
 	mutex_lock(&kernfs_mutex);
 	kernfs_refresh_inode(kn, inode);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index e9c226f..0f260dc 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -70,6 +70,13 @@ struct kernfs_super_info {
 };
 #define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
 
+static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
+{
+	if (d_really_is_negative(dentry))
+		return NULL;
+	return d_inode(dentry)->i_private;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache;
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 343dfeb..462a40c 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -33,7 +33,7 @@ static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data)
 
 static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 {
-	struct kernfs_root *root = kernfs_root(dentry->d_fsdata);
+	struct kernfs_root *root = kernfs_root(kernfs_dentry_node(dentry));
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
 
 	if (scops && scops->show_options)
@@ -43,7 +43,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 
 static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
 {
-	struct kernfs_node *node = dentry->d_fsdata;
+	struct kernfs_node *node = kernfs_dentry_node(dentry);
 	struct kernfs_root *root = kernfs_root(node);
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
 
@@ -176,8 +176,6 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 		pr_debug("%s: could not get root dentry!\n", __func__);
 		return -ENOMEM;
 	}
-	kernfs_get(info->root->kn);
-	root->d_fsdata = info->root->kn;
 	sb->s_root = root;
 	sb->s_d_op = &kernfs_dops;
 	return 0;
@@ -283,7 +281,6 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 void kernfs_kill_sb(struct super_block *sb)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
-	struct kernfs_node *root_kn = sb->s_root->d_fsdata;
 
 	mutex_lock(&kernfs_mutex);
 	list_del(&info->node);
@@ -295,7 +292,6 @@ void kernfs_kill_sb(struct super_block *sb)
 	 */
 	kill_anon_super(sb);
 	kfree(info);
-	kernfs_put(root_kn);
 }
 
 /**
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 1684af4..08ccabd 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -98,9 +98,9 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
 	return 0;
 }
 
-static int kernfs_getlink(struct dentry *dentry, char *path)
+static int kernfs_getlink(struct inode *inode, char *path)
 {
-	struct kernfs_node *kn = dentry->d_fsdata;
+	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_node *parent = kn->parent;
 	struct kernfs_node *target = kn->symlink.target_kn;
 	int error;
@@ -124,7 +124,7 @@ static const char *kernfs_iop_get_link(struct dentry *dentry,
 	body = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!body)
 		return ERR_PTR(-ENOMEM);
-	error = kernfs_getlink(dentry, body);
+	error = kernfs_getlink(inode, body);
 	if (unlikely(error < 0)) {
 		kfree(body);
 		return ERR_PTR(error);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 05/12] kernfs: introduce kernfs_node_id
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (3 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 04/12] kernfs: don't set dentry->d_fsdata Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-15 20:14   ` kbuild test robot
  2017-06-14 16:12 ` [PATCH V2 06/12] kernfs: add exportfs operations Shaohua Li
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

inode number and generation can identify a kernfs node. We are going to
export the identification by exportfs operations, so put ino and
generation into a separate structure. It's convenient when later patches
use the identification.

Please note, I extend inode number to 64 bits. idr actually doesn't
support 64-bit yet, so currently inode number is always 32-bit. Tejun
has concerns 32-bit inode and generation could wrap. We will need to fix
this later. The data structure will be exported out to userspace, so the
goal of the 64-bit data structure is to avoid change the data structure
if we make inode number to 64-bit later.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/dir.c        | 10 +++++-----
 fs/kernfs/file.c       |  4 ++--
 fs/kernfs/inode.c      |  4 ++--
 include/linux/cgroup.h |  2 +-
 include/linux/kernfs.h |  9 +++++++--
 5 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index e54fdd9..7483bea 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -539,7 +539,7 @@ void kernfs_put(struct kernfs_node *kn)
 	}
 	kfree(kn->iattr);
 	spin_lock(&kernfs_idr_lock);
-	idr_remove(&root->ino_idr, kn->ino);
+	idr_remove(&root->ino_idr, kn->id.ino);
 	spin_unlock(&kernfs_idr_lock);
 	kmem_cache_free(kernfs_node_cache, kn);
 
@@ -639,8 +639,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	idr_preload_end();
 	if (ret < 0)
 		goto err_out2;
-	kn->ino = ret;
-	kn->generation = atomic_inc_return(&root->next_generation);
+	kn->id.ino = ret;
+	kn->id.generation = atomic_inc_return(&root->next_generation);
 
 	/* set ino first. Above atomic_inc_return has a barrier */
 	atomic_set(&kn->count, 1);
@@ -711,7 +711,7 @@ struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
 	 * 'ino' should be uptodate, hence we can use 'ino' to filter stale
 	 * node.
 	 */
-	if (kn->ino != ino)
+	if (kn->id.ino != ino)
 		goto out;
 	rcu_read_unlock();
 
@@ -1644,7 +1644,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		const char *name = pos->name;
 		unsigned int type = dt_type(pos);
 		int len = strlen(name);
-		ino_t ino = pos->ino;
+		ino_t ino = pos->id.ino;
 
 		ctx->pos = pos->hash;
 		file->private_data = pos;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 7f90d4d..7441925 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -895,7 +895,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 		 * have the matching @file available.  Look up the inodes
 		 * and generate the events manually.
 		 */
-		inode = ilookup(info->sb, kn->ino);
+		inode = ilookup(info->sb, kn->id.ino);
 		if (!inode)
 			continue;
 
@@ -903,7 +903,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 		if (parent) {
 			struct inode *p_inode;
 
-			p_inode = ilookup(info->sb, parent->ino);
+			p_inode = ilookup(info->sb, parent->id.ino);
 			if (p_inode) {
 				fsnotify(p_inode, FS_MODIFY | FS_EVENT_ON_CHILD,
 					 inode, FSNOTIFY_EVENT_INODE, kn->name, 0);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 4c8b510..a343039 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,7 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	inode->i_private = kn;
 	inode->i_mapping->a_ops = &kernfs_aops;
 	inode->i_op = &kernfs_iops;
-	inode->i_generation = kn->generation;
+	inode->i_generation = kn->id.generation;
 
 	set_default_inode_attr(inode, kn->mode);
 	kernfs_refresh_inode(kn, inode);
@@ -266,7 +266,7 @@ struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn)
 {
 	struct inode *inode;
 
-	inode = iget_locked(sb, kn->ino);
+	inode = iget_locked(sb, kn->id.ino);
 	if (inode && (inode->i_state & I_NEW))
 		kernfs_init_inode(kn, inode);
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 710a005..30c6877 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -543,7 +543,7 @@ static inline bool cgroup_is_populated(struct cgroup *cgrp)
 /* returns ino associated with a cgroup */
 static inline ino_t cgroup_ino(struct cgroup *cgrp)
 {
-	return cgrp->kn->ino;
+	return cgrp->kn->id.ino;
 }
 
 /* cft/css accessors for cftype->write() operation */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 61668d1..6be2d57 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -95,6 +95,12 @@ struct kernfs_elem_attr {
 	struct kernfs_node	*notify_next;	/* for kernfs_notify() */
 };
 
+/* represent a kernfs node */
+struct kernfs_node_id {
+	u64			ino;
+	u32			generation;
+} __attribute__((packed));
+
 /*
  * kernfs_node - the building block of kernfs hierarchy.  Each and every
  * kernfs node is represented by single kernfs_node.  Most fields are
@@ -131,11 +137,10 @@ struct kernfs_node {
 
 	void			*priv;
 
+	struct kernfs_node_id	id;
 	unsigned short		flags;
 	umode_t			mode;
-	unsigned int		ino;
 	struct kernfs_iattrs	*iattr;
-	u32			generation;
 };
 
 /*
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 06/12] kernfs: add exportfs operations
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (4 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 05/12] kernfs: introduce kernfs_node_id Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 07/12] cgroup: export fhandle info for a cgroup Shaohua Li
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Now we have the facilities to implement exportfs operations. The idea is
cgroup can export the fhandle info to userspace, then userspace uses
fhandle to find the cgroup name. Another example is userspace can get
fhandle for a cgroup and BPF uses the fhandle to filter info for the
cgroup.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |   2 +-
 fs/kernfs/mount.c                        | 113 ++++++++++++++++++++++++++++++-
 fs/sysfs/mount.c                         |   2 +-
 include/linux/exportfs.h                 |  11 +++
 include/linux/kernfs.h                   |  23 +++++--
 kernel/cgroup/cgroup.c                   |   3 +-
 6 files changed, 144 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index f5af0cc..fee2126 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -854,7 +854,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 	}
 
 	dentry = kernfs_mount(fs_type, flags, rdt_root,
-			      RDTGROUP_SUPER_MAGIC, NULL);
+			      RDTGROUP_SUPER_MAGIC, NULL, false);
 	if (IS_ERR(dentry))
 		goto out_cdp;
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 462a40c..5af73a8 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/namei.h>
 #include <linux/seq_file.h>
+#include <linux/exportfs.h>
 
 #include "kernfs-internal.h"
 
@@ -64,6 +65,107 @@ const struct super_operations kernfs_sops = {
 	.show_path	= kernfs_sop_show_path,
 };
 
+static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
+                        struct inode *parent)
+{
+	struct kernfs_fid *fid = (struct kernfs_fid *)fh;
+
+	if (parent && (*max_len) < KERNFS_FID_WITH_PARENT_LEN) {
+		*max_len = KERNFS_FID_WITH_PARENT_LEN;
+		return FILEID_INVALID;
+	} else if ((*max_len) < KERNFS_FID_WITHOUT_PARENT_LEN) {
+		*max_len = KERNFS_FID_WITHOUT_PARENT_LEN;
+		return FILEID_INVALID;
+	}
+
+	fid->ino = inode->i_ino;
+	fid->gen = inode->i_generation;
+	if (parent) {
+		fid->parent_ino = parent->i_ino;
+		fid->parent_gen = parent->i_generation;
+		*max_len = KERNFS_FID_WITH_PARENT_LEN;
+		return FILEID_KERNFS_WITH_PARENT;
+	} else {
+		*max_len = KERNFS_FID_WITHOUT_PARENT_LEN;
+		return FILEID_KERNFS_WITHOUT_PARENT;
+	}
+}
+
+static struct inode *kernfs_fh_get_inode(struct super_block *sb,
+		u64 ino, u32 generation)
+{
+	struct kernfs_super_info *info = kernfs_info(sb);
+	struct inode *inode;
+	struct kernfs_node *kn;
+
+	if (ino == 0)
+		return ERR_PTR(-ESTALE);
+
+	kn = kernfs_find_and_get_node_by_ino(info->root, ino);
+	if (!kn)
+		return ERR_PTR(-ESTALE);
+	inode = kernfs_get_inode(sb, kn);
+	kernfs_put(kn);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	if (inode->i_generation != generation) {
+		/* we didn't find the right inode.. */
+		iput(inode);
+		return ERR_PTR(-ESTALE);
+	}
+	return inode;
+}
+
+static struct dentry *kernfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	struct kernfs_fid *kfid = (struct kernfs_fid *)fid;
+	struct inode *inode = NULL;
+
+	if (fh_len < KERNFS_FID_WITHOUT_PARENT_LEN)
+		return NULL;
+
+	switch (fh_type) {
+	case FILEID_KERNFS_WITHOUT_PARENT:
+	case FILEID_KERNFS_WITH_PARENT:
+		inode = kernfs_fh_get_inode(sb, kfid->ino, kfid->gen);
+		break;
+	}
+
+	return d_obtain_alias(inode);
+}
+
+static struct dentry *kernfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	struct kernfs_fid *kfid = (struct kernfs_fid *)fid;
+	struct inode *inode = NULL;
+
+	if (fh_len < KERNFS_FID_WITH_PARENT_LEN)
+		return NULL;
+
+	if (fh_type == FILEID_KERNFS_WITH_PARENT)
+		inode = kernfs_fh_get_inode(sb, kfid->parent_ino,
+					    kfid->parent_gen);
+
+	return d_obtain_alias(inode);
+}
+
+static struct dentry *kernfs_get_parent_dentry(struct dentry *child)
+{
+	struct kernfs_node *kn = kernfs_dentry_node(child);
+
+	return d_obtain_alias(kernfs_get_inode(child->d_sb, kn->parent));
+}
+
+static const struct export_operations kernfs_export_ops = {
+	.encode_fh	= kernfs_encode_fh,
+	.fh_to_dentry	= kernfs_fh_to_dentry,
+	.fh_to_parent	= kernfs_fh_to_parent,
+	.get_parent	= kernfs_get_parent_dentry,
+};
+
 /**
  * kernfs_root_from_sb - determine kernfs_root associated with a super_block
  * @sb: the super_block in question
@@ -145,7 +247,8 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
 	} while (true);
 }
 
-static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
+static int kernfs_fill_super(struct super_block *sb, unsigned long magic,
+			     bool enable_expop)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
 	struct inode *inode;
@@ -159,6 +262,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 	sb->s_magic = magic;
 	sb->s_op = &kernfs_sops;
 	sb->s_xattr = kernfs_xattr_handlers;
+	if (enable_expop)
+		sb->s_export_op = &kernfs_export_ops;
 	sb->s_time_gran = 1;
 
 	/* get root inode, initialize and unlock it */
@@ -219,6 +324,7 @@ const void *kernfs_super_ns(struct super_block *sb)
  * @magic: file system specific magic number
  * @new_sb_created: tell the caller if we allocated a new superblock
  * @ns: optional namespace tag of the mount
+ * @enable_expop: if adding fhandle support
  *
  * This is to be called from each kernfs user's file_system_type->mount()
  * implementation, which should pass through the specified @fs_type and
@@ -229,7 +335,8 @@ const void *kernfs_super_ns(struct super_block *sb)
  */
 struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 				struct kernfs_root *root, unsigned long magic,
-				bool *new_sb_created, const void *ns)
+				bool *new_sb_created, const void *ns,
+				bool enable_expop)
 {
 	struct super_block *sb;
 	struct kernfs_super_info *info;
@@ -255,7 +362,7 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 	if (!sb->s_root) {
 		struct kernfs_super_info *info = kernfs_info(sb);
 
-		error = kernfs_fill_super(sb, magic);
+		error = kernfs_fill_super(sb, magic, enable_expop);
 		if (error) {
 			deactivate_locked_super(sb);
 			return ERR_PTR(error);
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 20b8f82..d1a3336b 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -37,7 +37,7 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 
 	ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
 	root = kernfs_mount_ns(fs_type, flags, sysfs_root,
-				SYSFS_MAGIC, &new_sb, ns);
+				SYSFS_MAGIC, &new_sb, ns, false);
 	if (IS_ERR(root) || !new_sb)
 		kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
 	else if (new_sb)
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 5ab958c..e9abf75 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -104,6 +104,17 @@ enum fid_type {
 	FILEID_LUSTRE = 0x97,
 
 	/*
+	 * 64 bit inode number, 32 bit generation number
+	 */
+	FILEID_KERNFS_WITHOUT_PARENT = 0x91,
+
+	/*
+	 * 64 bit inode number, 32 bit generation number
+	 * 32 bit parent generation bumber, 64 bit parent inode number
+	 */
+	FILEID_KERNFS_WITH_PARENT = 0x92,
+
+	/*
 	 * Filesystems must not use 0xff file ID.
 	 */
 	FILEID_INVALID = 0xff,
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 6be2d57..3b38bdf 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -101,6 +101,20 @@ struct kernfs_node_id {
 	u32			generation;
 } __attribute__((packed));
 
+struct kernfs_fid {
+	/*
+	 * the first two fields should have the identical layout as
+	 * kernfs_node_id
+	 */
+	u64			ino;
+	u32			gen;
+	u32			parent_gen;
+	u64			parent_ino;
+} __attribute__((packed));
+#define KERNFS_FID_WITHOUT_PARENT_LEN (offsetof(struct kernfs_fid, \
+			parent_gen) / 4)
+#define KERNFS_FID_WITH_PARENT_LEN (sizeof(struct kernfs_fid) / 4)
+
 /*
  * kernfs_node - the building block of kernfs hierarchy.  Each and every
  * kernfs node is represented by single kernfs_node.  Most fields are
@@ -337,7 +351,8 @@ void kernfs_notify(struct kernfs_node *kn);
 const void *kernfs_super_ns(struct super_block *sb);
 struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 			       struct kernfs_root *root, unsigned long magic,
-			       bool *new_sb_created, const void *ns);
+			       bool *new_sb_created, const void *ns,
+			       bool enable_expop);
 void kernfs_kill_sb(struct super_block *sb);
 struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
 
@@ -440,7 +455,7 @@ static inline const void *kernfs_super_ns(struct super_block *sb)
 static inline struct dentry *
 kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 		struct kernfs_root *root, unsigned long magic,
-		bool *new_sb_created, const void *ns)
+		bool *new_sb_created, const void *ns, bool enable_expop)
 { return ERR_PTR(-ENOSYS); }
 
 static inline void kernfs_kill_sb(struct super_block *sb) { }
@@ -521,10 +536,10 @@ static inline int kernfs_rename(struct kernfs_node *kn,
 static inline struct dentry *
 kernfs_mount(struct file_system_type *fs_type, int flags,
 		struct kernfs_root *root, unsigned long magic,
-		bool *new_sb_created)
+		bool *new_sb_created, bool enable_expop)
 {
 	return kernfs_mount_ns(fs_type, flags, root,
-				magic, new_sb_created, NULL);
+				magic, new_sb_created, NULL, enable_expop);
 }
 
 #endif	/* __LINUX_KERNFS_H */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8d4e85e..639e27d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1749,7 +1749,8 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
 	struct dentry *dentry;
 	bool new_sb;
 
-	dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
+	dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb,
+			      true);
 
 	/*
 	 * In non-init cgroup namespace, instead of root cgroup's dentry,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 07/12] cgroup: export fhandle info for a cgroup
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (5 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 06/12] kernfs: add exportfs operations Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 08/12] blktrace: export cgroup info in trace Shaohua Li
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Add an API to export cgroup fhandle info. We don't export a full 'struct
file_handle', there are unrequired info. Sepcifically, cgroup is always
a directory, so we don't need a 'FILEID_KERNFS_WITH_PARENT' type
fhandle, we only need export the inode number and generation number.
Since the first part of 'kernfs_fid' and 'kernfs_node_id' have the same
layout, we can just export a 'kernfs_node_id'.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/cgroup.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 30c6877..5ebe89f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -609,6 +609,10 @@ static inline void cgroup_kthread_ready(void)
 	current->no_cgroup_migration = 0;
 }
 
+static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
+{
+	return &cgrp->kn->id;
+}
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -631,6 +635,10 @@ static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_init_kthreadd(void) {}
 static inline void cgroup_kthread_ready(void) {}
+static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
+{
+	return NULL;
+}
 
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 08/12] blktrace: export cgroup info in trace
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (6 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 07/12] cgroup: export fhandle info for a cgroup Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 09/12] block: always attach cgroup info into bio Shaohua Li
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Currently blktrace isn't cgroup aware. blktrace prints out task name of
current context, but the task of current context isn't always in the
cgroup where the BIO comes from. We can't use task name to find out IO
cgroup. For example, Writeback BIOs always comes from flusher thread but
the BIOs are for different blk cgroups. Request could be requeued and
dispatched from completely different tasks. MD/DM are another examples.

This patch tries to fix the gap. We print out cgroup fhandle info in
blktrace. Userspace can use open_by_handle_at() syscall to find the
cgroup by fhandle. Or userspace can use name_to_handle_at() syscall to
find fhandle for a cgroup and use a BPF program to filter out blktrace
for a specific cgroup.

We add a new 'blk_cgroup' trace option for blk tracer. It's default off.
Application which doesn't know the new option isn't affected.  When it's
on, we output fhandle info right after blk_io_trace with an extra bit
set in event action. So from application point of view, blktrace with
the option will output new actions.

I didn't change blk trace event yet, since I'm not sure if changing the
trace event output is an ABI issue. If not, I'll do it later.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/uapi/linux/blktrace_api.h |   3 +
 kernel/trace/blktrace.c           | 231 ++++++++++++++++++++++++++------------
 2 files changed, 161 insertions(+), 73 deletions(-)

diff --git a/include/uapi/linux/blktrace_api.h b/include/uapi/linux/blktrace_api.h
index c590ca6..9cdaede 100644
--- a/include/uapi/linux/blktrace_api.h
+++ b/include/uapi/linux/blktrace_api.h
@@ -52,6 +52,7 @@ enum blktrace_act {
 	__BLK_TA_REMAP,			/* bio was remapped */
 	__BLK_TA_ABORT,			/* request aborted */
 	__BLK_TA_DRV_DATA,		/* driver-specific binary data */
+	__BLK_TA_CGROUP = 1 << 8,	/* from a cgroup*/
 };
 
 /*
@@ -61,6 +62,7 @@ enum blktrace_notify {
 	__BLK_TN_PROCESS = 0,		/* establish pid/name mapping */
 	__BLK_TN_TIMESTAMP,		/* include system clock */
 	__BLK_TN_MESSAGE,		/* Character string message */
+	__BLK_TN_CGROUP = __BLK_TA_CGROUP, /* from a cgroup */
 };
 
 
@@ -107,6 +109,7 @@ struct blk_io_trace {
 	__u32 cpu;		/* on what cpu did it happen */
 	__u16 error;		/* completion error */
 	__u16 pdu_len;		/* length of data after this trace */
+	/* cgroup id will be stored here if exists */
 };
 
 /*
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 193c5f5..02cfcf2 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -27,6 +27,7 @@
 #include <linux/time.h>
 #include <linux/uaccess.h>
 #include <linux/list.h>
+#include <linux/blk-cgroup.h>
 
 #include "../../block/blk.h"
 
@@ -46,10 +47,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
 
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC	0x1
+#define TRACE_BLK_OPT_CGROUP	0x2
 
 static struct tracer_opt blk_tracer_opts[] = {
 	/* Default disable the minimalistic output */
 	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
+#ifdef CONFIG_BLK_CGROUP
+	{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+#endif
 	{ }
 };
 
@@ -68,7 +73,8 @@ static void blk_unregister_tracepoints(void);
  * Send out a notify message.
  */
 static void trace_note(struct blk_trace *bt, pid_t pid, int action,
-		       const void *data, size_t len)
+		       const void *data, size_t len,
+		       struct kernfs_node_id *cgid)
 {
 	struct blk_io_trace *t;
 	struct ring_buffer_event *event = NULL;
@@ -76,12 +82,13 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 	int pc = 0;
 	int cpu = smp_processor_id();
 	bool blk_tracer = blk_tracer_enabled;
+	ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;
 
 	if (blk_tracer) {
 		buffer = blk_tr->trace_buffer.buffer;
 		pc = preempt_count();
 		event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
-						  sizeof(*t) + len,
+						  sizeof(*t) + len + cgid_len,
 						  0, pc);
 		if (!event)
 			return;
@@ -92,17 +99,19 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 	if (!bt->rchan)
 		return;
 
-	t = relay_reserve(bt->rchan, sizeof(*t) + len);
+	t = relay_reserve(bt->rchan, sizeof(*t) + len + cgid_len);
 	if (t) {
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
 		t->time = ktime_to_ns(ktime_get());
 record_it:
 		t->device = bt->dev;
-		t->action = action;
+		t->action = action | (cgid ? __BLK_TN_CGROUP : 0);
 		t->pid = pid;
 		t->cpu = cpu;
-		t->pdu_len = len;
-		memcpy((void *) t + sizeof(*t), data, len);
+		t->pdu_len = len + cgid_len;
+		if (cgid)
+			memcpy((void *)t + sizeof(*t), cgid, cgid_len);
+		memcpy((void *) t + sizeof(*t) + cgid_len, data, len);
 
 		if (blk_tracer)
 			trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -122,7 +131,7 @@ static void trace_note_tsk(struct task_struct *tsk)
 	spin_lock_irqsave(&running_trace_lock, flags);
 	list_for_each_entry(bt, &running_trace_list, running_list) {
 		trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm,
-			   sizeof(tsk->comm));
+			   sizeof(tsk->comm), NULL);
 	}
 	spin_unlock_irqrestore(&running_trace_lock, flags);
 }
@@ -139,7 +148,7 @@ static void trace_note_time(struct blk_trace *bt)
 	words[1] = now.tv_nsec;
 
 	local_irq_save(flags);
-	trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words));
+	trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words), NULL);
 	local_irq_restore(flags);
 }
 
@@ -167,7 +176,7 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
 	n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
 	va_end(args);
 
-	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n);
+	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -204,7 +213,7 @@ static const u32 ddir_act[2] = { BLK_TC_ACT(BLK_TC_READ),
  */
 static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		     int op, int op_flags, u32 what, int error, int pdu_len,
-		     void *pdu_data)
+		     void *pdu_data, struct kernfs_node_id *cgid)
 {
 	struct task_struct *tsk = current;
 	struct ring_buffer_event *event = NULL;
@@ -215,6 +224,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	pid_t pid;
 	int cpu, pc = 0;
 	bool blk_tracer = blk_tracer_enabled;
+	ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;
 
 	if (unlikely(bt->trace_state != Blktrace_running && !blk_tracer))
 		return;
@@ -229,6 +239,8 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		what |= BLK_TC_ACT(BLK_TC_DISCARD);
 	if (op == REQ_OP_FLUSH)
 		what |= BLK_TC_ACT(BLK_TC_FLUSH);
+	if (cgid)
+		what |= __BLK_TA_CGROUP;
 
 	pid = tsk->pid;
 	if (act_log_check(bt, what, sector, pid))
@@ -241,7 +253,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		buffer = blk_tr->trace_buffer.buffer;
 		pc = preempt_count();
 		event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
-						  sizeof(*t) + pdu_len,
+						  sizeof(*t) + pdu_len + cgid_len,
 						  0, pc);
 		if (!event)
 			return;
@@ -258,7 +270,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	 * from coming in and stepping on our toes.
 	 */
 	local_irq_save(flags);
-	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
+	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len + cgid_len);
 	if (t) {
 		sequence = per_cpu_ptr(bt->sequence, cpu);
 
@@ -280,10 +292,12 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		t->action = what;
 		t->device = bt->dev;
 		t->error = error;
-		t->pdu_len = pdu_len;
+		t->pdu_len = pdu_len + cgid_len;
 
+		if (cgid_len)
+			memcpy((void *)t + sizeof(*t), cgid, cgid_len);
 		if (pdu_len)
-			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
+			memcpy((void *)t + sizeof(*t) + cgid_len, pdu_data, pdu_len);
 
 		if (blk_tracer) {
 			trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -684,6 +698,36 @@ void blk_trace_shutdown(struct request_queue *q)
 	}
 }
 
+#ifdef CONFIG_BLK_CGROUP
+static struct kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+	struct blk_trace *bt = q->blk_trace;
+
+	if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+		return NULL;
+
+	if (!bio->bi_css)
+		return NULL;
+	return cgroup_get_node_id(bio->bi_css->cgroup);
+}
+#else
+static struct kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+	return NULL;
+}
+#endif
+
+static struct kernfs_node_id *
+blk_trace_request_get_cgid(struct request_queue *q, struct request *rq)
+{
+	if (!rq->bio)
+		return NULL;
+	/* Use the first bio */
+	return blk_trace_bio_get_cgid(q, rq->bio);
+}
+
 /*
  * blktrace probes
  */
@@ -694,13 +738,15 @@ void blk_trace_shutdown(struct request_queue *q)
  * @error:	return status to log
  * @nr_bytes:	number of completed bytes
  * @what:	the action
+ * @cgid:	the cgroup info
  *
  * Description:
  *     Records an action against a request. Will log the bio offset + size.
  *
  **/
 static void blk_add_trace_rq(struct request *rq, int error,
-			     unsigned int nr_bytes, u32 what)
+			     unsigned int nr_bytes, u32 what,
+			     struct kernfs_node_id *cgid)
 {
 	struct blk_trace *bt = rq->q->blk_trace;
 
@@ -713,32 +759,36 @@ static void blk_add_trace_rq(struct request *rq, int error,
 		what |= BLK_TC_ACT(BLK_TC_FS);
 
 	__blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, req_op(rq),
-			rq->cmd_flags, what, error, 0, NULL);
+			rq->cmd_flags, what, error, 0, NULL, cgid);
 }
 
 static void blk_add_trace_rq_insert(void *ignore,
 				    struct request_queue *q, struct request *rq)
 {
-	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT);
+	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT,
+			 blk_trace_request_get_cgid(q, rq));
 }
 
 static void blk_add_trace_rq_issue(void *ignore,
 				   struct request_queue *q, struct request *rq)
 {
-	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE);
+	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE,
+			 blk_trace_request_get_cgid(q, rq));
 }
 
 static void blk_add_trace_rq_requeue(void *ignore,
 				     struct request_queue *q,
 				     struct request *rq)
 {
-	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE);
+	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE,
+			 blk_trace_request_get_cgid(q, rq));
 }
 
 static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
 			int error, unsigned int nr_bytes)
 {
-	blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE);
+	blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE,
+			 blk_trace_request_get_cgid(rq->q, rq));
 }
 
 /**
@@ -753,7 +803,7 @@ static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
  *
  **/
 static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
-			      u32 what, int error)
+			      u32 what, int error, struct kernfs_node_id *cgid)
 {
 	struct blk_trace *bt = q->blk_trace;
 
@@ -761,20 +811,22 @@ static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
 		return;
 
 	__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
-			bio_op(bio), bio->bi_opf, what, error, 0, NULL);
+			bio_op(bio), bio->bi_opf, what, error, 0, NULL, cgid);
 }
 
 static void blk_add_trace_bio_bounce(void *ignore,
 				     struct request_queue *q, struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_complete(void *ignore,
 				       struct request_queue *q, struct bio *bio,
 				       int error)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error);
+	blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_backmerge(void *ignore,
@@ -782,7 +834,8 @@ static void blk_add_trace_bio_backmerge(void *ignore,
 					struct request *rq,
 					struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0,
+			 blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_frontmerge(void *ignore,
@@ -790,13 +843,15 @@ static void blk_add_trace_bio_frontmerge(void *ignore,
 					 struct request *rq,
 					 struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_bio_queue(void *ignore,
 				    struct request_queue *q, struct bio *bio)
 {
-	blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0);
+	blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0,
+			  blk_trace_bio_get_cgid(q, bio));
 }
 
 static void blk_add_trace_getrq(void *ignore,
@@ -804,13 +859,14 @@ static void blk_add_trace_getrq(void *ignore,
 				struct bio *bio, int rw)
 {
 	if (bio)
-		blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0);
+		blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0,
+				  blk_trace_bio_get_cgid(q, bio));
 	else {
 		struct blk_trace *bt = q->blk_trace;
 
 		if (bt)
 			__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_GETRQ, 0, 0,
-					NULL);
+					NULL, NULL);
 	}
 }
 
@@ -820,13 +876,14 @@ static void blk_add_trace_sleeprq(void *ignore,
 				  struct bio *bio, int rw)
 {
 	if (bio)
-		blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0);
+		blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0,
+				  blk_trace_bio_get_cgid(q, bio));
 	else {
 		struct blk_trace *bt = q->blk_trace;
 
 		if (bt)
 			__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_SLEEPRQ,
-					0, 0, NULL);
+					0, 0, NULL, NULL);
 	}
 }
 
@@ -835,7 +892,7 @@ static void blk_add_trace_plug(void *ignore, struct request_queue *q)
 	struct blk_trace *bt = q->blk_trace;
 
 	if (bt)
-		__blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL);
+		__blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, NULL);
 }
 
 static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
@@ -852,7 +909,7 @@ static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
 		else
 			what = BLK_TA_UNPLUG_TIMER;
 
-		__blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+		__blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, NULL);
 	}
 }
 
@@ -868,7 +925,7 @@ static void blk_add_trace_split(void *ignore,
 		__blk_add_trace(bt, bio->bi_iter.bi_sector,
 				bio->bi_iter.bi_size, bio_op(bio), bio->bi_opf,
 				BLK_TA_SPLIT, bio->bi_error, sizeof(rpdu),
-				&rpdu);
+				&rpdu, blk_trace_bio_get_cgid(q, bio));
 	}
 }
 
@@ -901,7 +958,7 @@ static void blk_add_trace_bio_remap(void *ignore,
 
 	__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
 			bio_op(bio), bio->bi_opf, BLK_TA_REMAP, bio->bi_error,
-			sizeof(r), &r);
+			sizeof(r), &r, blk_trace_bio_get_cgid(q, bio));
 }
 
 /**
@@ -934,7 +991,7 @@ static void blk_add_trace_rq_remap(void *ignore,
 
 	__blk_add_trace(bt, blk_rq_pos(rq), blk_rq_bytes(rq),
 			rq_data_dir(rq), 0, BLK_TA_REMAP, 0,
-			sizeof(r), &r);
+			sizeof(r), &r, blk_trace_request_get_cgid(q, rq));
 }
 
 /**
@@ -958,7 +1015,8 @@ void blk_add_driver_data(struct request_queue *q,
 		return;
 
 	__blk_add_trace(bt, blk_rq_trace_sector(rq), blk_rq_bytes(rq), 0, 0,
-				BLK_TA_DRV_DATA, 0, len, data);
+				BLK_TA_DRV_DATA, 0, len, data,
+				blk_trace_request_get_cgid(q, rq));
 }
 EXPORT_SYMBOL_GPL(blk_add_driver_data);
 
@@ -1031,7 +1089,7 @@ static void fill_rwbs(char *rwbs, const struct blk_io_trace *t)
 	int i = 0;
 	int tc = t->action >> BLK_TC_SHIFT;
 
-	if (t->action == BLK_TN_MESSAGE) {
+	if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
 		rwbs[i++] = 'N';
 		goto out;
 	}
@@ -1066,9 +1124,21 @@ const struct blk_io_trace *te_blk_io_trace(const struct trace_entry *ent)
 	return (const struct blk_io_trace *)ent;
 }
 
-static inline const void *pdu_start(const struct trace_entry *ent)
+static inline const void *pdu_start(const struct trace_entry *ent, bool has_cg)
 {
-	return te_blk_io_trace(ent) + 1;
+	return (void *)(te_blk_io_trace(ent) + 1) +
+		(has_cg ? sizeof(struct kernfs_node_id) : 0);
+}
+
+static inline const void *cgid_start(const struct trace_entry *ent)
+{
+	return (void *)(te_blk_io_trace(ent) + 1);
+}
+
+static inline int pdu_real_len(const struct trace_entry *ent, bool has_cg)
+{
+	return te_blk_io_trace(ent)->pdu_len -
+			(has_cg ? sizeof(struct kernfs_node_id) : 0);
 }
 
 static inline u32 t_action(const struct trace_entry *ent)
@@ -1096,16 +1166,16 @@ static inline __u16 t_error(const struct trace_entry *ent)
 	return te_blk_io_trace(ent)->error;
 }
 
-static __u64 get_pdu_int(const struct trace_entry *ent)
+static __u64 get_pdu_int(const struct trace_entry *ent, bool has_cg)
 {
-	const __u64 *val = pdu_start(ent);
+	const __u64 *val = pdu_start(ent, has_cg);
 	return be64_to_cpu(*val);
 }
 
 static void get_pdu_remap(const struct trace_entry *ent,
-			  struct blk_io_trace_remap *r)
+			  struct blk_io_trace_remap *r, bool has_cg)
 {
-	const struct blk_io_trace_remap *__r = pdu_start(ent);
+	const struct blk_io_trace_remap *__r = pdu_start(ent, has_cg);
 	__u64 sector_from = __r->sector_from;
 
 	r->device_from = be32_to_cpu(__r->device_from);
@@ -1113,9 +1183,11 @@ static void get_pdu_remap(const struct trace_entry *ent,
 	r->sector_from = be64_to_cpu(sector_from);
 }
 
-typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act);
+typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act,
+	bool has_cg);
 
-static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
+static void blk_log_action_classic(struct trace_iterator *iter, const char *act,
+	bool has_cg)
 {
 	char rwbs[RWBS_LEN];
 	unsigned long long ts  = iter->ts;
@@ -1131,24 +1203,33 @@ static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
 			 secs, nsec_rem, iter->ent->pid, act, rwbs);
 }
 
-static void blk_log_action(struct trace_iterator *iter, const char *act)
+static void blk_log_action(struct trace_iterator *iter, const char *act,
+	bool has_cg)
 {
 	char rwbs[RWBS_LEN];
 	const struct blk_io_trace *t = te_blk_io_trace(iter->ent);
 
 	fill_rwbs(rwbs, t);
-	trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
-			 MAJOR(t->device), MINOR(t->device), act, rwbs);
+	if (has_cg) {
+		const struct kernfs_node_id *id = cgid_start(iter->ent);
+
+		trace_seq_printf(&iter->seq, "%3d,%-3d %llx,%-x %2s %3s ",
+				 MAJOR(t->device), MINOR(t->device),
+				 id->ino, id->generation, act, rwbs);
+	} else
+		trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
+				 MAJOR(t->device), MINOR(t->device), act, rwbs);
 }
 
-static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_dump_pdu(struct trace_seq *s,
+	const struct trace_entry *ent, bool has_cg)
 {
 	const unsigned char *pdu_buf;
 	int pdu_len;
 	int i, end;
 
-	pdu_buf = pdu_start(ent);
-	pdu_len = te_blk_io_trace(ent)->pdu_len;
+	pdu_buf = pdu_start(ent, has_cg);
+	pdu_len = pdu_real_len(ent, has_cg);
 
 	if (!pdu_len)
 		return;
@@ -1179,7 +1260,7 @@ static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
 	trace_seq_puts(s, ") ");
 }
 
-static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
@@ -1187,7 +1268,7 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
 
 	if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
 		trace_seq_printf(s, "%u ", t_bytes(ent));
-		blk_log_dump_pdu(s, ent);
+		blk_log_dump_pdu(s, ent, has_cg);
 		trace_seq_printf(s, "[%s]\n", cmd);
 	} else {
 		if (t_sec(ent))
@@ -1199,10 +1280,10 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
 }
 
 static void blk_log_with_error(struct trace_seq *s,
-			      const struct trace_entry *ent)
+			      const struct trace_entry *ent, bool has_cg)
 {
 	if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
-		blk_log_dump_pdu(s, ent);
+		blk_log_dump_pdu(s, ent, has_cg);
 		trace_seq_printf(s, "[%d]\n", t_error(ent));
 	} else {
 		if (t_sec(ent))
@@ -1215,18 +1296,18 @@ static void blk_log_with_error(struct trace_seq *s,
 	}
 }
 
-static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	struct blk_io_trace_remap r = { .device_from = 0, };
 
-	get_pdu_remap(ent, &r);
+	get_pdu_remap(ent, &r, has_cg);
 	trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n",
 			 t_sector(ent), t_sec(ent),
 			 MAJOR(r.device_from), MINOR(r.device_from),
 			 (unsigned long long)r.sector_from);
 }
 
-static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
@@ -1235,30 +1316,31 @@ static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
 	trace_seq_printf(s, "[%s]\n", cmd);
 }
 
-static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
 	trace_find_cmdline(ent->pid, cmd);
 
-	trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent));
+	trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg));
 }
 
-static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
 {
 	char cmd[TASK_COMM_LEN];
 
 	trace_find_cmdline(ent->pid, cmd);
 
 	trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
-			 get_pdu_int(ent), cmd);
+			 get_pdu_int(ent, has_cg), cmd);
 }
 
-static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent,
+			bool has_cg)
 {
-	const struct blk_io_trace *t = te_blk_io_trace(ent);
 
-	trace_seq_putmem(s, t + 1, t->pdu_len);
+	trace_seq_putmem(s, pdu_start(ent, has_cg),
+		pdu_real_len(ent, has_cg));
 	trace_seq_putc(s, '\n');
 }
 
@@ -1298,7 +1380,8 @@ static void blk_tracer_reset(struct trace_array *tr)
 
 static const struct {
 	const char *act[2];
-	void	   (*print)(struct trace_seq *s, const struct trace_entry *ent);
+	void	   (*print)(struct trace_seq *s, const struct trace_entry *ent,
+			    bool has_cg);
 } what2act[] = {
 	[__BLK_TA_QUEUE]	= {{  "Q", "queue" },	   blk_log_generic },
 	[__BLK_TA_BACKMERGE]	= {{  "M", "backmerge" },  blk_log_generic },
@@ -1326,23 +1409,25 @@ static enum print_line_t print_one_line(struct trace_iterator *iter,
 	u16 what;
 	bool long_act;
 	blk_log_action_t *log_action;
+	bool has_cg;
 
 	t	   = te_blk_io_trace(iter->ent);
-	what	   = t->action & ((1 << BLK_TC_SHIFT) - 1);
+	what	   = (t->action & ((1 << BLK_TC_SHIFT) - 1)) & ~__BLK_TA_CGROUP;
 	long_act   = !!(tr->trace_flags & TRACE_ITER_VERBOSE);
 	log_action = classic ? &blk_log_action_classic : &blk_log_action;
+	has_cg	   = t->action & __BLK_TA_CGROUP;
 
-	if (t->action == BLK_TN_MESSAGE) {
-		log_action(iter, long_act ? "message" : "m");
-		blk_log_msg(s, iter->ent);
+	if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
+		log_action(iter, long_act ? "message" : "m", has_cg);
+		blk_log_msg(s, iter->ent, has_cg);
 		return trace_handle_return(s);
 	}
 
 	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
 		trace_seq_printf(s, "Unknown action %x\n", what);
 	else {
-		log_action(iter, what2act[what].act[long_act]);
-		what2act[what].print(s, iter->ent);
+		log_action(iter, what2act[what].act[long_act], has_cg);
+		what2act[what].print(s, iter->ent, has_cg);
 	}
 
 	return trace_handle_return(s);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 09/12] block: always attach cgroup info into bio
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (7 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 08/12] blktrace: export cgroup info in trace Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 10/12] block: call __bio_free in bio_endio Shaohua Li
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

blkcg_bio_issue_check() already gets blkcg for a BIO.
bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
operation. There is no point we don't attach the cgroup info into bio at
blkcg_bio_issue_check. This also makes blktrace outputs correct cgroup
info.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c       | 7 +------
 include/linux/blk-cgroup.h | 3 +++
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7285bf..a6ebd2b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2104,14 +2104,9 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td)
 static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
 {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
-	int ret;
-
-	ret = bio_associate_current(bio);
-	if (ret == 0 || ret == -EBUSY)
+	if (bio->bi_css)
 		bio->bi_cg_private = tg;
 	blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
-#else
-	bio_associate_current(bio);
 #endif
 }
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7..fe10b85 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -691,6 +691,9 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 	rcu_read_lock();
 	blkcg = bio_blkcg(bio);
 
+	/* associate blkcg if bio hasn't attached one */
+	bio_associate_blkcg(bio, &blkcg->css);
+
 	blkg = blkg_lookup(blkcg, q);
 	if (unlikely(!blkg)) {
 		spin_lock_irq(q->queue_lock);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 10/12] block: call __bio_free in bio_endio
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (8 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 09/12] block: always attach cgroup info into bio Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 11/12] blktrace: add an option to allow displying cgroup path Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 12/12] block: use standard blktrace API to output cgroup info for debug notes Shaohua Li
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

bio_free isn't a good place to free cgroup/integrity info. There are a
lot of cases bio is allocated in special way (for example, in stack) and
never gets called by bio_put hence bio_free, we are leaking memory. This
patch moves the free to bio endio, which should be called anyway. The
__bio_free call in bio_free is kept, in case the bio never gets called
bio endio.

This assumes ->bi_end_io() doesn't access cgroup/integrity info, which
seems true in my audit. Otherwise, we probably must add a flag to
distinguish if bio will be called by bio_put.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio-integrity.c | 1 +
 block/bio.c           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..869ac7a 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -120,6 +120,7 @@ void bio_integrity_free(struct bio *bio)
 	}
 
 	bio->bi_integrity = NULL;
+	bio->bi_opf &= ~REQ_INTEGRITY;
 }
 EXPORT_SYMBOL(bio_integrity_free);
 
diff --git a/block/bio.c b/block/bio.c
index 888e780..9bfd8d4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1823,6 +1823,8 @@ void bio_endio(struct bio *bio)
 	}
 
 	blk_throtl_bio_endio(bio);
+	/* release cgroup/integrity info */
+	__bio_free(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 11/12] blktrace: add an option to allow displying cgroup path
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (9 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 10/12] block: call __bio_free in bio_endio Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-14 16:12 ` [PATCH V2 12/12] block: use standard blktrace API to output cgroup info for debug notes Shaohua Li
  11 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.

with the option enabled, blktrace will output something like this:
dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]

Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/kernfs/mount.c       | 19 +++++++++++++++++++
 include/linux/cgroup.h  |  6 ++++++
 include/linux/kernfs.h  |  2 ++
 kernel/cgroup/cgroup.c  | 12 ++++++++++++
 kernel/trace/blktrace.c | 14 +++++++++++++-
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 5af73a8..3bbca9c 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -91,6 +91,25 @@ static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
 	}
 }
 
+/*
+ * Similar like kernfs_fh_get_inode, this one gets kernfs node from inode
+ * number and generation
+ */
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+	const struct kernfs_node_id *id)
+{
+	struct kernfs_node *kn;
+
+	kn = kernfs_find_and_get_node_by_ino(root, id->ino);
+	if (!kn)
+		return NULL;
+	if (kn->id.generation != id->generation) {
+		kernfs_put(kn);
+		return NULL;
+	}
+	return kn;
+}
+
 static struct inode *kernfs_fh_get_inode(struct super_block *sb,
 		u64 ino, u32 generation)
 {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5ebe89f..163e5c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -613,6 +613,9 @@ static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
 {
 	return &cgrp->kn->id;
 }
+
+void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+					char *buf, size_t buflen);
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -645,6 +648,9 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 {
 	return true;
 }
+
+static inline void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+	char *buf, size_t buflen) {}
 #endif /* !CONFIG_CGROUPS */
 
 /*
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3b38bdf..e8c685f 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -358,6 +358,8 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
 
 void kernfs_init(void);
 
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+	const struct kernfs_node_id *id);
 #else	/* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 639e27d..ae5ff1c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4606,6 +4606,18 @@ static int __init cgroup_wq_init(void)
 }
 core_initcall(cgroup_wq_init);
 
+void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+					char *buf, size_t buflen)
+{
+	struct kernfs_node *kn;
+
+	kn = kernfs_get_node_by_id(cgrp_dfl_root.kf_root, id);
+	if (!kn)
+		return;
+	kernfs_path(kn, buf, buflen);
+	kernfs_put(kn);
+}
+
 /*
  * proc_cgroup_show()
  *  - Print task's cgroup paths into seq_file, one line for each hierarchy
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 02cfcf2..ad5abb9 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -48,12 +48,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC	0x1
 #define TRACE_BLK_OPT_CGROUP	0x2
+#define TRACE_BLK_OPT_CGNAME	0x4
 
 static struct tracer_opt blk_tracer_opts[] = {
 	/* Default disable the minimalistic output */
 	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
 #ifdef CONFIG_BLK_CGROUP
 	{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+	{ TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) },
 #endif
 	{ }
 };
@@ -1213,7 +1215,17 @@ static void blk_log_action(struct trace_iterator *iter, const char *act,
 	if (has_cg) {
 		const struct kernfs_node_id *id = cgid_start(iter->ent);
 
-		trace_seq_printf(&iter->seq, "%3d,%-3d %llx,%-x %2s %3s ",
+		if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) {
+			char blkcg_name_buf[NAME_MAX + 1] = "<...>";
+
+			cgroup_path_from_node_id(id, blkcg_name_buf,
+				sizeof(blkcg_name_buf));
+			trace_seq_printf(&iter->seq, "%3d,%-3d %s %2s %3s ",
+				 MAJOR(t->device), MINOR(t->device),
+				 blkcg_name_buf, act, rwbs);
+		} else
+			trace_seq_printf(&iter->seq,
+				 "%3d,%-3d %llx,%-x %2s %3s ",
 				 MAJOR(t->device), MINOR(t->device),
 				 id->ino, id->generation, act, rwbs);
 	} else
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH V2 12/12] block: use standard blktrace API to output cgroup info for debug notes
  2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
                   ` (10 preceding siblings ...)
  2017-06-14 16:12 ` [PATCH V2 11/12] blktrace: add an option to allow displying cgroup path Shaohua Li
@ 2017-06-14 16:12 ` Shaohua Li
  2017-06-15 13:56   ` kbuild test robot
  11 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2017-06-14 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: tj, gregkh, hch, axboe, rostedt, lizefan, Kernel-team, Shaohua Li

From: Shaohua Li <shli@fb.com>

Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
way. Now we have standard blktrace API for this, so convert them to use
it.

Note, this changes the behavior a little bit. cgroup info isn't output
by default, we only do this with 'blk_cgroup' option enabled. cgroup
info isn't output as a string by default too, we only do this with
'blk_cgname' option enabled. Also cgroup info is output in different
position of the note string. I think these behavior changes aren't a big
issue (actually we make trace data shorter which is good), since the
blktrace note is solely for debugging.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bfq-iosched.h          | 13 ++++++++-----
 block/blk-throttle.c         |  6 ++----
 block/cfq-iosched.c          | 15 ++++++---------
 include/linux/blk-cgroup.h   | 14 --------------
 include/linux/blktrace_api.h | 13 +++++++++----
 kernel/trace/blktrace.c      | 12 ++++++++++--
 6 files changed, 35 insertions(+), 38 deletions(-)

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 5c3bf98..ea6a6eb 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -916,13 +916,16 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	do {			\
-	blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid,\
-			bfq_bfqq_sync((bfqq)) ? 'S' : 'A',		\
-			bfqq_group(bfqq)->blkg_path, ##args);		\
+	blk_add_cgroup_trace_msg((bfqd)->queue,				\
+			bfqg_to_blkg(bfqq_group(bfqq))->blkcg,		\
+			"bfq%d%c " fmt, (bfqq)->pid,			\
+			bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args);	\
 } while (0)
 
-#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)	\
-	blk_add_trace_msg((bfqd)->queue, "%s " fmt, (bfqg)->blkg_path, ##args)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)	do {			\
+	blk_add_cgroup_trace_msg((bfqd)->queue,				\
+		bfqg_to_blkg(bfqg)->blkcg, fmt, ##args);		\
+} while (0)
 
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6ebd2b..6a4c4c4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -373,10 +373,8 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	if (likely(!blk_trace_note_message_enabled(__td->queue)))	\
 		break;							\
 	if ((__tg)) {							\
-		char __pbuf[128];					\
-									\
-		blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf));	\
-		blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, ##args); \
+		blk_add_cgroup_trace_msg(__td->queue,			\
+			tg_to_blkg(__tg)->blkcg, "throtl " fmt, ##args);\
 	} else {							\
 		blk_add_trace_msg(__td->queue, "throtl " fmt, ##args);	\
 	}								\
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b7e9c7f..6df342f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -656,20 +656,17 @@ static inline void cfqg_put(struct cfq_group *cfqg)
 }
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	do {			\
-	char __pbuf[128];						\
-									\
-	blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf));	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+	blk_add_cgroup_trace_msg((cfqd)->queue,				\
+			cfqg_to_blkg((cfqq)->cfqg)->blkcg,		\
+			"cfq%d%c%c " fmt, (cfqq)->pid,			\
 			cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
 			cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
-			  __pbuf, ##args);				\
+			  ##args);					\
 } while (0)
 
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...)	do {			\
-	char __pbuf[128];						\
-									\
-	blkg_path(cfqg_to_blkg(cfqg), __pbuf, sizeof(__pbuf));		\
-	blk_add_trace_msg((cfqd)->queue, "%s " fmt, __pbuf, ##args);	\
+	blk_add_cgroup_trace_msg((cfqd)->queue,				\
+			cfqg_to_blkg(cfqg)->blkcg, fmt, ##args);	\
 } while (0)
 
 static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index fe10b85..4df5d32 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -334,19 +334,6 @@ static inline struct blkcg *cpd_to_blkcg(struct blkcg_policy_data *cpd)
 }
 
 /**
- * blkg_path - format cgroup path of blkg
- * @blkg: blkg of interest
- * @buf: target buffer
- * @buflen: target buffer length
- *
- * Format the path of the cgroup of @blkg into @buf.
- */
-static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
-{
-	return cgroup_path(blkg->blkcg->css.cgroup, buf, buflen);
-}
-
-/**
  * blkg_get - get a blkg reference
  * @blkg: blkg to get
  *
@@ -759,7 +746,6 @@ static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
 static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
 						  struct blkcg_policy *pol) { return NULL; }
 static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
-static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
 static inline void blkg_get(struct blkcg_gq *blkg) { }
 static inline void blkg_put(struct blkcg_gq *blkg) { }
 
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index d2e9085..67b4d4d 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -28,10 +28,12 @@ struct blk_trace {
 	atomic_t dropped;
 };
 
+struct blkcg;
+
 extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
 extern void blk_trace_shutdown(struct request_queue *);
-extern __printf(2, 3)
-void __trace_note_message(struct blk_trace *, const char *fmt, ...);
+extern __printf(3, 4)
+void __trace_note_message(struct blk_trace *, struct blkcg *blkcg, const char *fmt, ...);
 
 /**
  * blk_add_trace_msg - Add a (simple) message to the blktrace stream
@@ -46,12 +48,14 @@ void __trace_note_message(struct blk_trace *, const char *fmt, ...);
  *     NOTE: Can not use 'static inline' due to presence of var args...
  *
  **/
-#define blk_add_trace_msg(q, fmt, ...)					\
+#define blk_add_cgroup_trace_msg(q, cg, fmt, ...)			\
 	do {								\
 		struct blk_trace *bt = (q)->blk_trace;			\
 		if (unlikely(bt))					\
-			__trace_note_message(bt, fmt, ##__VA_ARGS__);	\
+			__trace_note_message(bt, cg, fmt, ##__VA_ARGS__);\
 	} while (0)
+#define blk_add_trace_msg(q, fmt, ...)					\
+	blk_add_cgroup_trace_msg(q, NULL, fmt, ##__VA_ARGS__)
 #define BLK_TN_MAX_MSG		128
 
 static inline bool blk_trace_note_message_enabled(struct request_queue *q)
@@ -82,6 +86,7 @@ extern struct attribute_group blk_trace_attr_group;
 # define blk_trace_startstop(q, start)			(-ENOTTY)
 # define blk_trace_remove(q)				(-ENOTTY)
 # define blk_add_trace_msg(q, fmt, ...)			do { } while (0)
+# define blk_add_cgroup_trace_msg(q, cg, fmt, ...)	do { } while (0)
 # define blk_trace_remove_sysfs(dev)			do { } while (0)
 # define blk_trace_note_message_enabled(q)		(false)
 static inline int blk_trace_init_sysfs(struct device *dev)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index ad5abb9..4e3f747 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -154,7 +154,8 @@ static void trace_note_time(struct blk_trace *bt)
 	local_irq_restore(flags);
 }
 
-void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
+void __trace_note_message(struct blk_trace *bt, struct blkcg *blkcg,
+	const char *fmt, ...)
 {
 	int n;
 	va_list args;
@@ -178,7 +179,14 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
 	n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
 	va_end(args);
 
+	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+		blkcg = NULL;
+#ifdef CONFIG_BLK_CGROUP
+	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
+		blkcg ? cgroup_get_node_id(blkcg->css.cgroup) : NULL);
+#else
 	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
+#endif
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -375,7 +383,7 @@ static ssize_t blk_msg_write(struct file *filp, const char __user *buffer,
 		return PTR_ERR(msg);
 
 	bt = filp->private_data;
-	__trace_note_message(bt, "%s", msg);
+	__trace_note_message(bt, NULL, "%s", msg);
 	kfree(msg);
 
 	return count;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH V2 12/12] block: use standard blktrace API to output cgroup info for debug notes
  2017-06-14 16:12 ` [PATCH V2 12/12] block: use standard blktrace API to output cgroup info for debug notes Shaohua Li
@ 2017-06-15 13:56   ` kbuild test robot
  0 siblings, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2017-06-15 13:56 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kbuild-all, linux-kernel, linux-block, tj, gregkh, hch, axboe,
	rostedt, lizefan, Kernel-team, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 2206 bytes --]

Hi Shaohua,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12-rc5]
[cannot apply to driver-core/driver-core-testing block/for-next next-20170615]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shaohua-Li/kernfs-implement-i_generation/20170615-211524
config: x86_64-randconfig-x014-201724 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   block/bfq-cgroup.c: In function 'bfq_bic_update_cgroup':
>> block/bfq-cgroup.c:679:2: error: implicit declaration of function 'blkg_path' [-Werror=implicit-function-declaration]
     blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path, sizeof(bfqg->blkg_path));
     ^~~~~~~~~
   cc1: some warnings being treated as errors

vim +/blkg_path +679 block/bfq-cgroup.c

8f9bebc3 Paolo Valente 2017-06-05  673  	 * Finally, note that bfqg itself needs to be protected from
8f9bebc3 Paolo Valente 2017-06-05  674  	 * destruction on the blkg_free of the original blkg (which
8f9bebc3 Paolo Valente 2017-06-05  675  	 * invokes bfq_pd_free). We use an additional private
8f9bebc3 Paolo Valente 2017-06-05  676  	 * refcounter for bfqg, to let it disappear only after no
8f9bebc3 Paolo Valente 2017-06-05  677  	 * bfq_queue refers to it any longer.
8f9bebc3 Paolo Valente 2017-06-05  678  	 */
8f9bebc3 Paolo Valente 2017-06-05 @679  	blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path, sizeof(bfqg->blkg_path));
ea25da48 Paolo Valente 2017-04-19  680  	bic->blkcg_serial_nr = serial_nr;
ea25da48 Paolo Valente 2017-04-19  681  out:
ea25da48 Paolo Valente 2017-04-19  682  	rcu_read_unlock();

:::::: The code at line 679 was first introduced by commit
:::::: 8f9bebc33dd718283183582fc4a762e178552fb8 block, bfq: access and cache blkg data only when safe

:::::: TO: Paolo Valente <paolo.valente@linaro.org>
:::::: CC: Jens Axboe <axboe@fb.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28082 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH V2 05/12] kernfs: introduce kernfs_node_id
  2017-06-14 16:12 ` [PATCH V2 05/12] kernfs: introduce kernfs_node_id Shaohua Li
@ 2017-06-15 20:14   ` kbuild test robot
  0 siblings, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2017-06-15 20:14 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kbuild-all, linux-kernel, linux-block, tj, gregkh, hch, axboe,
	rostedt, lizefan, Kernel-team, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 2366 bytes --]

Hi Shaohua,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12-rc5 next-20170615]
[cannot apply to driver-core/driver-core-testing block/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shaohua-Li/kernfs-implement-i_generation/20170615-211524
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 6.3.0-18) 6.3.0 20170516
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sh 

All error/warnings (new ones prefixed by >>):

   In file included from fs/fs-writeback.c:98:0:
   include/trace/events/writeback.h: In function '__trace_wb_assign_cgroup':
>> include/trace/events/writeback.h:139:34: error: 'struct kernfs_node' has no member named 'ino'
     return wb->memcg_css->cgroup->kn->ino;
                                     ^~
   In file included from fs/fs-writeback.c:98:0:
>> include/trace/events/writeback.h:140:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^

vim +139 include/trace/events/writeback.h

9fb0a7da Tejun Heo 2013-01-11  133  
5634cc2a Tejun Heo 2015-08-18  134  #ifdef CREATE_TRACE_POINTS
5634cc2a Tejun Heo 2015-08-18  135  #ifdef CONFIG_CGROUP_WRITEBACK
5634cc2a Tejun Heo 2015-08-18  136  
a664edb3 Yang Shi  2016-03-03  137  static inline unsigned int __trace_wb_assign_cgroup(struct bdi_writeback *wb)
5634cc2a Tejun Heo 2015-08-18  138  {
a664edb3 Yang Shi  2016-03-03 @139  	return wb->memcg_css->cgroup->kn->ino;
5634cc2a Tejun Heo 2015-08-18 @140  }
5634cc2a Tejun Heo 2015-08-18  141  
a664edb3 Yang Shi  2016-03-03  142  static inline unsigned int __trace_wbc_assign_cgroup(struct writeback_control *wbc)
5634cc2a Tejun Heo 2015-08-18  143  {

:::::: The code at line 139 was first introduced by commit
:::::: a664edb374c704a734a0df41fc742c285a5beb52 tracing, writeback: Replace cgroup path to cgroup ino

:::::: TO: Yang Shi <yang.shi@linaro.org>
:::::: CC: Steven Rostedt <rostedt@goodmis.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45304 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-06-15 20:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-14 16:11 [PATCH V2 00/12]blktrace: output cgroup info Shaohua Li
2017-06-14 16:11 ` [PATCH V2 01/12] kernfs: implement i_generation Shaohua Li
2017-06-14 16:12 ` [PATCH V2 02/12] kernfs: use idr instead of ida to manage inode number Shaohua Li
2017-06-14 16:12 ` [PATCH V2 03/12] kernfs: add an API to get kernfs node from " Shaohua Li
2017-06-14 16:12 ` [PATCH V2 04/12] kernfs: don't set dentry->d_fsdata Shaohua Li
2017-06-14 16:12 ` [PATCH V2 05/12] kernfs: introduce kernfs_node_id Shaohua Li
2017-06-15 20:14   ` kbuild test robot
2017-06-14 16:12 ` [PATCH V2 06/12] kernfs: add exportfs operations Shaohua Li
2017-06-14 16:12 ` [PATCH V2 07/12] cgroup: export fhandle info for a cgroup Shaohua Li
2017-06-14 16:12 ` [PATCH V2 08/12] blktrace: export cgroup info in trace Shaohua Li
2017-06-14 16:12 ` [PATCH V2 09/12] block: always attach cgroup info into bio Shaohua Li
2017-06-14 16:12 ` [PATCH V2 10/12] block: call __bio_free in bio_endio Shaohua Li
2017-06-14 16:12 ` [PATCH V2 11/12] blktrace: add an option to allow displying cgroup path Shaohua Li
2017-06-14 16:12 ` [PATCH V2 12/12] block: use standard blktrace API to output cgroup info for debug notes Shaohua Li
2017-06-15 13:56   ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).