All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] tracing/events: block: bring more on a par with blktrace
@ 2018-04-13 13:07 Steffen Maier
  2018-04-13 13:07 ` [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule Steffen Maier
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Steffen Maier @ 2018-04-13 13:07 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: Steven Rostedt, Ingo Molnar, Jens Axboe, Li Zefan,
	Greg Kroah-Hartman, Steffen Maier

I had the need to understand I/O request processing in detail.
But I also had the need to enrich block traces with other trace events
including my own dynamic kprobe events. So I preferred block trace events
over blktrace to get everything nicely sorted into one ftrace output.
However, I missed device filtering for (un)plug events and also
the difference between the two flavors of unplug.

The first two patches bring block trace events closer to their
counterpart in blktrace tooling.

The last patch is just an RFC. I still kept it in this patch set because
it is inspired by PATCH 2/2.

Steffen Maier (3):
  tracing/events: block: track and print if unplug was explicit or
    schedule
  tracing/events: block: dev_t via driver core for plug and unplug
    events
  tracing/events: block: also try to get dev_t via driver core for some
    events

 include/trace/events/block.h | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

-- 
2.13.5

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule
  2018-04-13 13:07 [PATCH 0/2] tracing/events: block: bring more on a par with blktrace Steffen Maier
@ 2018-04-13 13:07 ` Steffen Maier
  2018-04-13 14:16   ` Steven Rostedt
  2018-04-13 13:07 ` [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events Steffen Maier
  2018-04-13 13:07 ` [RFC] tracing/events: block: also try to get dev_t via driver core for some events Steffen Maier
  2 siblings, 1 reply; 16+ messages in thread
From: Steffen Maier @ 2018-04-13 13:07 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: Steven Rostedt, Ingo Molnar, Jens Axboe, Li Zefan,
	Greg Kroah-Hartman, Steffen Maier

Just like blktrace distinguishes explicit and schedule by means of
BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the
existing argument "explicit" to distinguish the two cases in the one
common tracepoint block_unplug.

Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace
event correspond to the schedule() unplug") and commit d9c978331790
("block: remove block_unplug_timer() trace point").

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
---
 include/trace/events/block.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 81b43f5bdf23..a13613d27cee 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -470,6 +470,11 @@ TRACE_EVENT(block_plug,
 	TP_printk("[%s]", __entry->comm)
 );
 
+#define show_block_unplug_explicit(val)		\
+	__print_symbolic(val,			\
+			 {false, "schedule"},	\
+			 {true,	 "explicit"})
+
 DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit),
@@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_STRUCT__entry(
 		__field( int,		nr_rq			)
+		__field( bool,		explicit		)
 		__array( char,		comm,	TASK_COMM_LEN	)
 	),
 
 	TP_fast_assign(
 		__entry->nr_rq = depth;
+		__entry->explicit = explicit;
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
-	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
+	TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
+		  show_block_unplug_explicit(__entry->explicit))
 );
 
 /**
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
  2018-04-13 13:07 [PATCH 0/2] tracing/events: block: bring more on a par with blktrace Steffen Maier
  2018-04-13 13:07 ` [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule Steffen Maier
@ 2018-04-13 13:07 ` Steffen Maier
  2018-04-15  8:31   ` Greg Kroah-Hartman
  2018-04-13 13:07 ` [RFC] tracing/events: block: also try to get dev_t via driver core for some events Steffen Maier
  2 siblings, 1 reply; 16+ messages in thread
From: Steffen Maier @ 2018-04-13 13:07 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: Steven Rostedt, Ingo Molnar, Jens Axboe, Li Zefan,
	Greg Kroah-Hartman, Steffen Maier

Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") to be equivalent to traditional blktrace
output. Also this allows event filtering to not always get all (un)plug
events.

NB: The NULL pointer check for q->kobj.parent is certainly racy and
I don't have enough experience if it's good enough for a trace event.
The change did work for my cases (block device read/write I/O on
zfcp-attached SCSI disks and dm-mpath on top).

While I haven't seen any prior art using driver core (parent) relations
for trace events, there are other cases using this when no direct pointer
exists between objects, such as:
 #define to_scsi_target(d)	container_of(d, struct scsi_target, dev)
 static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
 {
	return to_scsi_target(sdev->sdev_gendev.parent);
 }

This is the object model we make use of here:

struct gendisk {
        struct hd_struct {
                struct device {      /*container_of*/
                        struct kobject kobj; <--+
                        dev_t  devt; /*deref*/  |
                } __dev;                        |
        } part0;                                |
        struct request_queue *queue; ..+        |
}                                      :        |
                                       :        |
struct request_queue {  <..............+        |
        /* queue kobject */                     |
        struct kobject {                        |
                struct kobject *parent; --------+
        } kobj;
}

The parent pointer comes from:
 #define disk_to_dev(disk)	(&(disk)->part0.__dev)
int blk_register_queue(struct gendisk *disk)
	struct device *dev = disk_to_dev(disk);
	struct request_queue *q = disk->queue;
	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
	                            ^^^parent

$ ls -d /sys/block/sdf/queue
/sys/block/sda/queue
$ cat /sys/block/sdf/dev
80:0

A partition does not have its own request queue:

$ cat /sys/block/sdf/sdf1/dev
8:81
$ ls -d /sys/block/sdf/sdf1/queue
ls: cannot access '/sys/block/sdf/sdf1/queue': No such file or directory

The difference to blktrace parsed output is that block events don't use the
partition's minor number but the containing block device's minor number:

$ dd if=/dev/sdf1 count=1

$ cat /sys/kernel/debug/tracing/trace
block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
block_bio_queue: 8,80 R 2048 + 32 [dd]
block_getrq: 8,80 R 2048 + 32 [dd]
block_plug: 8,80 [dd]
            ^^^^
block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
block_unplug: 8,80 [dd] 1 explicit
              ^^^^
block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
block_rq_complete: 8,80 R () 2048 + 32 [0]

$ btrace /dev/sdf1
  8,80   1        1     0.000000000 240240  A   R 2048 + 32 <- (8,81) 0
  8,81   1        2     0.000220890 240240  Q   R 2048 + 32 [dd]
  8,81   1        3     0.000229639 240240  G   R 2048 + 32 [dd]
  8,81   1        4     0.000231805 240240  P   N [dd]
    ^^
  8,81   1        5     0.000234671 240240  I   R 2048 + 32 [dd]
  8,81   1        6     0.000236365 240240  U   N [dd] 1
    ^^
  8,81   1        7     0.000238527 240240  D   R 2048 + 32 [dd]
  8,81   2        2     0.000613741     0  C   R 2048 + 32 [0]

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
---
 include/trace/events/block.h | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index a13613d27cee..cffedc26e8a3 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
 	TP_ARGS(q),
 
 	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
 		__array( char,		comm,	TASK_COMM_LEN	)
 	),
 
 	TP_fast_assign(
+		__entry->dev = q->kobj.parent ?
+		container_of(q->kobj.parent, struct device, kobj)->devt : 0;
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
-	TP_printk("[%s]", __entry->comm)
+	TP_printk("%d,%d [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->comm)
 );
 
 #define show_block_unplug_explicit(val)		\
@@ -482,18 +486,23 @@ DECLARE_EVENT_CLASS(block_unplug,
 	TP_ARGS(q, depth, explicit),
 
 	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
 		__field( int,		nr_rq			)
 		__field( bool,		explicit		)
 		__array( char,		comm,	TASK_COMM_LEN	)
 	),
 
 	TP_fast_assign(
+		__entry->dev   = q->kobj.parent ?
+		container_of(q->kobj.parent, struct device, kobj)->devt : 0;
 		__entry->nr_rq = depth;
 		__entry->explicit = explicit;
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
-	TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
+	TP_printk("%d,%d [%s] %d %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->comm, __entry->nr_rq,
 		  show_block_unplug_explicit(__entry->explicit))
 );
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC] tracing/events: block: also try to get dev_t via driver core for some events
  2018-04-13 13:07 [PATCH 0/2] tracing/events: block: bring more on a par with blktrace Steffen Maier
  2018-04-13 13:07 ` [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule Steffen Maier
  2018-04-13 13:07 ` [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events Steffen Maier
@ 2018-04-13 13:07 ` Steffen Maier
  2 siblings, 0 replies; 16+ messages in thread
From: Steffen Maier @ 2018-04-13 13:07 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: Steven Rostedt, Ingo Molnar, Jens Axboe, Li Zefan,
	Greg Kroah-Hartman, Steffen Maier

Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") for cases where rq->rq_disk == NULL:
  block_rq_requeue, block_rq_insert, block_rq_issue;
and for cases where bio == NULL:
  block_getrq, block_sleeprq.

NB: The NULL pointer check for q->kobj.parent is certainly racy and
I don't have enough experience if it's good enough for a trace event.

Since I don't know when above cases would occur,
I'm not sure this is worth it.

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
---
 include/trace/events/block.h | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index cffedc26e8a3..c476b7af404b 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -86,7 +86,9 @@ TRACE_EVENT(block_rq_requeue,
 	),
 
 	TP_fast_assign(
-		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+		(q->kobj.parent ?
+		 container_of(q->kobj.parent, struct device, kobj)->devt : 0);
 		__entry->sector    = blk_rq_trace_sector(rq);
 		__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
 
@@ -162,7 +164,9 @@ DECLARE_EVENT_CLASS(block_rq,
 	),
 
 	TP_fast_assign(
-		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+		(q->kobj.parent ?
+		 container_of(q->kobj.parent, struct device, kobj)->devt : 0);
 		__entry->sector    = blk_rq_trace_sector(rq);
 		__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
 		__entry->bytes     = blk_rq_bytes(rq);
@@ -397,7 +401,9 @@ DECLARE_EVENT_CLASS(block_get_rq,
         ),
 
 	TP_fast_assign(
-		__entry->dev		= bio ? bio_dev(bio) : 0;
+		__entry->dev		= bio ? bio_dev(bio) :
+		(q->kobj.parent ?
+		 container_of(q->kobj.parent, struct device, kobj)->devt : 0);
 		__entry->sector		= bio ? bio->bi_iter.bi_sector : 0;
 		__entry->nr_sector	= bio ? bio_sectors(bio) : 0;
 		blk_fill_rwbs(__entry->rwbs,
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule
  2018-04-13 13:07 ` [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule Steffen Maier
@ 2018-04-13 14:16   ` Steven Rostedt
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2018-04-13 14:16 UTC (permalink / raw)
  To: Steffen Maier
  Cc: linux-kernel, linux-block, Ingo Molnar, Jens Axboe, Li Zefan,
	Greg Kroah-Hartman

On Fri, 13 Apr 2018 15:07:17 +0200
Steffen Maier <maier@linux.ibm.com> wrote:

> Just like blktrace distinguishes explicit and schedule by means of
> BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the
> existing argument "explicit" to distinguish the two cases in the one
> common tracepoint block_unplug.
> 
> Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace
> event correspond to the schedule() unplug") and commit d9c978331790
> ("block: remove block_unplug_timer() trace point").
> 
> Signed-off-by: Steffen Maier <maier@linux.ibm.com>
> ---
>  include/trace/events/block.h | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 81b43f5bdf23..a13613d27cee 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -470,6 +470,11 @@ TRACE_EVENT(block_plug,
>  	TP_printk("[%s]", __entry->comm)
>  );
>  
> +#define show_block_unplug_explicit(val)		\
> +	__print_symbolic(val,			\
> +			 {false, "schedule"},	\
> +			 {true,	 "explicit"})

That's new. I haven't seen "true"/"false" values used for
print_symbolic before. But could you please use 1 and 0 instead, because
perf and trace-cmd won't be able to parse that. I could update
libtraceevent to handle it, but really, the first parameter is suppose
to be numeric.

-- Steve

> +
>  DECLARE_EVENT_CLASS(block_unplug,
>  
>  	TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit),
> @@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug,
>  
>  	TP_STRUCT__entry(
>  		__field( int,		nr_rq			)
> +		__field( bool,		explicit		)
>  		__array( char,		comm,	TASK_COMM_LEN	)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->nr_rq = depth;
> +		__entry->explicit = explicit;
>  		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>  	),
>  
> -	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
> +	TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
> +		  show_block_unplug_explicit(__entry->explicit))
>  );
>  
>  /**

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
  2018-04-13 13:07 ` [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events Steffen Maier
@ 2018-04-15  8:31   ` Greg Kroah-Hartman
  2018-04-16 16:33       ` Steffen Maier
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Kroah-Hartman @ 2018-04-15  8:31 UTC (permalink / raw)
  To: Steffen Maier
  Cc: linux-kernel, linux-block, Steven Rostedt, Ingo Molnar,
	Jens Axboe, Li Zefan

On Fri, Apr 13, 2018 at 03:07:18PM +0200, Steffen Maier wrote:
> Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
> trace points to TRACE_EVENT()") to be equivalent to traditional blktrace
> output. Also this allows event filtering to not always get all (un)plug
> events.
> 
> NB: The NULL pointer check for q->kobj.parent is certainly racy and
> I don't have enough experience if it's good enough for a trace event.
> The change did work for my cases (block device read/write I/O on
> zfcp-attached SCSI disks and dm-mpath on top).
> 
> While I haven't seen any prior art using driver core (parent) relations
> for trace events, there are other cases using this when no direct pointer
> exists between objects, such as:
>  #define to_scsi_target(d)	container_of(d, struct scsi_target, dev)
>  static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
>  {
> 	return to_scsi_target(sdev->sdev_gendev.parent);
>  }

That is because you "know" the parent of a target device is a
scsi_target.

> This is the object model we make use of here:
> 
> struct gendisk {
>         struct hd_struct {
>                 struct device {      /*container_of*/
>                         struct kobject kobj; <--+
>                         dev_t  devt; /*deref*/  |
>                 } __dev;                        |
>         } part0;                                |
>         struct request_queue *queue; ..+        |
> }                                      :        |
>                                        :        |
> struct request_queue {  <..............+        |
>         /* queue kobject */                     |
>         struct kobject {                        |
>                 struct kobject *parent; --------+

Are you sure about this?

>         } kobj;
> }
> 
> The parent pointer comes from:
>  #define disk_to_dev(disk)	(&(disk)->part0.__dev)
> int blk_register_queue(struct gendisk *disk)
> 	struct device *dev = disk_to_dev(disk);
> 	struct request_queue *q = disk->queue;
> 	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
> 	                            ^^^parent
> 
> $ ls -d /sys/block/sdf/queue
> /sys/block/sda/queue
> $ cat /sys/block/sdf/dev
> 80:0
> 
> A partition does not have its own request queue:
> 
> $ cat /sys/block/sdf/sdf1/dev
> 8:81
> $ ls -d /sys/block/sdf/sdf1/queue
> ls: cannot access '/sys/block/sdf/sdf1/queue': No such file or directory
> 
> The difference to blktrace parsed output is that block events don't use the
> partition's minor number but the containing block device's minor number:

Why do you want the block device's minor number here?  What is wrong
with the partition's minor number?  I would think you want that instead.

> 
> $ dd if=/dev/sdf1 count=1
> 
> $ cat /sys/kernel/debug/tracing/trace
> block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
> block_bio_queue: 8,80 R 2048 + 32 [dd]
> block_getrq: 8,80 R 2048 + 32 [dd]
> block_plug: 8,80 [dd]
>             ^^^^
> block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
> block_unplug: 8,80 [dd] 1 explicit
>               ^^^^
> block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
> block_rq_complete: 8,80 R () 2048 + 32 [0]
> 
> $ btrace /dev/sdf1
>   8,80   1        1     0.000000000 240240  A   R 2048 + 32 <- (8,81) 0
>   8,81   1        2     0.000220890 240240  Q   R 2048 + 32 [dd]
>   8,81   1        3     0.000229639 240240  G   R 2048 + 32 [dd]
>   8,81   1        4     0.000231805 240240  P   N [dd]
>     ^^
>   8,81   1        5     0.000234671 240240  I   R 2048 + 32 [dd]
>   8,81   1        6     0.000236365 240240  U   N [dd] 1
>     ^^
>   8,81   1        7     0.000238527 240240  D   R 2048 + 32 [dd]
>   8,81   2        2     0.000613741     0  C   R 2048 + 32 [0]
> 
> Signed-off-by: Steffen Maier <maier@linux.ibm.com>
> ---
>  include/trace/events/block.h | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index a13613d27cee..cffedc26e8a3 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
>  	TP_ARGS(q),
>  
>  	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
>  		__array( char,		comm,	TASK_COMM_LEN	)
>  	),
>  
>  	TP_fast_assign(
> +		__entry->dev = q->kobj.parent ?
> +		container_of(q->kobj.parent, struct device, kobj)->devt : 0;

That really really really scares me.  It feels very fragile and messing
with parent pointers is ripe for things breaking in the future in odd
and unexplainable ways.

And how can the parent be NULL?

>  		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>  	),
>  
> -	TP_printk("[%s]", __entry->comm)
> +	TP_printk("%d,%d [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->comm)
>  );
>  
>  #define show_block_unplug_explicit(val)		\
> @@ -482,18 +486,23 @@ DECLARE_EVENT_CLASS(block_unplug,
>  	TP_ARGS(q, depth, explicit),
>  
>  	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
>  		__field( int,		nr_rq			)
>  		__field( bool,		explicit		)
>  		__array( char,		comm,	TASK_COMM_LEN	)
>  	),
>  
>  	TP_fast_assign(
> +		__entry->dev   = q->kobj.parent ?
> +		container_of(q->kobj.parent, struct device, kobj)->devt : 0;

Again, how can the parent be null?

>  		__entry->nr_rq = depth;
>  		__entry->explicit = explicit;
>  		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>  	),
>  
> -	TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
> +	TP_printk("%d,%d [%s] %d %s",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),

I don't know the block layer but this feels very wrong to me.  Are you
sure there isn't some other way to get this info?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
  2018-04-15  8:31   ` Greg Kroah-Hartman
@ 2018-04-16 16:33       ` Steffen Maier
  0 siblings, 0 replies; 16+ messages in thread
From: Steffen Maier @ 2018-04-16 16:33 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-kernel, linux-block, Steven Rostedt, Ingo Molnar,
	Jens Axboe, Li Zefan, Li Zefan

Hi Greg,

On 04/15/2018 10:31 AM, Greg Kroah-Hartman wrote:
> On Fri, Apr 13, 2018 at 03:07:18PM +0200, Steffen Maier wrote:
>> Complements v2.6.31 commit 55782138e47d ("tracing/events: convert bloc=
k
>> trace points to TRACE_EVENT()") to be equivalent to traditional blktra=
ce
>> output. Also this allows event filtering to not always get all (un)plu=
g
>> events.
>>
>> NB: The NULL pointer check for q->kobj.parent is certainly racy and
>> I don't have enough experience if it's good enough for a trace event.
>> The change did work for my cases (block device read/write I/O on
>> zfcp-attached SCSI disks and dm-mpath on top).
>>
>> While I haven't seen any prior art using driver core (parent) relation=
s
>> for trace events, there are other cases using this when no direct poin=
ter
>> exists between objects, such as:
>>   #define to_scsi_target(d)	container_of(d, struct scsi_target, dev)
>>   static inline struct scsi_target *scsi_target(struct scsi_device *sd=
ev)
>>   {
>> 	return to_scsi_target(sdev->sdev_gendev.parent);
>>   }
>=20
> That is because you "know" the parent of a target device is a
> scsi_target.

true

>> This is the object model we make use of here:
>>
>> struct gendisk {
>>          struct hd_struct {
>>                  struct device {      /*container_of*/
>>                          struct kobject kobj; <--+
>>                          dev_t  devt; /*deref*/  |
>>                  } __dev;                        |
>>          } part0;                                |
>>          struct request_queue *queue; ..+        |
>> }                                      :        |
>>                                         :        |
>> struct request_queue {  <..............+        |
>>          /* queue kobject */                     |
>>          struct kobject {                        |
>>                  struct kobject *parent; --------+
>=20
> Are you sure about this?

I double checked it with crash on a running system chasing pointers and=20
looking at structure debug symbols.
But of course I cannot guarantee it's always been like this and will be.

>>          } kobj;
>> }
>>

>> The difference to blktrace parsed output is that block events don't us=
e the
>> partition's minor number but the containing block device's minor numbe=
r:
>=20
> Why do you want the block device's minor number here?  What is wrong
> with the partition's minor number?  I would think you want that instead=
=2E

No change introduced with my patch. I just describe state of the art=20
since the mentioned=20
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit=
/?id=3D55782138e47d.
It (or even its predecessor) used request_queue as trace function=20
argument (plus mostly either request or bio). So that's the currently=20
available context for these events. My change is consistent with that.
But then again, it's not much of a problem as we do have the remap event =

which shows the mapping from partition to blockdev.

blktrace, hooking with callbacks on the block trace events, has its own=20
context information [struct blk_trace] and can get to e.g. the dev_t=20
with its own real pointers without using driver core relations. But I=20
had the impression that's only wired if one uses the blktrace IOCTL or=20
the blk tracer [do_blk_trace_setup()], not for "pure" block events.
> static void blk_add_trace_plug(void *ignore, struct request_queue *q)
> {
> 	struct blk_trace *bt =3D q->blk_trace;
                                ^^^^^^^^^^^^
>=20
> 	if (bt)
> 		__blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, NULL);
> }
>=20
> static void blk_add_trace_unplug(void *ignore, struct request_queue *q,=

> 				    unsigned int depth, bool explicit)
> {
> 	struct blk_trace *bt =3D q->blk_trace;
                                ^^^^^^^^^^^^
>=20
> 	if (bt) {
> 		__be64 rpdu =3D cpu_to_be64(depth);
> 		u32 what;
>=20
> 		if (explicit)
> 			what =3D BLK_TA_UNPLUG_IO;
> 		else
> 			what =3D BLK_TA_UNPLUG_TIMER;
>=20
> 		__blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, NULL);
> 	}
> }

> struct blk_trace {
> 	int trace_state;
> 	struct rchan *rchan;
> 	unsigned long __percpu *sequence;
> 	unsigned char __percpu *msg_data;
> 	u16 act_mask;
> 	u64 start_lba;
> 	u64 end_lba;
> 	u32 pid;
> 	u32 dev;
             ^^^
> 	struct dentry *dir;
> 	struct dentry *dropped_file;
> 	struct dentry *msg_file;
> 	struct list_head running_list;
> 	atomic_t dropped;
> };



>> $ dd if=3D/dev/sdf1 count=3D1
>>
>> $ cat /sys/kernel/debug/tracing/trace
>> block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
>> block_bio_queue: 8,80 R 2048 + 32 [dd]
>> block_getrq: 8,80 R 2048 + 32 [dd]
>> block_plug: 8,80 [dd]
>>              ^^^^
>> block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
>> block_unplug: 8,80 [dd] 1 explicit
>>                ^^^^
>> block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
>> block_rq_complete: 8,80 R () 2048 + 32 [0]

>> diff --git a/include/trace/events/block.h b/include/trace/events/block=
=2Eh
>> index a13613d27cee..cffedc26e8a3 100644
>> --- a/include/trace/events/block.h
>> +++ b/include/trace/events/block.h
>> @@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
>>   	TP_ARGS(q),
>>  =20
>>   	TP_STRUCT__entry(
>> +		__field( dev_t,		dev			)
>>   		__array( char,		comm,	TASK_COMM_LEN	)
>>   	),
>>  =20
>>   	TP_fast_assign(
>> +		__entry->dev =3D q->kobj.parent ?
>> +		container_of(q->kobj.parent, struct device, kobj)->devt : 0;
>=20
> That really really really scares me.  It feels very fragile and messing=

> with parent pointers is ripe for things breaking in the future in odd
> and unexplainable ways.
>=20
> And how can the parent be NULL?

I'm hoping for help by block layer experts.

I suppose block device unplug/removal could be a case. But I don't know=20
the details how this works and if object release is protected while I/O=20
is pending and new I/O is rejected beforehand. That might make my=20
approach safe as it would not call the trace functions while the parent=20
pointer changes.

> I don't know the block layer but this feels very wrong to me.  Are you
> sure there isn't some other way to get this info?

No, I'm not sure at all. But I'm no block layer expert either. This is=20
just an idea I had which did work for my cases and I'm looking for=20
confirmation or denial by the experts. So if it's not safe from a block=20
layer point of view either, then I have to ditch it.

--=20
Mit freundlichen Gr=C3=BC=C3=9Fen / Kind regards
Steffen Maier

Linux on z Systems Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
@ 2018-04-16 16:33       ` Steffen Maier
  0 siblings, 0 replies; 16+ messages in thread
From: Steffen Maier @ 2018-04-16 16:33 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-kernel, linux-block, Steven Rostedt, Ingo Molnar,
	Jens Axboe, Li Zefan, Li Zefan

Hi Greg,

On 04/15/2018 10:31 AM, Greg Kroah-Hartman wrote:
> On Fri, Apr 13, 2018 at 03:07:18PM +0200, Steffen Maier wrote:
>> Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
>> trace points to TRACE_EVENT()") to be equivalent to traditional blktrace
>> output. Also this allows event filtering to not always get all (un)plug
>> events.
>>
>> NB: The NULL pointer check for q->kobj.parent is certainly racy and
>> I don't have enough experience if it's good enough for a trace event.
>> The change did work for my cases (block device read/write I/O on
>> zfcp-attached SCSI disks and dm-mpath on top).
>>
>> While I haven't seen any prior art using driver core (parent) relations
>> for trace events, there are other cases using this when no direct pointer
>> exists between objects, such as:
>>   #define to_scsi_target(d)	container_of(d, struct scsi_target, dev)
>>   static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
>>   {
>> 	return to_scsi_target(sdev->sdev_gendev.parent);
>>   }
> 
> That is because you "know" the parent of a target device is a
> scsi_target.

true

>> This is the object model we make use of here:
>>
>> struct gendisk {
>>          struct hd_struct {
>>                  struct device {      /*container_of*/
>>                          struct kobject kobj; <--+
>>                          dev_t  devt; /*deref*/  |
>>                  } __dev;                        |
>>          } part0;                                |
>>          struct request_queue *queue; ..+        |
>> }                                      :        |
>>                                         :        |
>> struct request_queue {  <..............+        |
>>          /* queue kobject */                     |
>>          struct kobject {                        |
>>                  struct kobject *parent; --------+
> 
> Are you sure about this?

I double checked it with crash on a running system chasing pointers and 
looking at structure debug symbols.
But of course I cannot guarantee it's always been like this and will be.

>>          } kobj;
>> }
>>

>> The difference to blktrace parsed output is that block events don't use the
>> partition's minor number but the containing block device's minor number:
> 
> Why do you want the block device's minor number here?  What is wrong
> with the partition's minor number?  I would think you want that instead.

No change introduced with my patch. I just describe state of the art 
since the mentioned 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=55782138e47d.
It (or even its predecessor) used request_queue as trace function 
argument (plus mostly either request or bio). So that's the currently 
available context for these events. My change is consistent with that.
But then again, it's not much of a problem as we do have the remap event 
which shows the mapping from partition to blockdev.

blktrace, hooking with callbacks on the block trace events, has its own 
context information [struct blk_trace] and can get to e.g. the dev_t 
with its own real pointers without using driver core relations. But I 
had the impression that's only wired if one uses the blktrace IOCTL or 
the blk tracer [do_blk_trace_setup()], not for "pure" block events.
> static void blk_add_trace_plug(void *ignore, struct request_queue *q)
> {
> 	struct blk_trace *bt = q->blk_trace;
                                ^^^^^^^^^^^^
> 
> 	if (bt)
> 		__blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, NULL);
> }
> 
> static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
> 				    unsigned int depth, bool explicit)
> {
> 	struct blk_trace *bt = q->blk_trace;
                                ^^^^^^^^^^^^
> 
> 	if (bt) {
> 		__be64 rpdu = cpu_to_be64(depth);
> 		u32 what;
> 
> 		if (explicit)
> 			what = BLK_TA_UNPLUG_IO;
> 		else
> 			what = BLK_TA_UNPLUG_TIMER;
> 
> 		__blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, NULL);
> 	}
> }

> struct blk_trace {
> 	int trace_state;
> 	struct rchan *rchan;
> 	unsigned long __percpu *sequence;
> 	unsigned char __percpu *msg_data;
> 	u16 act_mask;
> 	u64 start_lba;
> 	u64 end_lba;
> 	u32 pid;
> 	u32 dev;
             ^^^
> 	struct dentry *dir;
> 	struct dentry *dropped_file;
> 	struct dentry *msg_file;
> 	struct list_head running_list;
> 	atomic_t dropped;
> };



>> $ dd if=/dev/sdf1 count=1
>>
>> $ cat /sys/kernel/debug/tracing/trace
>> block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
>> block_bio_queue: 8,80 R 2048 + 32 [dd]
>> block_getrq: 8,80 R 2048 + 32 [dd]
>> block_plug: 8,80 [dd]
>>              ^^^^
>> block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
>> block_unplug: 8,80 [dd] 1 explicit
>>                ^^^^
>> block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
>> block_rq_complete: 8,80 R () 2048 + 32 [0]

>> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
>> index a13613d27cee..cffedc26e8a3 100644
>> --- a/include/trace/events/block.h
>> +++ b/include/trace/events/block.h
>> @@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
>>   	TP_ARGS(q),
>>   
>>   	TP_STRUCT__entry(
>> +		__field( dev_t,		dev			)
>>   		__array( char,		comm,	TASK_COMM_LEN	)
>>   	),
>>   
>>   	TP_fast_assign(
>> +		__entry->dev = q->kobj.parent ?
>> +		container_of(q->kobj.parent, struct device, kobj)->devt : 0;
> 
> That really really really scares me.  It feels very fragile and messing
> with parent pointers is ripe for things breaking in the future in odd
> and unexplainable ways.
> 
> And how can the parent be NULL?

I'm hoping for help by block layer experts.

I suppose block device unplug/removal could be a case. But I don't know 
the details how this works and if object release is protected while I/O 
is pending and new I/O is rejected beforehand. That might make my 
approach safe as it would not call the trace functions while the parent 
pointer changes.

> I don't know the block layer but this feels very wrong to me.  Are you
> sure there isn't some other way to get this info?

No, I'm not sure at all. But I'm no block layer expert either. This is 
just an idea I had which did work for my cases and I'm looking for 
confirmation or denial by the experts. So if it's not safe from a block 
layer point of view either, then I have to ditch it.

-- 
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on z Systems Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
  2018-04-16 16:33       ` Steffen Maier
@ 2018-04-19 19:24         ` Omar Sandoval
  -1 siblings, 0 replies; 16+ messages in thread
From: Omar Sandoval @ 2018-04-19 19:24 UTC (permalink / raw)
  To: Steffen Maier
  Cc: Greg Kroah-Hartman, linux-kernel, linux-block, Steven Rostedt,
	Ingo Molnar, Jens Axboe, Li Zefan, Li Zefan, Bart Van Assche

On Mon, Apr 16, 2018 at 06:33:27PM +0200, Steffen Maier wrote:
> Hi Greg,
> 
> On 04/15/2018 10:31 AM, Greg Kroah-Hartman wrote:
> > On Fri, Apr 13, 2018 at 03:07:18PM +0200, Steffen Maier wrote:
> > > diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> > > index a13613d27cee..cffedc26e8a3 100644
> > > --- a/include/trace/events/block.h
> > > +++ b/include/trace/events/block.h
> > > @@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
> > >   	TP_ARGS(q),
> > >   	TP_STRUCT__entry(
> > > +		__field( dev_t,		dev			)
> > >   		__array( char,		comm,	TASK_COMM_LEN	)
> > >   	),
> > >   	TP_fast_assign(
> > > +		__entry->dev = q->kobj.parent ?
> > > +		container_of(q->kobj.parent, struct device, kobj)->devt : 0;
> > 
> > That really really really scares me.  It feels very fragile and messing
> > with parent pointers is ripe for things breaking in the future in odd
> > and unexplainable ways.
> > 
> > And how can the parent be NULL?
> 
> I'm hoping for help by block layer experts.
> 
> I suppose block device unplug/removal could be a case. But I don't know the
> details how this works and if object release is protected while I/O is
> pending and new I/O is rejected beforehand. That might make my approach safe
> as it would not call the trace functions while the parent pointer changes.

We quiesce and freeze the queue before tearing it down, so it won't be
NULL while we're still doing I/O. Cc'ing Bart in case I'm lying to you,
though ;)

> > I don't know the block layer but this feels very wrong to me.  Are you
> > sure there isn't some other way to get this info?
> 
> No, I'm not sure at all. But I'm no block layer expert either. This is just
> an idea I had which did work for my cases and I'm looking for confirmation
> or denial by the experts. So if it's not safe from a block layer point of
> view either, then I have to ditch it.

There's not a pretty way to do this, but using the proper helpers would
be preferred:

disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)))

The parent of a request_queue is always a gendisk, so this should always
work.

> -- 
> Mit freundlichen Gr��en / Kind regards
> Steffen Maier
> 
> Linux on z Systems Development
> 
> IBM Deutschland Research & Development GmbH
> Vorsitzende des Aufsichtsrats: Martina Koederitz
> Geschaeftsfuehrung: Dirk Wittkopp
> Sitz der Gesellschaft: Boeblingen
> Registergericht: Amtsgericht Stuttgart, HRB 243294
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
@ 2018-04-19 19:24         ` Omar Sandoval
  0 siblings, 0 replies; 16+ messages in thread
From: Omar Sandoval @ 2018-04-19 19:24 UTC (permalink / raw)
  To: Steffen Maier
  Cc: Greg Kroah-Hartman, linux-kernel, linux-block, Steven Rostedt,
	Ingo Molnar, Jens Axboe, Li Zefan, Li Zefan, Bart Van Assche

On Mon, Apr 16, 2018 at 06:33:27PM +0200, Steffen Maier wrote:
> Hi Greg,
> 
> On 04/15/2018 10:31 AM, Greg Kroah-Hartman wrote:
> > On Fri, Apr 13, 2018 at 03:07:18PM +0200, Steffen Maier wrote:
> > > diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> > > index a13613d27cee..cffedc26e8a3 100644
> > > --- a/include/trace/events/block.h
> > > +++ b/include/trace/events/block.h
> > > @@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
> > >   	TP_ARGS(q),
> > >   	TP_STRUCT__entry(
> > > +		__field( dev_t,		dev			)
> > >   		__array( char,		comm,	TASK_COMM_LEN	)
> > >   	),
> > >   	TP_fast_assign(
> > > +		__entry->dev = q->kobj.parent ?
> > > +		container_of(q->kobj.parent, struct device, kobj)->devt : 0;
> > 
> > That really really really scares me.  It feels very fragile and messing
> > with parent pointers is ripe for things breaking in the future in odd
> > and unexplainable ways.
> > 
> > And how can the parent be NULL?
> 
> I'm hoping for help by block layer experts.
> 
> I suppose block device unplug/removal could be a case. But I don't know the
> details how this works and if object release is protected while I/O is
> pending and new I/O is rejected beforehand. That might make my approach safe
> as it would not call the trace functions while the parent pointer changes.

We quiesce and freeze the queue before tearing it down, so it won't be
NULL while we're still doing I/O. Cc'ing Bart in case I'm lying to you,
though ;)

> > I don't know the block layer but this feels very wrong to me.  Are you
> > sure there isn't some other way to get this info?
> 
> No, I'm not sure at all. But I'm no block layer expert either. This is just
> an idea I had which did work for my cases and I'm looking for confirmation
> or denial by the experts. So if it's not safe from a block layer point of
> view either, then I have to ditch it.

There's not a pretty way to do this, but using the proper helpers would
be preferred:

disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)))

The parent of a request_queue is always a gendisk, so this should always
work.

> -- 
> Mit freundlichen Grüßen / Kind regards
> Steffen Maier
> 
> Linux on z Systems Development
> 
> IBM Deutschland Research & Development GmbH
> Vorsitzende des Aufsichtsrats: Martina Koederitz
> Geschaeftsfuehrung: Dirk Wittkopp
> Sitz der Gesellschaft: Boeblingen
> Registergericht: Amtsgericht Stuttgart, HRB 243294
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
  2018-04-19 19:24         ` Omar Sandoval
@ 2018-04-19 20:56           ` Bart Van Assche
  -1 siblings, 0 replies; 16+ messages in thread
From: Bart Van Assche @ 2018-04-19 20:56 UTC (permalink / raw)
  To: maier, osandov
  Cc: lizefan, linux-kernel, linux-block, rostedt, axboe, mingo, gregkh, lizf

T24gVGh1LCAyMDE4LTA0LTE5IGF0IDEyOjI0IC0wNzAwLCBPbWFyIFNhbmRvdmFsIHdyb3RlOg0K
PiBXZSBxdWllc2NlIGFuZCBmcmVlemUgdGhlIHF1ZXVlIGJlZm9yZSB0ZWFyaW5nIGl0IGRvd24s
IHNvIGl0IHdvbid0IGJlDQo+IE5VTEwgd2hpbGUgd2UncmUgc3RpbGwgZG9pbmcgSS9PLiBDYydp
bmcgQmFydCBpbiBjYXNlIEknbSBseWluZyB0byB5b3UsDQo+IHRob3VnaCA7KQ0KDQpibGtfY2xl
YW51cF9xdWV1ZSgpIHdhaXRzIHVudGlsIGFsbCBJL08gcmVxdWVzdHMgaGF2ZSBmaW5pc2hlZC4g
U2luY2UgdGhlDQpibG9jayBsYXllciB0cmFjaW5nIGNvZGUgaXMgdHJpZ2dlcmVkIGZyb20gaW5z
aWRlIHRoZSBjb2RlIHRoYXQgcHJvY2Vzc2VzIGENCnJlcXVlc3QgaXQgaXMgc2FmZSB0byBhY2Nl
c3MgdGhlIHJlcXVlc3QgcG9pbnRlciBmcm9tIGluc2lkZSB0aGUgdHJhY2luZyBjb2RlLg0KQnV0
IEkgdGhpbmsgdGhlIHF1ZXN0aW9uIHdhcyBhYm91dCB0aGUgcGFyZW50IG9mIHRoZSByZXF1ZXN0
IHF1ZXVlIGtvYmoNCmluc3RlYWQgb2YgYWJvdXQgdGhlIHJlcXVlc3QgcXVldWUgcG9pbnRlciwg
dGhlIGRldmljZSBzdHJ1Y3R1cmUgdGhhdCBpcw0KZW1iZWRkZWQgaW4gc3RydWN0IGdlbmRpc2s/
IEkgdGhpbmsgdGhhdCBwYXJlbnQgY2FuIGRpc2FwcGVhciBhdCBhbnkgdGltZS4NCk1vc3QgYmxv
Y2sgZHJpdmVycyBjYWxsIGRlbF9nZW5kaXNrKCkgYmVmb3JlIHRoZXkgY2FsbCBibGtfY2xlYW51
cF9xdWV1ZSgpLg0KVW5sZXNzIEknbSBvdmVybG9va2luZyBzb21ldGhpbmcgSSB0aGluayB0aGUg
cmVxdWVzdCBxdWV1ZSB3aWxsIG5lZWQgdG8NCm9idGFpbiBhIHJlZmVyZW5jZSB0byB0aGUgZ2Vu
ZGlzayBkZXZpY2UgZnJvbSBpbnNpZGUgYmxrX3JlZ2lzdGVyX3F1ZXVlKCkNCmFuZCBkcm9wIHRo
YXQgcmVmZXJlbmNlIGZyb20gaW5zaWRlIGJsa19jbGVhbnVwX3F1ZXVlKCkgdG8gbWFrZSBpdCBz
YWZlIGZvcg0KdGhlIHRyYWNpbmcgY29kZSB0byBhY2Nlc3Mgc3RydWN0IGdlbmRpc2suDQoNCkJh
cnQuDQoNCg0KDQo=

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
@ 2018-04-19 20:56           ` Bart Van Assche
  0 siblings, 0 replies; 16+ messages in thread
From: Bart Van Assche @ 2018-04-19 20:56 UTC (permalink / raw)
  To: maier, osandov
  Cc: lizefan, linux-kernel, linux-block, rostedt, axboe, mingo, gregkh, lizf

On Thu, 2018-04-19 at 12:24 -0700, Omar Sandoval wrote:
> We quiesce and freeze the queue before tearing it down, so it won't be
> NULL while we're still doing I/O. Cc'ing Bart in case I'm lying to you,
> though ;)

blk_cleanup_queue() waits until all I/O requests have finished. Since the
block layer tracing code is triggered from inside the code that processes a
request it is safe to access the request pointer from inside the tracing code.
But I think the question was about the parent of the request queue kobj
instead of about the request queue pointer, the device structure that is
embedded in struct gendisk? I think that parent can disappear at any time.
Most block drivers call del_gendisk() before they call blk_cleanup_queue().
Unless I'm overlooking something I think the request queue will need to
obtain a reference to the gendisk device from inside blk_register_queue()
and drop that reference from inside blk_cleanup_queue() to make it safe for
the tracing code to access struct gendisk.

Bart.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
  2018-04-19 20:56           ` Bart Van Assche
@ 2018-04-24 14:49             ` Steffen Maier
  -1 siblings, 0 replies; 16+ messages in thread
From: Steffen Maier @ 2018-04-24 14:49 UTC (permalink / raw)
  To: Bart Van Assche, osandov
  Cc: lizefan, linux-kernel, linux-block, rostedt, axboe, mingo, gregkh


On 04/19/2018 10:56 PM, Bart Van Assche wrote:
> On Thu, 2018-04-19 at 12:24 -0700, Omar Sandoval wrote:
>> We quiesce and freeze the queue before tearing it down, so it won't be=

>> NULL while we're still doing I/O. Cc'ing Bart in case I'm lying to you=
,
>> though ;)
>=20
> blk_cleanup_queue() waits until all I/O requests have finished. Since t=
he
> block layer tracing code is triggered from inside the code that process=
es a
> request it is safe to access the request pointer from inside the tracin=
g code.
> But I think the question was about the parent of the request queue kobj=

> instead of about the request queue pointer, the device structure that i=
s
> embedded in struct gendisk? I think that parent can disappear at any ti=
me.
> Most block drivers call del_gendisk() before they call blk_cleanup_queu=
e().
> Unless I'm overlooking something I think the request queue will need to=

> obtain a reference to the gendisk device from inside blk_register_queue=
()
> and drop that reference from inside blk_cleanup_queue() to make it safe=
for
> the tracing code to access struct gendisk.

Got it, thanks, makes sense.

I did some research, tried the refcount approach, and failed.

Experiment Preparation:
Stopped any udevd and lvm services to avoid additional I/O.
Used the following debug/tracing setup:
modprobe sd_mod
echo -n 'module core +pt; module kobject +pt' > /sys/kernel/debug/dynamic=
_debug/control
trace-cmd start -e block -f "(dev >=3D 0x800000 && dev <=3D 0x80000f) || =
dev =3D=3D 0" -e scsi -p function -l "*_gendisk" -l blk_alloc_queue -l bl=
k_register_queue -l blk_unregister_queue -l blk_cleanup_queue
echo "blk_alloc_queue_node:stacktrace" >> /sys/kernel/debug/tracing/set_f=
trace_filter
echo "scsi_execute:stacktrace" >> /sys/kernel/debug/tracing/set_ftrace_fi=
lter
echo "blk_register_queue:stacktrace" >> /sys/kernel/debug/tracing/set_ftr=
ace_filter
echo "blk_unregister_queue:stacktrace" >> /sys/kernel/debug/tracing/set_f=
trace_filter
echo "blk_cleanup_queue:stacktrace" >> /sys/kernel/debug/tracing/set_ftra=
ce_filter

Experiment:
modprobe scsi_debug num_parts=3D1
sleep 3
echo "HOTUNPLUG" > /sys/kernel/debug/tracing/trace_marker
echo 1 > /sys/class/scsi_device/0:0:0:0/device/delete
echo 0 > /sys/kernel/debug/tracing/tracing_on

Also tried a similar sequence but hotplugged & hotunplugged
a DASD disk (non-SCSI) incl. mq for comparison.

The object life cycle seems to be:

(1) blk_alloc_queue() also creates gendisk kobj
(1a) SCSI does LUN probing I/O
     most of that is blktrace RWBS field value 'N' i.e. non-R/W with dev_=
t=3D=3D0
      since q->kobj.parent is still NULL we cannot get gendisk and thus d=
ev_t
     near the end is the first regular read block I/O with dev_t!=3D0
      as bio!=3DNULL and bi_disk or rq_disk are non-NULL
(2) blk_register_queue() also creates queue kobj with gendisk parent
    now we can follow q->kobj.parent to gendisk to get dev_t,
    e.g. for RWBS 'N' such as implicit SCSI test unit ready on blk dev cl=
ose
(3) optionally "regular" I/O
(4) blk_unregister_queue() called by del_gendisk()
    does kobject_del(&q->kobj) which NULLifies q->kobj.parent again
    the last put of gendisk reference releases gendisk kobj
(5) blk_cleanup_queue()
    the last put of queue (/mq) reference releases queue (/mq) kobj

The patch set would have made dev_t meaningful between and incl. (2) and =
(3)
for (un)plug and even for other block events with RWBS 'N'.
I noticed that I missed one more spot in block_rq_complete [PATCH 3/3],
which explains why I had missed some rq_completions while filtering
block trace events for specific (non-zero) dev_t.
It still wouldn't have been perfect as (1a) events still are dev_t=3D=3D0=

even though the gendisk exists since (1).

I tried the same but increased scsi_debug/delay to a huge value
and made one I/O request pending with "dd" before the hotunplug
to see how flushing would work. For SCSI, at least,
all flushing including a full (and unsuccessful due to eh TUR tmo)
scsi_eh escalation for several minutes happened between
'block' kobj release during (4) and gendisk release during (5),
i.e. before the gendisk was gone but q->kobj.parent was already NULL.
The 'queue' kobj was released last finally.

Since I cannot prove it's always like this, I followed Bart's suggestion
and added another refcount get at (2) blk_register_queue().
However, when I want to put that additional refcount at (5) blk_cleanup_q=
ueue(),
we only get a queue pointer argument as context
and q->kobj.parent=3D=3DNULL since (4),
so I don't know how to get to the gendisk object for the refcount put.
Probably the additional refcount would not help either as (4) has the
unconditional kobject_del(&q->kobj) NULLifying q->kobj.parent anyway...
I guess I came full circle.



Side notes:
The existing code holds a reference on gendisk, but between (2) and (4).

Apparently block trace events can easily observe parts such as (1a)
which are difficult for traditional blktrace to be started via blk dev io=
ctl
(or for ftrace 'blk' tracer which can be enabled between (2) and (4)).



Independently, I started to wonder if there is a refcount leak in the ear=
ly error out if kobject_add() failed:

> int blk_register_queue(struct gendisk *disk)
>         struct device *dev =3D disk_to_dev(disk);
>         struct request_queue *q =3D disk->queue;
>         ret =3D kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "q=
ueue");
>                                     ^^^ref+1
> 	if (ret < 0) {
> 		blk_trace_remove_sysfs(dev);

Above kobject_get already did its side-effect to provide the second argum=
ent when calling kobject_add. Even if kobject_add failed, we still have t=
he refcount on the gendisk but don't put it?

> 		goto unlock;
> 	}

> 	if (q->request_fn || (q->mq_ops && q->elevator)) {
> 		ret =3D elv_register_queue(q);
> 		if (ret) {
> 			mutex_unlock(&q->sysfs_lock);
> 			kobject_uevent(&q->kobj, KOBJ_REMOVE);
> 			kobject_del(&q->kobj);
> 			blk_trace_remove_sysfs(dev);
> 			kobject_put(&dev->kobj);

In contrast, this early return seems to do the pairwise put.

> 			return ret;
> 		}
> 	}

> unlock:
> 	mutex_unlock(&q->sysfs_lock);
> 	return ret;
> }



--=20
Mit freundlichen Gr=C3=BC=C3=9Fen / Kind regards
Steffen Maier

Linux on z Systems Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
@ 2018-04-24 14:49             ` Steffen Maier
  0 siblings, 0 replies; 16+ messages in thread
From: Steffen Maier @ 2018-04-24 14:49 UTC (permalink / raw)
  To: Bart Van Assche, osandov
  Cc: lizefan, linux-kernel, linux-block, rostedt, axboe, mingo, gregkh


On 04/19/2018 10:56 PM, Bart Van Assche wrote:
> On Thu, 2018-04-19 at 12:24 -0700, Omar Sandoval wrote:
>> We quiesce and freeze the queue before tearing it down, so it won't be
>> NULL while we're still doing I/O. Cc'ing Bart in case I'm lying to you,
>> though ;)
> 
> blk_cleanup_queue() waits until all I/O requests have finished. Since the
> block layer tracing code is triggered from inside the code that processes a
> request it is safe to access the request pointer from inside the tracing code.
> But I think the question was about the parent of the request queue kobj
> instead of about the request queue pointer, the device structure that is
> embedded in struct gendisk? I think that parent can disappear at any time.
> Most block drivers call del_gendisk() before they call blk_cleanup_queue().
> Unless I'm overlooking something I think the request queue will need to
> obtain a reference to the gendisk device from inside blk_register_queue()
> and drop that reference from inside blk_cleanup_queue() to make it safefor
> the tracing code to access struct gendisk.

Got it, thanks, makes sense.

I did some research, tried the refcount approach, and failed.

Experiment Preparation:
Stopped any udevd and lvm services to avoid additional I/O.
Used the following debug/tracing setup:
modprobe sd_mod
echo -n 'module core +pt; module kobject +pt' > /sys/kernel/debug/dynamic_debug/control
trace-cmd start -e block -f "(dev >= 0x800000 && dev <= 0x80000f) || dev == 0" -e scsi -p function -l "*_gendisk" -l blk_alloc_queue -l blk_register_queue -l blk_unregister_queue -l blk_cleanup_queue
echo "blk_alloc_queue_node:stacktrace" >> /sys/kernel/debug/tracing/set_ftrace_filter
echo "scsi_execute:stacktrace" >> /sys/kernel/debug/tracing/set_ftrace_filter
echo "blk_register_queue:stacktrace" >> /sys/kernel/debug/tracing/set_ftrace_filter
echo "blk_unregister_queue:stacktrace" >> /sys/kernel/debug/tracing/set_ftrace_filter
echo "blk_cleanup_queue:stacktrace" >> /sys/kernel/debug/tracing/set_ftrace_filter

Experiment:
modprobe scsi_debug num_parts=1
sleep 3
echo "HOTUNPLUG" > /sys/kernel/debug/tracing/trace_marker
echo 1 > /sys/class/scsi_device/0:0:0:0/device/delete
echo 0 > /sys/kernel/debug/tracing/tracing_on

Also tried a similar sequence but hotplugged & hotunplugged
a DASD disk (non-SCSI) incl. mq for comparison.

The object life cycle seems to be:

(1) blk_alloc_queue() also creates gendisk kobj
(1a) SCSI does LUN probing I/O
     most of that is blktrace RWBS field value 'N' i.e. non-R/W with dev_t==0
      since q->kobj.parent is still NULL we cannot get gendisk and thus dev_t
     near the end is the first regular read block I/O with dev_t!=0
      as bio!=NULL and bi_disk or rq_disk are non-NULL
(2) blk_register_queue() also creates queue kobj with gendisk parent
    now we can follow q->kobj.parent to gendisk to get dev_t,
    e.g. for RWBS 'N' such as implicit SCSI test unit ready on blk dev close
(3) optionally "regular" I/O
(4) blk_unregister_queue() called by del_gendisk()
    does kobject_del(&q->kobj) which NULLifies q->kobj.parent again
    the last put of gendisk reference releases gendisk kobj
(5) blk_cleanup_queue()
    the last put of queue (/mq) reference releases queue (/mq) kobj

The patch set would have made dev_t meaningful between and incl. (2) and (3)
for (un)plug and even for other block events with RWBS 'N'.
I noticed that I missed one more spot in block_rq_complete [PATCH 3/3],
which explains why I had missed some rq_completions while filtering
block trace events for specific (non-zero) dev_t.
It still wouldn't have been perfect as (1a) events still are dev_t==0
even though the gendisk exists since (1).

I tried the same but increased scsi_debug/delay to a huge value
and made one I/O request pending with "dd" before the hotunplug
to see how flushing would work. For SCSI, at least,
all flushing including a full (and unsuccessful due to eh TUR tmo)
scsi_eh escalation for several minutes happened between
'block' kobj release during (4) and gendisk release during (5),
i.e. before the gendisk was gone but q->kobj.parent was already NULL.
The 'queue' kobj was released last finally.

Since I cannot prove it's always like this, I followed Bart's suggestion
and added another refcount get at (2) blk_register_queue().
However, when I want to put that additional refcount at (5) blk_cleanup_queue(),
we only get a queue pointer argument as context
and q->kobj.parent==NULL since (4),
so I don't know how to get to the gendisk object for the refcount put.
Probably the additional refcount would not help either as (4) has the
unconditional kobject_del(&q->kobj) NULLifying q->kobj.parent anyway...
I guess I came full circle.



Side notes:
The existing code holds a reference on gendisk, but between (2) and (4).

Apparently block trace events can easily observe parts such as (1a)
which are difficult for traditional blktrace to be started via blk dev ioctl
(or for ftrace 'blk' tracer which can be enabled between (2) and (4)).



Independently, I started to wonder if there is a refcount leak in the early error out if kobject_add() failed:

> int blk_register_queue(struct gendisk *disk)
>         struct device *dev = disk_to_dev(disk);
>         struct request_queue *q = disk->queue;
>         ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
>                                     ^^^ref+1
> 	if (ret < 0) {
> 		blk_trace_remove_sysfs(dev);

Above kobject_get already did its side-effect to provide the second argument when calling kobject_add. Even if kobject_add failed, we still have the refcount on the gendisk but don't put it?

> 		goto unlock;
> 	}

> 	if (q->request_fn || (q->mq_ops && q->elevator)) {
> 		ret = elv_register_queue(q);
> 		if (ret) {
> 			mutex_unlock(&q->sysfs_lock);
> 			kobject_uevent(&q->kobj, KOBJ_REMOVE);
> 			kobject_del(&q->kobj);
> 			blk_trace_remove_sysfs(dev);
> 			kobject_put(&dev->kobj);

In contrast, this early return seems to do the pairwise put.

> 			return ret;
> 		}
> 	}

> unlock:
> 	mutex_unlock(&q->sysfs_lock);
> 	return ret;
> }



-- 
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on z Systems Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
  2018-04-24 14:49             ` Steffen Maier
@ 2018-04-27 16:38               ` Bart Van Assche
  -1 siblings, 0 replies; 16+ messages in thread
From: Bart Van Assche @ 2018-04-27 16:38 UTC (permalink / raw)
  To: maier, osandov
  Cc: lizefan, linux-kernel, linux-block, rostedt, mingo, axboe, gregkh

T24gVHVlLCAyMDE4LTA0LTI0IGF0IDE2OjQ5ICswMjAwLCBTdGVmZmVuIE1haWVyIHdyb3RlOg0K
PiBUaGUgb2JqZWN0IGxpZmUgY3ljbGUgc2VlbXMgdG8gYmU6DQo+IA0KPiAoMSkgYmxrX2FsbG9j
X3F1ZXVlKCkgYWxzbyBjcmVhdGVzIGdlbmRpc2sga29iag0KDQpJIHRoaW5rIHRoaXMgaXMgYSBt
aXNpbnRlcnByZXRhdGlvbi4gYmxrX2FsbG9jX3F1ZXVlX25vZGUoKSBpbml0aWFsaXplcyB0aGUN
CnJlcXVlc3QgcXVldWUga29iaiBhcyBmb2xsb3dzOg0KDQoJa29iamVjdF9pbml0KCZxLT5rb2Jq
LCAmYmxrX3F1ZXVlX2t0eXBlKTsNCg0KcmVnaXN0ZXJfZGlzaygpIGNyZWF0ZXMgdGhlIC9zeXMv
Y2xhc3MvYmxvY2svPGRpc2s+IG5vZGUgYW5kIC9zeXMvYmxvY2svPGRpc2s+DQpsaW5rIGFzIGZv
bGxvd3M6DQoNCglpZiAoZGV2aWNlX2FkZChkZGV2KSkNCgkJcmV0dXJuOw0KCWlmICghc3lzZnNf
ZGVwcmVjYXRlZCkgew0KCQllcnIgPSBzeXNmc19jcmVhdGVfbGluayhibG9ja19kZXByLCAmZGRl
di0+a29iaiwNCgkJCQkJa29iamVjdF9uYW1lKCZkZGV2LT5rb2JqKSk7DQoJCWlmIChlcnIpIHsN
CgkJCWRldmljZV9kZWwoZGRldik7DQoJCQlyZXR1cm47DQoJCX0NCgl9DQoNCj4gKDFhKSBTQ1NJ
IGRvZXMgTFVOIHByb2JpbmcgSS9PDQo+ICAgICAgbW9zdCBvZiB0aGF0IGlzIGJsa3RyYWNlIFJX
QlMgZmllbGQgdmFsdWUgJ04nIGkuZS4gbm9uLVIvVyB3aXRoIGRldl90PT0wDQo+ICAgICAgIHNp
bmNlIHEtPmtvYmoucGFyZW50IGlzIHN0aWxsIE5VTEwgd2UgY2Fubm90IGdldCBnZW5kaXNrIGFu
ZCB0aHVzIGRldl90DQo+ICAgICAgbmVhciB0aGUgZW5kIGlzIHRoZSBmaXJzdCByZWd1bGFyIHJl
YWQgYmxvY2sgSS9PIHdpdGggZGV2X3QhPTANCj4gICAgICAgYXMgYmlvIT1OVUxMIGFuZCBiaV9k
aXNrIG9yIHJxX2Rpc2sgYXJlIG5vbi1OVUxMDQo+ICgyKSBibGtfcmVnaXN0ZXJfcXVldWUoKSBh
bHNvIGNyZWF0ZXMgcXVldWUga29iaiB3aXRoIGdlbmRpc2sgcGFyZW50DQo+ICAgICBub3cgd2Ug
Y2FuIGZvbGxvdyBxLT5rb2JqLnBhcmVudCB0byBnZW5kaXNrIHRvIGdldCBkZXZfdCwNCj4gICAg
IGUuZy4gZm9yIFJXQlMgJ04nIHN1Y2ggYXMgaW1wbGljaXQgU0NTSSB0ZXN0IHVuaXQgcmVhZHkg
b24gYmxrIGRldiBjbG9zZQ0KDQpQbGVhc2UgaGF2ZSBhIGxvb2sgYXQgdGhlIGRldmljZV9hZGRf
ZGlzaygpIGNhbGwgaW4gc2RfcHJvYmVfYXN5bmMoKS4gSSB0aGluaw0KdGhhdCBmdW5jdGlvbiB0
cmlnZ2VycyBhIGNhbGwgb2YgYmxrX3JlZ2lzdGVyX3F1ZXVlKCkuIFRoYXQgbGFzdCBmdW5jdGlv
bg0KYXNzb2NpYXRlcyB0aGUgcmVxdWVzdCBxdWV1ZSB3aXRoIHRoZSBnZW5kaXNrIGFzIGZvbGxv
d3M6DQoNCglyZXQgPSBrb2JqZWN0X2FkZCgmcS0+a29iaiwga29iamVjdF9nZXQoJmRldi0+a29i
aiksICIlcyIsICJxdWV1ZSIpOw0KDQo+ICgzKSBvcHRpb25hbGx5ICJyZWd1bGFyIiBJL08NCj4g
KDQpIGJsa191bnJlZ2lzdGVyX3F1ZXVlKCkgY2FsbGVkIGJ5IGRlbF9nZW5kaXNrKCkNCj4gICAg
IGRvZXMga29iamVjdF9kZWwoJnEtPmtvYmopIHdoaWNoIE5VTExpZmllcyBxLT5rb2JqLnBhcmVu
dCBhZ2Fpbg0KPiAgICAgdGhlIGxhc3QgcHV0IG9mIGdlbmRpc2sgcmVmZXJlbmNlIHJlbGVhc2Vz
IGdlbmRpc2sga29iag0KPiAoNSkgYmxrX2NsZWFudXBfcXVldWUoKQ0KPiAgICAgdGhlIGxhc3Qg
cHV0IG9mIHF1ZXVlICgvbXEpIHJlZmVyZW5jZSByZWxlYXNlcyBxdWV1ZSAoL21xKSBrb2JqDQoN
ClRoZSAvbXEgZGlyZWN0b3J5IGNvcnJlc3BvbmRzIHRvIHEtPm1xX2tvYmouIFRoZSBibG9jayBs
YXllciBjb3JlIGRyb3BzIGl0cw0KcmVmZXJlbmNlIHRvIHRoZSBxdWV1ZSBrb2JqZWN0IChxLT5r
b2JqKSBieSBjYWxsaW5nIGJsa19wdXRfcXVldWUoKSBhdCB0aGUNCmVuZCBvZiBibGtfY2xlYW51
cF9xdWV1ZSgpLg0KDQo+IFNpbmNlIEkgY2Fubm90IHByb3ZlIGl0J3MgYWx3YXlzIGxpa2UgdGhp
cywgSSBmb2xsb3dlZCBCYXJ0J3Mgc3VnZ2VzdGlvbg0KPiBhbmQgYWRkZWQgYW5vdGhlciByZWZj
b3VudCBnZXQgYXQgKDIpIGJsa19yZWdpc3Rlcl9xdWV1ZSgpLg0KPiBIb3dldmVyLCB3aGVuIEkg
d2FudCB0byBwdXQgdGhhdCBhZGRpdGlvbmFsIHJlZmNvdW50IGF0ICg1KSBibGtfY2xlYW51cF9x
dWV1ZSgpLA0KPiB3ZSBvbmx5IGdldCBhIHF1ZXVlIHBvaW50ZXIgYXJndW1lbnQgYXMgY29udGV4
dA0KPiBhbmQgcS0+a29iai5wYXJlbnQ9PU5VTEwgc2luY2UgKDQpLA0KPiBzbyBJIGRvbid0IGtu
b3cgaG93IHRvIGdldCB0byB0aGUgZ2VuZGlzayBvYmplY3QgZm9yIHRoZSByZWZjb3VudCBwdXQu
DQoNCkhhdmUgeW91IHRyaWVkIHRvIG1vZGlmeSBibGtfcmVnaXN0ZXJfcXVldWUoKSBzdWNoIHRo
YXQgaXQgc3RvcmVzIHRoZSBkaXNrDQpwb2ludGVyIGluIGEgbmV3IHN0cnVjdCByZXF1ZXN0X3F1
ZXVlIG1lbWJlcj8NCg0KVGhhbmtzLA0KDQpCYXJ0Lg0KDQoNCg0KDQoNCg0K

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
@ 2018-04-27 16:38               ` Bart Van Assche
  0 siblings, 0 replies; 16+ messages in thread
From: Bart Van Assche @ 2018-04-27 16:38 UTC (permalink / raw)
  To: maier, osandov
  Cc: lizefan, linux-kernel, linux-block, rostedt, mingo, axboe, gregkh

On Tue, 2018-04-24 at 16:49 +0200, Steffen Maier wrote:
> The object life cycle seems to be:
> 
> (1) blk_alloc_queue() also creates gendisk kobj

I think this is a misinterpretation. blk_alloc_queue_node() initializes the
request queue kobj as follows:

	kobject_init(&q->kobj, &blk_queue_ktype);

register_disk() creates the /sys/class/block/<disk> node and /sys/block/<disk>
link as follows:

	if (device_add(ddev))
		return;
	if (!sysfs_deprecated) {
		err = sysfs_create_link(block_depr, &ddev->kobj,
					kobject_name(&ddev->kobj));
		if (err) {
			device_del(ddev);
			return;
		}
	}

> (1a) SCSI does LUN probing I/O
>      most of that is blktrace RWBS field value 'N' i.e. non-R/W with dev_t==0
>       since q->kobj.parent is still NULL we cannot get gendisk and thus dev_t
>      near the end is the first regular read block I/O with dev_t!=0
>       as bio!=NULL and bi_disk or rq_disk are non-NULL
> (2) blk_register_queue() also creates queue kobj with gendisk parent
>     now we can follow q->kobj.parent to gendisk to get dev_t,
>     e.g. for RWBS 'N' such as implicit SCSI test unit ready on blk dev close

Please have a look at the device_add_disk() call in sd_probe_async(). I think
that function triggers a call of blk_register_queue(). That last function
associates the request queue with the gendisk as follows:

	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");

> (3) optionally "regular" I/O
> (4) blk_unregister_queue() called by del_gendisk()
>     does kobject_del(&q->kobj) which NULLifies q->kobj.parent again
>     the last put of gendisk reference releases gendisk kobj
> (5) blk_cleanup_queue()
>     the last put of queue (/mq) reference releases queue (/mq) kobj

The /mq directory corresponds to q->mq_kobj. The block layer core drops its
reference to the queue kobject (q->kobj) by calling blk_put_queue() at the
end of blk_cleanup_queue().

> Since I cannot prove it's always like this, I followed Bart's suggestion
> and added another refcount get at (2) blk_register_queue().
> However, when I want to put that additional refcount at (5) blk_cleanup_queue(),
> we only get a queue pointer argument as context
> and q->kobj.parent==NULL since (4),
> so I don't know how to get to the gendisk object for the refcount put.

Have you tried to modify blk_register_queue() such that it stores the disk
pointer in a new struct request_queue member?

Thanks,

Bart.







^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-04-27 16:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-13 13:07 [PATCH 0/2] tracing/events: block: bring more on a par with blktrace Steffen Maier
2018-04-13 13:07 ` [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule Steffen Maier
2018-04-13 14:16   ` Steven Rostedt
2018-04-13 13:07 ` [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events Steffen Maier
2018-04-15  8:31   ` Greg Kroah-Hartman
2018-04-16 16:33     ` Steffen Maier
2018-04-16 16:33       ` Steffen Maier
2018-04-19 19:24       ` Omar Sandoval
2018-04-19 19:24         ` Omar Sandoval
2018-04-19 20:56         ` Bart Van Assche
2018-04-19 20:56           ` Bart Van Assche
2018-04-24 14:49           ` Steffen Maier
2018-04-24 14:49             ` Steffen Maier
2018-04-27 16:38             ` Bart Van Assche
2018-04-27 16:38               ` Bart Van Assche
2018-04-13 13:07 ` [RFC] tracing/events: block: also try to get dev_t via driver core for some events Steffen Maier

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.