linux-block.vger.kernel.org archive mirror
* [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization
@ 2020-03-24  6:39 Konstantin Khlebnikov
  2020-03-24  6:39 ` [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks Konstantin Khlebnikov
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-24  6:39 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-kernel
  Cc: Mikulas Patocka, Mike Snitzer, Ming Lei

The simplified estimation for io_ticks introduced in patch
https://lore.kernel.org/linux-block/20181206164122.2166-5-snitzer@redhat.com/
can be very inaccurate for requests longer than a jiffy (i.e. on any HDD).

There is at least one other report about this:
https://lore.kernel.org/linux-block/20200324031942.GA3060@ming.t460p/
See details in the comment for the first patch.

v1: https://lore.kernel.org/lkml/155413438394.3201.15211440151043943989.stgit@buzz/
v2: https://lore.kernel.org/lkml/158314549775.1788.6529015932237292177.stgit@buzz/
v3:
 * update documentation
 * rebase to current linux-next
 * fix compilation for CONFIG_SMP=n

---

Konstantin Khlebnikov (3):
      block/diskstats: more accurate approximation of io_ticks for slow disks
      block/diskstats: accumulate all per-cpu counters in one pass
      block/diskstats: replace time_in_queue with sum of request times


 Documentation/admin-guide/iostats.rst |    5 ++-
 block/bio.c                           |    9 ++---
 block/blk-core.c                      |    5 +--
 block/genhd.c                         |   64 +++++++++++++++++++++++++--------
 block/partition-generic.c             |   39 ++++++++++++--------
 include/linux/genhd.h                 |   14 +++++--
 6 files changed, 90 insertions(+), 46 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-24  6:39 [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Konstantin Khlebnikov
@ 2020-03-24  6:39 ` Konstantin Khlebnikov
  2020-03-24 14:06   ` Ming Lei
  2020-03-24  6:39 ` [PATCH v3 2/3] block/diskstats: accumulate all per-cpu counters in one pass Konstantin Khlebnikov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-24  6:39 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-kernel
  Cc: Mikulas Patocka, Mike Snitzer, Ming Lei

Currently io_ticks is approximated by adding one at each start and end of
a request if the jiffies counter has changed. This works perfectly for
requests shorter than a jiffy, or if one of the requests starts/ends in
each jiffy.

If the disk executes just one request at a time and each request is longer
than two jiffies, then only the first and last jiffies will be accounted.

The fix is simple: at the end of a request, add to io_ticks the jiffies
passed since the last update rather than just one jiffy.

Example: a common HDD serves random 4k read requests in around 12ms each.

fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
iostat -x 10 sdb

Note the change of iostat's "%util" from 8,43% to 99,99% before/after the patch:

Before:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43

After:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99

For load estimation "%util" is not as useful as the average queue length,
but it clearly shows how often the disk queue is completely empty.

Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 Documentation/admin-guide/iostats.rst |    5 ++++-
 block/bio.c                           |    8 ++++----
 block/blk-core.c                      |    4 ++--
 include/linux/genhd.h                 |    2 +-
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
index df5b8345c41d..9b14b0c2c9c4 100644
--- a/Documentation/admin-guide/iostats.rst
+++ b/Documentation/admin-guide/iostats.rst
@@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
 
     Since 5.0 this field counts jiffies when at least one request was
     started or completed. If request runs more than 2 jiffies then some
-    I/O time will not be accounted unless there are other requests.
+    I/O time might be not accounted in case of concurrent requests.
 
 Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
     This field is incremented at each I/O start, I/O completion, I/O
@@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are
 summed to) and the result given to the user.  There is no convenient
 user interface for accessing the per-CPU counters themselves.
 
+Since 4.19 request times are measured with nanoseconds precision and
+truncated to milliseconds before showing in this interface.
+
 Disks vs Partitions
 -------------------
 
diff --git a/block/bio.c b/block/bio.c
index 0985f3422556..b1053eb7af37 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1762,14 +1762,14 @@ void bio_check_pages_dirty(struct bio *bio)
 	schedule_work(&bio_dirty_work);
 }
 
-void update_io_ticks(struct hd_struct *part, unsigned long now)
+void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
 {
 	unsigned long stamp;
 again:
 	stamp = READ_ONCE(part->stamp);
 	if (unlikely(stamp != now)) {
 		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
-			__part_stat_add(part, io_ticks, 1);
+			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
 		}
 	}
 	if (part->partno) {
@@ -1785,7 +1785,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
 
 	part_stat_lock();
 
-	update_io_ticks(part, jiffies);
+	update_io_ticks(part, jiffies, false);
 	part_stat_inc(part, ios[sgrp]);
 	part_stat_add(part, sectors[sgrp], sectors);
 	part_inc_in_flight(q, part, op_is_write(op));
@@ -1803,7 +1803,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
 
 	part_stat_lock();
 
-	update_io_ticks(part, now);
+	update_io_ticks(part, now, true);
 	part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
 	part_stat_add(part, time_in_queue, duration);
 	part_dec_in_flight(q, part, op_is_write(req_op));
diff --git a/block/blk-core.c b/block/blk-core.c
index abfdcf81a228..4401b30a1751 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1337,7 +1337,7 @@ void blk_account_io_done(struct request *req, u64 now)
 		part_stat_lock();
 		part = req->part;
 
-		update_io_ticks(part, jiffies);
+		update_io_ticks(part, jiffies, true);
 		part_stat_inc(part, ios[sgrp]);
 		part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
 		part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
@@ -1379,7 +1379,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
 		rq->part = part;
 	}
 
-	update_io_ticks(part, jiffies);
+	update_io_ticks(part, jiffies, false);
 
 	part_stat_unlock();
 }
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index d5c75df64bba..f1066f10b062 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -467,7 +467,7 @@ static inline void free_part_info(struct hd_struct *part)
 	kfree(part->info);
 }
 
-void update_io_ticks(struct hd_struct *part, unsigned long now);
+void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
 
 /* block/genhd.c */
 extern void device_add_disk(struct device *parent, struct gendisk *disk,



* [PATCH v3 2/3] block/diskstats: accumulate all per-cpu counters in one pass
  2020-03-24  6:39 [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Konstantin Khlebnikov
  2020-03-24  6:39 ` [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks Konstantin Khlebnikov
@ 2020-03-24  6:39 ` Konstantin Khlebnikov
  2020-03-24  6:39 ` [PATCH v3 3/3] block/diskstats: replace time_in_queue with sum of request times Konstantin Khlebnikov
  2020-03-24 16:06 ` [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Jens Axboe
  3 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-24  6:39 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-kernel
  Cc: Mikulas Patocka, Mike Snitzer, Ming Lei

Reading /proc/diskstats iterates over all CPUs to sum each field
separately. It is faster to sum all fields in one pass.

Hammering /proc/diskstats with fio shows 2x performance improvement:

fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
    --size=1k --bs=1k --fallocate=none --create_on_open=1 \
    --time_based=1 --runtime=10 --invalidate=0 --group_reporting

	  JOBS=1	JOBS=10
Before:	  7k iops	64k iops
After:	 18k iops      120k iops

The code also becomes more compact this way:

add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
Function                                     old     new   delta
part_stat_read_all                             -     194    +194
diskstats_show                              1344     631    -713
part_stat_show                              1219     392    -827
Total: Before=14966947, After=14965601, chg -0.01%

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 block/genhd.c             |   61 +++++++++++++++++++++++++++++++++------------
 block/partition-generic.c |   35 ++++++++++++++------------
 include/linux/genhd.h     |   11 ++++++--
 3 files changed, 72 insertions(+), 35 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index b210c12c4870..606e8755f6ed 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -69,6 +69,28 @@ void set_capacity_revalidate_and_notify(struct gendisk *disk, sector_t size,
 
 EXPORT_SYMBOL_GPL(set_capacity_revalidate_and_notify);
 
+#ifdef CONFIG_SMP
+void part_stat_read_all(struct hd_struct *part, struct disk_stats *stat)
+{
+	int cpu;
+
+	memset(stat, 0, sizeof(struct disk_stats));
+	for_each_possible_cpu(cpu) {
+		struct disk_stats *ptr = per_cpu_ptr(part->dkstats, cpu);
+		int group;
+
+		for (group = 0; group < NR_STAT_GROUPS; group++) {
+			stat->nsecs[group] += ptr->nsecs[group];
+			stat->sectors[group] += ptr->sectors[group];
+			stat->ios[group] += ptr->ios[group];
+			stat->merges[group] += ptr->merges[group];
+		}
+
+		stat->io_ticks += ptr->io_ticks;
+		stat->time_in_queue += ptr->time_in_queue;
+	}
+}
+#endif /* CONFIG_SMP */
 
 void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
 {
@@ -1429,6 +1451,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 	struct hd_struct *hd;
 	char buf[BDEVNAME_SIZE];
 	unsigned int inflight;
+	struct disk_stats stat;
 
 	/*
 	if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
@@ -1440,7 +1463,9 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 
 	disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
 	while ((hd = disk_part_iter_next(&piter))) {
+		part_stat_read_all(hd, &stat);
 		inflight = part_in_flight(gp->queue, hd);
+
 		seq_printf(seqf, "%4d %7d %s "
 			   "%lu %lu %lu %u "
 			   "%lu %lu %lu %u "
@@ -1450,23 +1475,27 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 			   "\n",
 			   MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
 			   disk_name(gp, hd->partno, buf),
-			   part_stat_read(hd, ios[STAT_READ]),
-			   part_stat_read(hd, merges[STAT_READ]),
-			   part_stat_read(hd, sectors[STAT_READ]),
-			   (unsigned int)part_stat_read_msecs(hd, STAT_READ),
-			   part_stat_read(hd, ios[STAT_WRITE]),
-			   part_stat_read(hd, merges[STAT_WRITE]),
-			   part_stat_read(hd, sectors[STAT_WRITE]),
-			   (unsigned int)part_stat_read_msecs(hd, STAT_WRITE),
+			   stat.ios[STAT_READ],
+			   stat.merges[STAT_READ],
+			   stat.sectors[STAT_READ],
+			   (unsigned int)div_u64(stat.nsecs[STAT_READ],
+							NSEC_PER_MSEC),
+			   stat.ios[STAT_WRITE],
+			   stat.merges[STAT_WRITE],
+			   stat.sectors[STAT_WRITE],
+			   (unsigned int)div_u64(stat.nsecs[STAT_WRITE],
+							NSEC_PER_MSEC),
 			   inflight,
-			   jiffies_to_msecs(part_stat_read(hd, io_ticks)),
-			   jiffies_to_msecs(part_stat_read(hd, time_in_queue)),
-			   part_stat_read(hd, ios[STAT_DISCARD]),
-			   part_stat_read(hd, merges[STAT_DISCARD]),
-			   part_stat_read(hd, sectors[STAT_DISCARD]),
-			   (unsigned int)part_stat_read_msecs(hd, STAT_DISCARD),
-			   part_stat_read(hd, ios[STAT_FLUSH]),
-			   (unsigned int)part_stat_read_msecs(hd, STAT_FLUSH)
+			   jiffies_to_msecs(stat.io_ticks),
+			   jiffies_to_msecs(stat.time_in_queue),
+			   stat.ios[STAT_DISCARD],
+			   stat.merges[STAT_DISCARD],
+			   stat.sectors[STAT_DISCARD],
+			   (unsigned int)div_u64(stat.nsecs[STAT_DISCARD],
+						 NSEC_PER_MSEC),
+			   stat.ios[STAT_FLUSH],
+			   (unsigned int)div_u64(stat.nsecs[STAT_FLUSH],
+						 NSEC_PER_MSEC)
 			);
 	}
 	disk_part_iter_exit(&piter);
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 564fae77711d..6d3fcb5187cb 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -120,9 +120,12 @@ ssize_t part_stat_show(struct device *dev,
 {
 	struct hd_struct *p = dev_to_part(dev);
 	struct request_queue *q = part_to_disk(p)->queue;
+	struct disk_stats stat;
 	unsigned int inflight;
 
+	part_stat_read_all(p, &stat);
 	inflight = part_in_flight(q, p);
+
 	return sprintf(buf,
 		"%8lu %8lu %8llu %8u "
 		"%8lu %8lu %8llu %8u "
@@ -130,23 +133,23 @@ ssize_t part_stat_show(struct device *dev,
 		"%8lu %8lu %8llu %8u "
 		"%8lu %8u"
 		"\n",
-		part_stat_read(p, ios[STAT_READ]),
-		part_stat_read(p, merges[STAT_READ]),
-		(unsigned long long)part_stat_read(p, sectors[STAT_READ]),
-		(unsigned int)part_stat_read_msecs(p, STAT_READ),
-		part_stat_read(p, ios[STAT_WRITE]),
-		part_stat_read(p, merges[STAT_WRITE]),
-		(unsigned long long)part_stat_read(p, sectors[STAT_WRITE]),
-		(unsigned int)part_stat_read_msecs(p, STAT_WRITE),
+		stat.ios[STAT_READ],
+		stat.merges[STAT_READ],
+		(unsigned long long)stat.sectors[STAT_READ],
+		(unsigned int)div_u64(stat.nsecs[STAT_READ], NSEC_PER_MSEC),
+		stat.ios[STAT_WRITE],
+		stat.merges[STAT_WRITE],
+		(unsigned long long)stat.sectors[STAT_WRITE],
+		(unsigned int)div_u64(stat.nsecs[STAT_WRITE], NSEC_PER_MSEC),
 		inflight,
-		jiffies_to_msecs(part_stat_read(p, io_ticks)),
-		jiffies_to_msecs(part_stat_read(p, time_in_queue)),
-		part_stat_read(p, ios[STAT_DISCARD]),
-		part_stat_read(p, merges[STAT_DISCARD]),
-		(unsigned long long)part_stat_read(p, sectors[STAT_DISCARD]),
-		(unsigned int)part_stat_read_msecs(p, STAT_DISCARD),
-		part_stat_read(p, ios[STAT_FLUSH]),
-		(unsigned int)part_stat_read_msecs(p, STAT_FLUSH));
+		jiffies_to_msecs(stat.io_ticks),
+		jiffies_to_msecs(stat.time_in_queue),
+		stat.ios[STAT_DISCARD],
+		stat.merges[STAT_DISCARD],
+		(unsigned long long)stat.sectors[STAT_DISCARD],
+		(unsigned int)div_u64(stat.nsecs[STAT_DISCARD], NSEC_PER_MSEC),
+		stat.ios[STAT_FLUSH],
+		(unsigned int)div_u64(stat.nsecs[STAT_FLUSH], NSEC_PER_MSEC));
 }
 
 ssize_t part_inflight_show(struct device *dev, struct device_attribute *attr,
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index f1066f10b062..ead3ffb7f327 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -375,6 +375,8 @@ static inline void part_stat_set_all(struct hd_struct *part, int value)
 				sizeof(struct disk_stats));
 }
 
+void part_stat_read_all(struct hd_struct *part, struct disk_stats *stat);
+
 static inline int init_part_stats(struct hd_struct *part)
 {
 	part->dkstats = alloc_percpu(struct disk_stats);
@@ -401,6 +403,12 @@ static inline void part_stat_set_all(struct hd_struct *part, int value)
 	memset(&part->dkstats, value, sizeof(struct disk_stats));
 }
 
+static inline void part_stat_read_all(struct hd_struct *part,
+				      struct disk_stats *stat)
+{
+	memcpy(stat, &part->dkstats, sizeof(struct disk_stats));
+}
+
 static inline int init_part_stats(struct hd_struct *part)
 {
 	return 1;
@@ -412,9 +420,6 @@ static inline void free_part_stats(struct hd_struct *part)
 
 #endif /* CONFIG_SMP */
 
-#define part_stat_read_msecs(part, which)				\
-	div_u64(part_stat_read(part, nsecs[which]), NSEC_PER_MSEC)
-
 #define part_stat_read_accum(part, field)				\
 	(part_stat_read(part, field[STAT_READ]) +			\
 	 part_stat_read(part, field[STAT_WRITE]) +			\



* [PATCH v3 3/3] block/diskstats: replace time_in_queue with sum of request times
  2020-03-24  6:39 [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Konstantin Khlebnikov
  2020-03-24  6:39 ` [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks Konstantin Khlebnikov
  2020-03-24  6:39 ` [PATCH v3 2/3] block/diskstats: accumulate all per-cpu counters in one pass Konstantin Khlebnikov
@ 2020-03-24  6:39 ` Konstantin Khlebnikov
  2020-03-24 16:06 ` [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Jens Axboe
  3 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-24  6:39 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-kernel
  Cc: Mikulas Patocka, Mike Snitzer, Ming Lei

The "time_in_queue" column in diskstats is supposed to show the total
waiting time of all requests, i.e. its value should equal the sum of the
times in the other columns. But this is not true, because "time_in_queue"
is counted separately, in jiffies rather than in nanoseconds like the
other times.

This patch removes the redundant counter and instead shows the total
time of read, write, discard and flush requests.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 block/bio.c               |    1 -
 block/blk-core.c          |    1 -
 block/genhd.c             |    7 +++++--
 block/partition-generic.c |    6 +++++-
 include/linux/genhd.h     |    1 -
 5 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b1053eb7af37..a353d5cdf589 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1805,7 +1805,6 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
 
 	update_io_ticks(part, now, true);
 	part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
-	part_stat_add(part, time_in_queue, duration);
 	part_dec_in_flight(q, part, op_is_write(req_op));
 
 	part_stat_unlock();
diff --git a/block/blk-core.c b/block/blk-core.c
index 4401b30a1751..eaf6cb3887e6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1340,7 +1340,6 @@ void blk_account_io_done(struct request *req, u64 now)
 		update_io_ticks(part, jiffies, true);
 		part_stat_inc(part, ios[sgrp]);
 		part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
-		part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
 		part_dec_in_flight(req->q, part, rq_data_dir(req));
 
 		hd_struct_put(part);
diff --git a/block/genhd.c b/block/genhd.c
index 606e8755f6ed..2b49d997ac72 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -87,7 +87,6 @@ void part_stat_read_all(struct hd_struct *part, struct disk_stats *stat)
 		}
 
 		stat->io_ticks += ptr->io_ticks;
-		stat->time_in_queue += ptr->time_in_queue;
 	}
 }
 #endif /* CONFIG_SMP */
@@ -1487,7 +1486,11 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 							NSEC_PER_MSEC),
 			   inflight,
 			   jiffies_to_msecs(stat.io_ticks),
-			   jiffies_to_msecs(stat.time_in_queue),
+			   (unsigned int)div_u64(stat.nsecs[STAT_READ] +
+						 stat.nsecs[STAT_WRITE] +
+						 stat.nsecs[STAT_DISCARD] +
+						 stat.nsecs[STAT_FLUSH],
+							NSEC_PER_MSEC),
 			   stat.ios[STAT_DISCARD],
 			   stat.merges[STAT_DISCARD],
 			   stat.sectors[STAT_DISCARD],
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 6d3fcb5187cb..2a9bc78a116f 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -143,7 +143,11 @@ ssize_t part_stat_show(struct device *dev,
 		(unsigned int)div_u64(stat.nsecs[STAT_WRITE], NSEC_PER_MSEC),
 		inflight,
 		jiffies_to_msecs(stat.io_ticks),
-		jiffies_to_msecs(stat.time_in_queue),
+		(unsigned int)div_u64(stat.nsecs[STAT_READ] +
+				      stat.nsecs[STAT_WRITE] +
+				      stat.nsecs[STAT_DISCARD] +
+				      stat.nsecs[STAT_FLUSH],
+						NSEC_PER_MSEC),
 		stat.ios[STAT_DISCARD],
 		stat.merges[STAT_DISCARD],
 		(unsigned long long)stat.sectors[STAT_DISCARD],
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index ead3ffb7f327..019ede22ebd2 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -89,7 +89,6 @@ struct disk_stats {
 	unsigned long ios[NR_STAT_GROUPS];
 	unsigned long merges[NR_STAT_GROUPS];
 	unsigned long io_ticks;
-	unsigned long time_in_queue;
 	local_t in_flight[2];
 };
 



* Re: [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-24  6:39 ` [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks Konstantin Khlebnikov
@ 2020-03-24 14:06   ` Ming Lei
  2020-03-25  3:40     ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2020-03-24 14:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-block, Jens Axboe, linux-kernel, Mikulas Patocka, Mike Snitzer

On Tue, Mar 24, 2020 at 09:39:40AM +0300, Konstantin Khlebnikov wrote:
> Currently io_ticks is approximated by adding one at each start and end of
> requests if jiffies counter has changed. This works perfectly for requests
> shorter than a jiffy or if one of requests starts/ends at each jiffy.
>
> [...]

Looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming



* Re: [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization
  2020-03-24  6:39 [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2020-03-24  6:39 ` [PATCH v3 3/3] block/diskstats: replace time_in_queue with sum of request times Konstantin Khlebnikov
@ 2020-03-24 16:06 ` Jens Axboe
  3 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2020-03-24 16:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-block, linux-kernel
  Cc: Mikulas Patocka, Mike Snitzer, Ming Lei

On 3/24/20 12:39 AM, Konstantin Khlebnikov wrote:
> Simplified estimation for io_ticks introduced in patch
> https://lore.kernel.org/linux-block/20181206164122.2166-5-snitzer@redhat.com/
> could be very inaccurate for request longer than jiffy (i.e. any HDD)
> 
> There is at least one another report about this:
> https://lore.kernel.org/linux-block/20200324031942.GA3060@ming.t460p/
> See detail in comment for first patch.
> 
> v1: https://lore.kernel.org/lkml/155413438394.3201.15211440151043943989.stgit@buzz/
> v2: https://lore.kernel.org/lkml/158314549775.1788.6529015932237292177.stgit@buzz/
> v3:
>  * update documentation
>  * rebase to current linux-next
>  * fix compilation for CONFIG_SMP=n

Can you respin this against my for-5.7/block branch?

-- 
Jens Axboe



* Re: [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-24 14:06   ` Ming Lei
@ 2020-03-25  3:40     ` Ming Lei
  2020-03-25  6:28       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2020-03-25  3:40 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-block, Jens Axboe, linux-kernel, Mikulas Patocka, Mike Snitzer

On Tue, Mar 24, 2020 at 10:06:56PM +0800, Ming Lei wrote:
> On Tue, Mar 24, 2020 at 09:39:40AM +0300, Konstantin Khlebnikov wrote:
> > Currently io_ticks is approximated by adding one at each start and end of
> > requests if jiffies counter has changed. This works perfectly for requests
> > shorter than a jiffy or if one of requests starts/ends at each jiffy.
> >
> > [...]
> >  	part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
> >  	part_stat_add(part, time_in_queue, duration);
> >  	part_dec_in_flight(q, part, op_is_write(req_op));
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index abfdcf81a228..4401b30a1751 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1337,7 +1337,7 @@ void blk_account_io_done(struct request *req, u64 now)
> >  		part_stat_lock();
> >  		part = req->part;
> >  
> > -		update_io_ticks(part, jiffies);
> > +		update_io_ticks(part, jiffies, true);
> >  		part_stat_inc(part, ios[sgrp]);
> >  		part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
> >  		part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
> > @@ -1379,7 +1379,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
> >  		rq->part = part;
> >  	}
> >  
> > -	update_io_ticks(part, jiffies);
> > +	update_io_ticks(part, jiffies, false);
> >  
> >  	part_stat_unlock();
> >  }
> > diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> > index d5c75df64bba..f1066f10b062 100644
> > --- a/include/linux/genhd.h
> > +++ b/include/linux/genhd.h
> > @@ -467,7 +467,7 @@ static inline void free_part_info(struct hd_struct *part)
> >  	kfree(part->info);
> >  }
> >  
> > -void update_io_ticks(struct hd_struct *part, unsigned long now);
> > +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
> >  
> >  /* block/genhd.c */
> >  extern void device_add_disk(struct device *parent, struct gendisk *disk,
> > 
> 
> Looks fine:
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

BTW, there is still some gap (65% vs. 99%) between this fix and the original
accounting (before applying Mike/Mikulas's 7 patches), and it might be
one thing to improve in the future.

1) test: sda is a single-queue virtio-scsi disk, emulated by one HDD
image

2) fio test script:
fio --direct=1 --size=128G --bsrange=4k-4k \
		--runtime=20 --numjobs=1 \
		--ioengine=libaio --iodepth=16 \
		--iodepth_batch_submit=16 \
		--iodepth_batch_complete_min=16 \
		--group_reporting=1 --filename=/dev/sda \
		--name=seq-test --rw=read

3) result:
- v5.6-rc with this patch
Run status group 0 (all jobs):
   READ: bw=79.4MiB/s (83.3MB/s), 79.4MiB/s-79.4MiB/s (83.3MB/s-83.3MB/s), io=1588MiB (1665MB), run=20001-20001msec

Disk stats (read/write):
  sda: ios=25039/0, merge=375596/0, ticks=18823/0, in_queue=4330, util=99.43%


- commit 112f158f66cb (which is previous commit of 5b18b5a73760)
Run status group 0 (all jobs):
   READ: bw=81.4MiB/s (85.3MB/s), 81.4MiB/s-81.4MiB/s (85.3MB/s-85.3MB/s), io=1628MiB (1707MB), run=20001-20001msec

Disk stats (read/write):
  sda: ios=25749/0, merge=386236/0, ticks=17963/0, in_queue=12976, util=65.20%


Thanks,
Ming
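A toy model of the two accounting schemes at jiffy granularity (a sketch of the logic discussed in this thread, not the kernel code; the workload below is hypothetical) shows where the extra ticks can come from:

```python
# Toy model of io_ticks at jiffy granularity. Events are
# (jiffy, kind) pairs with kind "start" or "end".

def io_ticks_new(events):
    """This patch: +1 tick when a start lands in a new jiffy,
    +(now - stamp) ticks at each completion."""
    stamp = ticks = 0
    for now, kind in sorted(events):
        if now != stamp:
            ticks += (now - stamp) if kind == "end" else 1
            stamp = now
    return ticks

def io_ticks_old(events):
    """Pre-5b18b5a73760 model: on each event, count the jiffies since
    the previous event as busy iff a request was in flight."""
    stamp = ticks = inflight = 0
    for now, kind in sorted(events):
        if inflight:
            ticks += now - stamp
        stamp = now
        inflight += 1 if kind == "start" else -1
    return ticks

# One 1-jiffy request every 3 jiffies: the disk is truly busy
# one third of the time and idle the other two thirds.
events = [ev for i in range(10)
          for ev in ((3 * i, "start"), (3 * i + 1, "end"))]

print(io_ticks_old(events))  # 10 ticks over ~30 jiffies -> ~33% util
print(io_ticks_new(events))  # 19 ticks -> ~63% util
```

In this model the old scheme counts the idle gaps as nothing, while the new scheme charges one extra tick for every idle->start transition; once some request starts or completes in most jiffies, that is enough to push %util toward 100%.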


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-25  3:40     ` Ming Lei
@ 2020-03-25  6:28       ` Konstantin Khlebnikov
  2020-03-25  8:02         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-25  6:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Jens Axboe, linux-kernel, Mikulas Patocka, Mike Snitzer



On 25/03/2020 06.40, Ming Lei wrote:
> On Tue, Mar 24, 2020 at 10:06:56PM +0800, Ming Lei wrote:
>> On Tue, Mar 24, 2020 at 09:39:40AM +0300, Konstantin Khlebnikov wrote:
>>> Currently io_ticks is approximated by adding one at each start and end of
>>> requests if jiffies counter has changed. This works perfectly for requests
>>> shorter than a jiffy or if one of requests starts/ends at each jiffy.
>>>
>>> If disk executes just one request at a time and they are longer than two
>>> jiffies then only first and last jiffies will be accounted.
>>>
>>> Fix is simple: at the end of request add up into io_ticks jiffies passed
>>> since last update rather than just one jiffy.
>>>
>>> Example: common HDD executes random read 4k requests around 12ms.
>>>
>>> fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
>>> iostat -x 10 sdb
>>>
>>> Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
>>>
>>> Before:
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
>>>
>>> After:
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
>>>
>>> For load estimation "%util" is not as useful as average queue length,
>>> but it clearly shows how often disk queue is completely empty.
>>>
>>> Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>> ---
>>>   Documentation/admin-guide/iostats.rst |    5 ++++-
>>>   block/bio.c                           |    8 ++++----
>>>   block/blk-core.c                      |    4 ++--
>>>   include/linux/genhd.h                 |    2 +-
>>>   4 files changed, 11 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
>>> index df5b8345c41d..9b14b0c2c9c4 100644
>>> --- a/Documentation/admin-guide/iostats.rst
>>> +++ b/Documentation/admin-guide/iostats.rst
>>> @@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
>>>   
>>>       Since 5.0 this field counts jiffies when at least one request was
>>>       started or completed. If request runs more than 2 jiffies then some
>>> -    I/O time will not be accounted unless there are other requests.
>>> +    I/O time might be not accounted in case of concurrent requests.
>>>   
>>>   Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
>>>       This field is incremented at each I/O start, I/O completion, I/O
>>> @@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are
>>>   summed to) and the result given to the user.  There is no convenient
>>>   user interface for accessing the per-CPU counters themselves.
>>>   
>>> +Since 4.19 request times are measured with nanoseconds precision and
>>> +truncated to milliseconds before showing in this interface.
>>> +
>>>   Disks vs Partitions
>>>   -------------------
>>>   
>>> diff --git a/block/bio.c b/block/bio.c
>>> index 0985f3422556..b1053eb7af37 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -1762,14 +1762,14 @@ void bio_check_pages_dirty(struct bio *bio)
>>>   	schedule_work(&bio_dirty_work);
>>>   }
>>>   
>>> -void update_io_ticks(struct hd_struct *part, unsigned long now)
>>> +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
>>>   {
>>>   	unsigned long stamp;
>>>   again:
>>>   	stamp = READ_ONCE(part->stamp);
>>>   	if (unlikely(stamp != now)) {
>>>   		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
>>> -			__part_stat_add(part, io_ticks, 1);
>>> +			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
>>>   		}
>>>   	}
>>>   	if (part->partno) {
>>> @@ -1785,7 +1785,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
>>>   
>>>   	part_stat_lock();
>>>   
>>> -	update_io_ticks(part, jiffies);
>>> +	update_io_ticks(part, jiffies, false);
>>>   	part_stat_inc(part, ios[sgrp]);
>>>   	part_stat_add(part, sectors[sgrp], sectors);
>>>   	part_inc_in_flight(q, part, op_is_write(op));
>>> @@ -1803,7 +1803,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
>>>   
>>>   	part_stat_lock();
>>>   
>>> -	update_io_ticks(part, now);
>>> +	update_io_ticks(part, now, true);
>>>   	part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
>>>   	part_stat_add(part, time_in_queue, duration);
>>>   	part_dec_in_flight(q, part, op_is_write(req_op));
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index abfdcf81a228..4401b30a1751 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1337,7 +1337,7 @@ void blk_account_io_done(struct request *req, u64 now)
>>>   		part_stat_lock();
>>>   		part = req->part;
>>>   
>>> -		update_io_ticks(part, jiffies);
>>> +		update_io_ticks(part, jiffies, true);
>>>   		part_stat_inc(part, ios[sgrp]);
>>>   		part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
>>>   		part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
>>> @@ -1379,7 +1379,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
>>>   		rq->part = part;
>>>   	}
>>>   
>>> -	update_io_ticks(part, jiffies);
>>> +	update_io_ticks(part, jiffies, false);
>>>   
>>>   	part_stat_unlock();
>>>   }
>>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>>> index d5c75df64bba..f1066f10b062 100644
>>> --- a/include/linux/genhd.h
>>> +++ b/include/linux/genhd.h
>>> @@ -467,7 +467,7 @@ static inline void free_part_info(struct hd_struct *part)
>>>   	kfree(part->info);
>>>   }
>>>   
>>> -void update_io_ticks(struct hd_struct *part, unsigned long now);
>>> +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
>>>   
>>>   /* block/genhd.c */
>>>   extern void device_add_disk(struct device *parent, struct gendisk *disk,
>>>
>>
>> Looks fine:
>>
>> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> 
> BTW, there is still some gap (65% vs. 99%) between this fix and the original
> accounting (before applying Mike/Mikulas's 7 patches), and it might be
> one thing to improve in the future.
> 
> 1) test: sda is a single-queue virtio-scsi disk, emulated by one HDD
> image
> 
> 2) fio test script:
> fio --direct=1 --size=128G --bsrange=4k-4k \
> 		--runtime=20 --numjobs=1 \
> 		--ioengine=libaio --iodepth=16 \
> 		--iodepth_batch_submit=16 \
> 		--iodepth_batch_complete_min=16 \
> 		--group_reporting=1 --filename=/dev/sda \
> 		--name=seq-test --rw=read
> 
> 3) result:
> - v5.6-rc with this patch
> Run status group 0 (all jobs):
>     READ: bw=79.4MiB/s (83.3MB/s), 79.4MiB/s-79.4MiB/s (83.3MB/s-83.3MB/s), io=1588MiB (1665MB), run=20001-20001msec
> 
> Disk stats (read/write):
>    sda: ios=25039/0, merge=375596/0, ticks=18823/0, in_queue=4330, util=99.43%
> 
> 
> - commit 112f158f66cb (which is previous commit of 5b18b5a73760)
> Run status group 0 (all jobs):
>     READ: bw=81.4MiB/s (85.3MB/s), 81.4MiB/s-81.4MiB/s (85.3MB/s-85.3MB/s), io=1628MiB (1707MB), run=20001-20001msec
> 
> Disk stats (read/write):
>    sda: ios=25749/0, merge=386236/0, ticks=17963/0, in_queue=12976, util=65.20%
> 

Yep, my patch guarantees correct time only for queue-depth=1:
it accounts jiffies between start -> end (and end -> end).
For queue-depth > 1, jiffies between start -> start are still lost.

Without an exact counter for in-flight requests we cannot distinguish
the cases start -> start and end -> start.

All we could do is smooth the statistics even more and
account all short gaps as activity:

--- a/block/bio.c
+++ b/block/bio.c
@@ -1769,6 +1769,7 @@ void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
         stamp = READ_ONCE(part->stamp);
         if (unlikely(stamp != now)) {
                 if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+                       end |= time_before(now, stamp + HZ / 100);
                         __part_stat_add(part, io_ticks, end ? now - stamp : 1);
                 }
         }
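Under that hunk, any gap shorter than HZ/100 jiffies (10 ms) since the last stamp would be charged as busy time even at a start event. A toy model (a sketch of the proposed logic, not the kernel code; the HZ value is an assumed build-time config):

```python
HZ = 1000  # assumed config value; HZ // 100 == 10 jiffies ~ 10 ms

def io_ticks_smoothed(events):
    """This patch plus the smoothing hunk above: gaps shorter than
    HZ/100 jiffies are accounted as activity even on a start event."""
    stamp = ticks = 0
    for now, kind in sorted(events):
        if now != stamp:
            end = kind == "end" or now < stamp + HZ // 100
            ticks += (now - stamp) if end else 1
            stamp = now
    return ticks

# Two 2-jiffy requests separated by a 3-jiffy gap (shorter than 10),
# then a third request after a long 43-jiffy gap.
events = [(0, "start"), (2, "end"),
          (5, "start"), (7, "end"),
          (50, "start"), (52, "end")]

# 2 + 3 (short gap counted as busy) + 2 + 1 (long gap: one tick) + 2
print(io_ticks_smoothed(events))  # 10
```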


> 
> Thanks,
> Ming
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-25  6:28       ` Konstantin Khlebnikov
@ 2020-03-25  8:02         ` Konstantin Khlebnikov
  2020-03-25  8:54           ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-25  8:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Jens Axboe, linux-kernel, Mikulas Patocka, Mike Snitzer



On 25/03/2020 09.28, Konstantin Khlebnikov wrote:
> 
> 
> On 25/03/2020 06.40, Ming Lei wrote:
>> On Tue, Mar 24, 2020 at 10:06:56PM +0800, Ming Lei wrote:
>>> On Tue, Mar 24, 2020 at 09:39:40AM +0300, Konstantin Khlebnikov wrote:
>>>> Currently io_ticks is approximated by adding one at each start and end of
>>>> requests if jiffies counter has changed. This works perfectly for requests
>>>> shorter than a jiffy or if one of requests starts/ends at each jiffy.
>>>>
>>>> If disk executes just one request at a time and they are longer than two
>>>> jiffies then only first and last jiffies will be accounted.
>>>>
>>>> Fix is simple: at the end of request add up into io_ticks jiffies passed
>>>> since last update rather than just one jiffy.
>>>>
>>>> Example: common HDD executes random read 4k requests around 12ms.
>>>>
>>>> fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
>>>> iostat -x 10 sdb
>>>>
>>>> Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
>>>>
>>>> Before:
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
>>>>
>>>> After:
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
>>>>
>>>> For load estimation "%util" is not as useful as average queue length,
>>>> but it clearly shows how often disk queue is completely empty.
>>>>
>>>> Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>> ---
>>>>   Documentation/admin-guide/iostats.rst |    5 ++++-
>>>>   block/bio.c                           |    8 ++++----
>>>>   block/blk-core.c                      |    4 ++--
>>>>   include/linux/genhd.h                 |    2 +-
>>>>   4 files changed, 11 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
>>>> index df5b8345c41d..9b14b0c2c9c4 100644
>>>> --- a/Documentation/admin-guide/iostats.rst
>>>> +++ b/Documentation/admin-guide/iostats.rst
>>>> @@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
>>>>       Since 5.0 this field counts jiffies when at least one request was
>>>>       started or completed. If request runs more than 2 jiffies then some
>>>> -    I/O time will not be accounted unless there are other requests.
>>>> +    I/O time might be not accounted in case of concurrent requests.
>>>>   Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
>>>>       This field is incremented at each I/O start, I/O completion, I/O
>>>> @@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are
>>>>   summed to) and the result given to the user.  There is no convenient
>>>>   user interface for accessing the per-CPU counters themselves.
>>>> +Since 4.19 request times are measured with nanoseconds precision and
>>>> +truncated to milliseconds before showing in this interface.
>>>> +
>>>>   Disks vs Partitions
>>>>   -------------------
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index 0985f3422556..b1053eb7af37 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -1762,14 +1762,14 @@ void bio_check_pages_dirty(struct bio *bio)
>>>>       schedule_work(&bio_dirty_work);
>>>>   }
>>>> -void update_io_ticks(struct hd_struct *part, unsigned long now)
>>>> +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
>>>>   {
>>>>       unsigned long stamp;
>>>>   again:
>>>>       stamp = READ_ONCE(part->stamp);
>>>>       if (unlikely(stamp != now)) {
>>>>           if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
>>>> -            __part_stat_add(part, io_ticks, 1);
>>>> +            __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>>>>           }
>>>>       }
>>>>       if (part->partno) {
>>>> @@ -1785,7 +1785,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
>>>>       part_stat_lock();
>>>> -    update_io_ticks(part, jiffies);
>>>> +    update_io_ticks(part, jiffies, false);
>>>>       part_stat_inc(part, ios[sgrp]);
>>>>       part_stat_add(part, sectors[sgrp], sectors);
>>>>       part_inc_in_flight(q, part, op_is_write(op));
>>>> @@ -1803,7 +1803,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
>>>>       part_stat_lock();
>>>> -    update_io_ticks(part, now);
>>>> +    update_io_ticks(part, now, true);
>>>>       part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
>>>>       part_stat_add(part, time_in_queue, duration);
>>>>       part_dec_in_flight(q, part, op_is_write(req_op));
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index abfdcf81a228..4401b30a1751 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1337,7 +1337,7 @@ void blk_account_io_done(struct request *req, u64 now)
>>>>           part_stat_lock();
>>>>           part = req->part;
>>>> -        update_io_ticks(part, jiffies);
>>>> +        update_io_ticks(part, jiffies, true);
>>>>           part_stat_inc(part, ios[sgrp]);
>>>>           part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
>>>>           part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
>>>> @@ -1379,7 +1379,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
>>>>           rq->part = part;
>>>>       }
>>>> -    update_io_ticks(part, jiffies);
>>>> +    update_io_ticks(part, jiffies, false);
>>>>       part_stat_unlock();
>>>>   }
>>>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>>>> index d5c75df64bba..f1066f10b062 100644
>>>> --- a/include/linux/genhd.h
>>>> +++ b/include/linux/genhd.h
>>>> @@ -467,7 +467,7 @@ static inline void free_part_info(struct hd_struct *part)
>>>>       kfree(part->info);
>>>>   }
>>>> -void update_io_ticks(struct hd_struct *part, unsigned long now);
>>>> +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
>>>>   /* block/genhd.c */
>>>>   extern void device_add_disk(struct device *parent, struct gendisk *disk,
>>>>
>>>
>>> Looks fine:
>>>
>>> Reviewed-by: Ming Lei <ming.lei@redhat.com>
>>
>> BTW, there is still some gap (65% vs. 99%) between this fix and the original
>> accounting (before applying Mike/Mikulas's 7 patches), and it might be
>> one thing to improve in the future.
>>
>> 1) test: sda is a single-queue virtio-scsi disk, emulated by one HDD
>> image
>>
>> 2) fio test script:
>> fio --direct=1 --size=128G --bsrange=4k-4k \
>>         --runtime=20 --numjobs=1 \
>>         --ioengine=libaio --iodepth=16 \
>>         --iodepth_batch_submit=16 \
>>         --iodepth_batch_complete_min=16 \
>>         --group_reporting=1 --filename=/dev/sda \
>>         --name=seq-test --rw=read
>>
>> 3) result:
>> - v5.6-rc with this patch
>> Run status group 0 (all jobs):
>>     READ: bw=79.4MiB/s (83.3MB/s), 79.4MiB/s-79.4MiB/s (83.3MB/s-83.3MB/s), io=1588MiB (1665MB), run=20001-20001msec
>>
>> Disk stats (read/write):
>>    sda: ios=25039/0, merge=375596/0, ticks=18823/0, in_queue=4330, util=99.43%
>>
>>
>> - commit 112f158f66cb (which is previous commit of 5b18b5a73760)
>> Run status group 0 (all jobs):
>>     READ: bw=81.4MiB/s (85.3MB/s), 81.4MiB/s-81.4MiB/s (85.3MB/s-85.3MB/s), io=1628MiB (1707MB), run=20001-20001msec
>>
>> Disk stats (read/write):
>>    sda: ios=25749/0, merge=386236/0, ticks=17963/0, in_queue=12976, util=65.20%
>>

Oh no, your result is the opposite.

Well, 99.43% with this patch is actually the more correct result:
every millisecond there is at least one request in the disk.

Old code sampled the in-flight state at the start and end of each request, not regularly every jiffy.
And it accounted the whole jiffy as inactive if in-flight was currently zero.
This way the statistics were biased toward samples where the queue is empty.
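That bias is easy to reproduce in a toy model of the old scheme (a sketch with a hypothetical workload, not the kernel code): if every request occupies most of a jiffy but the queue happens to be empty whenever the jiffy counter increments, zero busy time gets reported.

```python
import math

def io_ticks_old(events):
    """Pre-5b18b5a73760 model: on each event, the whole jiffies elapsed
    since the previous event count as busy iff a request is in flight.
    Event times are fractional jiffies; io_ticks only sees floor(t)."""
    stamp = ticks = inflight = 0
    for t, kind in sorted(events):
        now = math.floor(t)
        if inflight:
            ticks += now - stamp
        stamp = now
        inflight += 1 if kind == "start" else -1
    return ticks

# Each jiffy i: one request runs over [i, i + 0.9) -- the disk is 90%
# busy, but it is always idle when the jiffy counter increments.
events = [ev for i in range(10)
          for ev in ((float(i), "start"), (i + 0.9, "end"))]

print(io_ticks_old(events))  # 0 -- every jiffy is sampled as "empty"
```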

> 
> Yep, my patch guarantees correct time only for queue-depth=1:
> it accounts jiffies between start -> end (and end -> end).
> For queue-depth > 1, jiffies between start -> start are still lost.
> 
> Without an exact counter for in-flight requests we cannot distinguish
> the cases start -> start and end -> start.
> 
> All we could do is smooth the statistics even more and
> account all short gaps as activity:
> 
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1769,6 +1769,7 @@ void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
>          stamp = READ_ONCE(part->stamp);
>          if (unlikely(stamp != now)) {
>                  if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> +                       end |= time_before(now, stamp + HZ / 100);
>                          __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>                  }
>          }
> 
> 
>>
>> Thanks,
>> Ming
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-25  8:02         ` Konstantin Khlebnikov
@ 2020-03-25  8:54           ` Ming Lei
  2020-03-25 13:02             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2020-03-25  8:54 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-block, Jens Axboe, linux-kernel, Mikulas Patocka, Mike Snitzer

On Wed, Mar 25, 2020 at 11:02:39AM +0300, Konstantin Khlebnikov wrote:
> 
> 
> On 25/03/2020 09.28, Konstantin Khlebnikov wrote:
> > 
> > 
> > On 25/03/2020 06.40, Ming Lei wrote:
> > > On Tue, Mar 24, 2020 at 10:06:56PM +0800, Ming Lei wrote:
> > > > On Tue, Mar 24, 2020 at 09:39:40AM +0300, Konstantin Khlebnikov wrote:
> > > > > Currently io_ticks is approximated by adding one at each start and end of
> > > > > requests if jiffies counter has changed. This works perfectly for requests
> > > > > shorter than a jiffy or if one of requests starts/ends at each jiffy.
> > > > > 
> > > > > If disk executes just one request at a time and they are longer than two
> > > > > jiffies then only first and last jiffies will be accounted.
> > > > > 
> > > > > Fix is simple: at the end of request add up into io_ticks jiffies passed
> > > > > since last update rather than just one jiffy.
> > > > > 
> > > > > Example: common HDD executes random read 4k requests around 12ms.
> > > > > 
> > > > > fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
> > > > > iostat -x 10 sdb
> > > > > 
> > > > > Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
> > > > > 
> > > > > Before:
> > > > > 
> > > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > > > sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
> > > > > 
> > > > > After:
> > > > > 
> > > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > > > sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
> > > > > 
> > > > > For load estimation "%util" is not as useful as average queue length,
> > > > > but it clearly shows how often disk queue is completely empty.
> > > > > 
> > > > > Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
> > > > > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> > > > > ---
> > > > >   Documentation/admin-guide/iostats.rst |    5 ++++-
> > > > >   block/bio.c                           |    8 ++++----
> > > > >   block/blk-core.c                      |    4 ++--
> > > > >   include/linux/genhd.h                 |    2 +-
> > > > >   4 files changed, 11 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
> > > > > index df5b8345c41d..9b14b0c2c9c4 100644
> > > > > --- a/Documentation/admin-guide/iostats.rst
> > > > > +++ b/Documentation/admin-guide/iostats.rst
> > > > > @@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
> > > > >       Since 5.0 this field counts jiffies when at least one request was
> > > > >       started or completed. If request runs more than 2 jiffies then some
> > > > > -    I/O time will not be accounted unless there are other requests.
> > > > > +    I/O time might be not accounted in case of concurrent requests.
> > > > >   Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
> > > > >       This field is incremented at each I/O start, I/O completion, I/O
> > > > > @@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are
> > > > >   summed to) and the result given to the user.  There is no convenient
> > > > >   user interface for accessing the per-CPU counters themselves.
> > > > > +Since 4.19 request times are measured with nanoseconds precision and
> > > > > +truncated to milliseconds before showing in this interface.
> > > > > +
> > > > >   Disks vs Partitions
> > > > >   -------------------
> > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > index 0985f3422556..b1053eb7af37 100644
> > > > > --- a/block/bio.c
> > > > > +++ b/block/bio.c
> > > > > @@ -1762,14 +1762,14 @@ void bio_check_pages_dirty(struct bio *bio)
> > > > >       schedule_work(&bio_dirty_work);
> > > > >   }
> > > > > -void update_io_ticks(struct hd_struct *part, unsigned long now)
> > > > > +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> > > > >   {
> > > > >       unsigned long stamp;
> > > > >   again:
> > > > >       stamp = READ_ONCE(part->stamp);
> > > > >       if (unlikely(stamp != now)) {
> > > > >           if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> > > > > -            __part_stat_add(part, io_ticks, 1);
> > > > > +            __part_stat_add(part, io_ticks, end ? now - stamp : 1);
> > > > >           }
> > > > >       }
> > > > >       if (part->partno) {
> > > > > @@ -1785,7 +1785,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
> > > > >       part_stat_lock();
> > > > > -    update_io_ticks(part, jiffies);
> > > > > +    update_io_ticks(part, jiffies, false);
> > > > >       part_stat_inc(part, ios[sgrp]);
> > > > >       part_stat_add(part, sectors[sgrp], sectors);
> > > > >       part_inc_in_flight(q, part, op_is_write(op));
> > > > > @@ -1803,7 +1803,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
> > > > >       part_stat_lock();
> > > > > -    update_io_ticks(part, now);
> > > > > +    update_io_ticks(part, now, true);
> > > > >       part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
> > > > >       part_stat_add(part, time_in_queue, duration);
> > > > >       part_dec_in_flight(q, part, op_is_write(req_op));
> > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > index abfdcf81a228..4401b30a1751 100644
> > > > > --- a/block/blk-core.c
> > > > > +++ b/block/blk-core.c
> > > > > @@ -1337,7 +1337,7 @@ void blk_account_io_done(struct request *req, u64 now)
> > > > >           part_stat_lock();
> > > > >           part = req->part;
> > > > > -        update_io_ticks(part, jiffies);
> > > > > +        update_io_ticks(part, jiffies, true);
> > > > >           part_stat_inc(part, ios[sgrp]);
> > > > >           part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
> > > > >           part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
> > > > > @@ -1379,7 +1379,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
> > > > >           rq->part = part;
> > > > >       }
> > > > > -    update_io_ticks(part, jiffies);
> > > > > +    update_io_ticks(part, jiffies, false);
> > > > >       part_stat_unlock();
> > > > >   }
> > > > > diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> > > > > index d5c75df64bba..f1066f10b062 100644
> > > > > --- a/include/linux/genhd.h
> > > > > +++ b/include/linux/genhd.h
> > > > > @@ -467,7 +467,7 @@ static inline void free_part_info(struct hd_struct *part)
> > > > >       kfree(part->info);
> > > > >   }
> > > > > -void update_io_ticks(struct hd_struct *part, unsigned long now);
> > > > > +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
> > > > >   /* block/genhd.c */
> > > > >   extern void device_add_disk(struct device *parent, struct gendisk *disk,
> > > > > 
> > > > 
> > > > Looks fine:
> > > > 
> > > > Reviewed-by: Ming Lei <ming.lei@redhat.com>
> > > 
> > > BTW, there is still some gap (65% vs. 99%) between this fix and the original
> > > accounting (before applying Mike/Mikulas's 7 patches), and it might be
> > > one thing to improve in the future.
> > > 
> > > 1) test: sda is a single-queue virtio-scsi disk, emulated by one HDD
> > > image
> > > 
> > > 2) fio test script:
> > > fio --direct=1 --size=128G --bsrange=4k-4k \
> > >         --runtime=20 --numjobs=1 \
> > >         --ioengine=libaio --iodepth=16 \
> > >         --iodepth_batch_submit=16 \
> > >         --iodepth_batch_complete_min=16 \
> > >         --group_reporting=1 --filename=/dev/sda \
> > >         --name=seq-test --rw=read
> > > 
> > > 3) result:
> > > - v5.6-rc with this patch
> > > Run status group 0 (all jobs):
> > >     READ: bw=79.4MiB/s (83.3MB/s), 79.4MiB/s-79.4MiB/s (83.3MB/s-83.3MB/s), io=1588MiB (1665MB), run=20001-20001msec
> > > 
> > > Disk stats (read/write):
> > >    sda: ios=25039/0, merge=375596/0, ticks=18823/0, in_queue=4330, util=99.43%
> > > 
> > > 
> > > - commit 112f158f66cb (which is previous commit of 5b18b5a73760)
> > > Run status group 0 (all jobs):
> > >     READ: bw=81.4MiB/s (85.3MB/s), 81.4MiB/s-81.4MiB/s (85.3MB/s-85.3MB/s), io=166
> > > 28MiB (1707MB), run=20001-20001msec
> > > 
> > > Disk stats (read/write):
> > >    sda: ios=25749/0, merge=386236/0, ticks=17963/0, in_queue=12976, util=65.20%
> > > 
> 
> Oh, no. Your result is the opposite.
> 
> Well, 99.43% with this patch is actually the more correct result:
> every millisecond there is at least one request in the disk.
> 
> Old code sampled the in-flight count at the start and end of each request, not regularly every jiffy.

It doesn't matter whether it is regularly every jiffy, or at some other sample point.

It is perfectly fine to sample just at start and end; it isn't
necessary to do it on merge.

What matters is that only I/O time is accounted: when there isn't
any I/O in flight, that time shouldn't be accounted into io_ticks.
That is it; the new approach, however, can't do that at all.

> And it accounted the whole jiffy as inactive if in-flight was currently zero.
> This way the statistics were biased toward samples where the queue is empty.

It is just one sequential read test with a single job, and there are lots of
merges, so disk utilization shouldn't be close to 100%, should it?

With your patch, it is now much easier to observe 100% utilization.

More ticks are counted in v5.6-rc with this patch than in the old kernel; the
reason is that the new approach counts 1 tick for idle time (no I/O at all)
in blk_account_io_start(). The old approach knows whether there is any in-flight
I/O; if there isn't, the period since the last stamp is treated as I/O idle
time, and that time won't be accounted.

However, the new approach doesn't know about the I/O idle period at all; it just
adds 1 tick for this period.
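To make the difference concrete, the two schemes can be modeled in a few lines of user-space code (a purely illustrative sketch, not kernel code: it ignores the cmpxchg race handling in update_io_ticks() and assumes one request in flight at a time, as in Konstantin's randread example):

```python
def io_ticks(events, patched):
    """Simulate io_ticks for a stream of (jiffy, kind) stat events,
    kind being "start" or "end". Mirrors the stamp/io_ticks update
    logic, without the cmpxchg loop."""
    stamp = events[0][0]
    ticks = 0
    for now, kind in events:
        if now != stamp:
            # pre-patch: always +1; patched: add elapsed jiffies on completion
            ticks += (now - stamp) if (patched and kind == "end") else 1
            stamp = now
    return ticks

# one 12-jiffy request at a time, back to back, over 1200 jiffies (HZ=1000)
events = sorted((j, k) for i in range(100)
                for j, k in ((i * 12, "start"), (i * 12 + 12, "end")))
for patched in (False, True):
    util = 100.0 * io_ticks(events, patched) / 1200
    print("patched" if patched else "pre-patch", f"util={util:.1f}%")
```

This reproduces the ~8% vs ~100% %util gap from the commit message: pre-patch, only the jiffies containing a start/end event are counted.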


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-25  8:54           ` Ming Lei
@ 2020-03-25 13:02             ` Konstantin Khlebnikov
  2020-03-26  7:53               ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-25 13:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Jens Axboe, linux-kernel, Mikulas Patocka, Mike Snitzer

On 25/03/2020 11.54, Ming Lei wrote:
> On Wed, Mar 25, 2020 at 11:02:39AM +0300, Konstantin Khlebnikov wrote:
>>
>>
>> On 25/03/2020 09.28, Konstantin Khlebnikov wrote:
>>>
>>>
>>> On 25/03/2020 06.40, Ming Lei wrote:
>>>> On Tue, Mar 24, 2020 at 10:06:56PM +0800, Ming Lei wrote:
>>>>> On Tue, Mar 24, 2020 at 09:39:40AM +0300, Konstantin Khlebnikov wrote:
>>>>>> Currently io_ticks is approximated by adding one at each start and end of
>>>>>> requests if jiffies counter has changed. This works perfectly for requests
>>>>>> shorter than a jiffy or if one of requests starts/ends at each jiffy.
>>>>>>
>>>>>> If disk executes just one request at a time and they are longer than two
>>>>>> jiffies then only first and last jiffies will be accounted.
>>>>>>
>>>>>> Fix is simple: at the end of request add up into io_ticks jiffies passed
>>>>>> since last update rather than just one jiffy.
>>>>>>
>>>>>> Example: common HDD executes random read 4k requests around 12ms.
>>>>>>
>>>>>> fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
>>>>>> iostat -x 10 sdb
>>>>>>
>>>>>> Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
>>>>>>
>>>>>> Before:
>>>>>>
>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
>>>>>>
>>>>>> After:
>>>>>>
>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
>>>>>>
>>>>>> For load estimation "%util" is not as useful as average queue length,
>>>>>> but it clearly shows how often disk queue is completely empty.
>>>>>>
>>>>>> Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>>>> ---
>>>>>>    Documentation/admin-guide/iostats.rst |    5 ++++-
>>>>>>    block/bio.c                           |    8 ++++----
>>>>>>    block/blk-core.c                      |    4 ++--
>>>>>>    include/linux/genhd.h                 |    2 +-
>>>>>>    4 files changed, 11 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
>>>>>> index df5b8345c41d..9b14b0c2c9c4 100644
>>>>>> --- a/Documentation/admin-guide/iostats.rst
>>>>>> +++ b/Documentation/admin-guide/iostats.rst
>>>>>> @@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
>>>>>>        Since 5.0 this field counts jiffies when at least one request was
>>>>>>        started or completed. If request runs more than 2 jiffies then some
>>>>>> -    I/O time will not be accounted unless there are other requests.
>>>>>> +    I/O time might be not accounted in case of concurrent requests.
>>>>>>    Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
>>>>>>        This field is incremented at each I/O start, I/O completion, I/O
>>>>>> @@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are
>>>>>>    summed to) and the result given to the user.  There is no convenient
>>>>>>    user interface for accessing the per-CPU counters themselves.
>>>>>> +Since 4.19 request times are measured with nanoseconds precision and
>>>>>> +truncated to milliseconds before showing in this interface.
>>>>>> +
>>>>>>    Disks vs Partitions
>>>>>>    -------------------
>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>> index 0985f3422556..b1053eb7af37 100644
>>>>>> --- a/block/bio.c
>>>>>> +++ b/block/bio.c
>>>>>> @@ -1762,14 +1762,14 @@ void bio_check_pages_dirty(struct bio *bio)
>>>>>>        schedule_work(&bio_dirty_work);
>>>>>>    }
>>>>>> -void update_io_ticks(struct hd_struct *part, unsigned long now)
>>>>>> +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
>>>>>>    {
>>>>>>        unsigned long stamp;
>>>>>>    again:
>>>>>>        stamp = READ_ONCE(part->stamp);
>>>>>>        if (unlikely(stamp != now)) {
>>>>>>            if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
>>>>>> -            __part_stat_add(part, io_ticks, 1);
>>>>>> +            __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>>>>>>            }
>>>>>>        }
>>>>>>        if (part->partno) {
>>>>>> @@ -1785,7 +1785,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
>>>>>>        part_stat_lock();
>>>>>> -    update_io_ticks(part, jiffies);
>>>>>> +    update_io_ticks(part, jiffies, false);
>>>>>>        part_stat_inc(part, ios[sgrp]);
>>>>>>        part_stat_add(part, sectors[sgrp], sectors);
>>>>>>        part_inc_in_flight(q, part, op_is_write(op));
>>>>>> @@ -1803,7 +1803,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
>>>>>>        part_stat_lock();
>>>>>> -    update_io_ticks(part, now);
>>>>>> +    update_io_ticks(part, now, true);
>>>>>>        part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
>>>>>>        part_stat_add(part, time_in_queue, duration);
>>>>>>        part_dec_in_flight(q, part, op_is_write(req_op));
>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>>> index abfdcf81a228..4401b30a1751 100644
>>>>>> --- a/block/blk-core.c
>>>>>> +++ b/block/blk-core.c
>>>>>> @@ -1337,7 +1337,7 @@ void blk_account_io_done(struct request *req, u64 now)
>>>>>>            part_stat_lock();
>>>>>>            part = req->part;
>>>>>> -        update_io_ticks(part, jiffies);
>>>>>> +        update_io_ticks(part, jiffies, true);
>>>>>>            part_stat_inc(part, ios[sgrp]);
>>>>>>            part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
>>>>>>            part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
>>>>>> @@ -1379,7 +1379,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
>>>>>>            rq->part = part;
>>>>>>        }
>>>>>> -    update_io_ticks(part, jiffies);
>>>>>> +    update_io_ticks(part, jiffies, false);
>>>>>>        part_stat_unlock();
>>>>>>    }
>>>>>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>>>>>> index d5c75df64bba..f1066f10b062 100644
>>>>>> --- a/include/linux/genhd.h
>>>>>> +++ b/include/linux/genhd.h
>>>>>> @@ -467,7 +467,7 @@ static inline void free_part_info(struct hd_struct *part)
>>>>>>        kfree(part->info);
>>>>>>    }
>>>>>> -void update_io_ticks(struct hd_struct *part, unsigned long now);
>>>>>> +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
>>>>>>    /* block/genhd.c */
>>>>>>    extern void device_add_disk(struct device *parent, struct gendisk *disk,
>>>>>>
>>>>>
>>>>> Looks fine:
>>>>>
>>>>> Reviewed-by: Ming Lei <ming.lei@redhat.com>
>>>>
>>>> BTW, there is still some gap (65% vs. 99%) between this fix and the original
>>>> accounting (before applying Mike/Mikulas's 7 patches), and it might be
>>>> one thing to improve in the future.
>>>>
>>>> 1) test, sda is single queue virtio-scsi, which is emulated by one HDD
>>>> image
>>>>
>>>> 2) fio test script:
>>>> fio --direct=1 --size=128G --bsrange=4k-4k \
>>>>          --runtime=20 --numjobs=1 \
>>>>          --ioengine=libaio --iodepth=16 \
>>>>          --iodepth_batch_submit=16 \
>>>>          --iodepth_batch_complete_min=16 \
>>>>          --group_reporting=1 --filename=/dev/sda \
>>>>          --name=seq-test --rw=read
>>>>
>>>> 3) result:
>>>> - v5.6-rc with this patch
>>>> Run status group 0 (all jobs):
>>>>      READ: bw=79.4MiB/s (83.3MB/s), 79.4MiB/s-79.4MiB/s (83.3MB/s-83.3MB/s), io=155
>>>> 88MiB (1665MB), run=20001-20001msec
>>>>
>>>> Disk stats (read/write):
>>>>     sda: ios=25039/0, merge=375596/0, ticks=18823/0, in_queue=4330, util=99.43%
>>>>
>>>>
>>>> - commit 112f158f66cb (which is previous commit of 5b18b5a73760)
>>>> Run status group 0 (all jobs):
>>>>      READ: bw=81.4MiB/s (85.3MB/s), 81.4MiB/s-81.4MiB/s (85.3MB/s-85.3MB/s), io=166
>>>> 28MiB (1707MB), run=20001-20001msec
>>>>
>>>> Disk stats (read/write):
>>>>     sda: ios=25749/0, merge=386236/0, ticks=17963/0, in_queue=12976, util=65.20%
>>>>
>>
>> Oh, no. Your result is the opposite.
>> 
>> Well, 99.43% with this patch is actually the more correct result:
>> every millisecond there is at least one request in the disk.
>> 
>> Old code sampled the in-flight count at the start and end of each request, not regularly every jiffy.
> 
> It doesn't matter whether it is regularly every jiffy, or at some other sample point.
> 
> It is perfectly fine to sample just at start and end; it isn't
> necessary to do it on merge.
> 
> What matters is that only I/O time is accounted: when there isn't
> any I/O in flight, that time shouldn't be accounted into io_ticks.
> That is it; the new approach, however, can't do that at all.

Yeah, that should be right, but something is fishy in the old statistics anyway.

It looks like the timestamp (part->stamp) was sometimes updated non-atomically
without the queue_lock.

> 
>> And it accounted the whole jiffy as inactive if in-flight was currently zero.
>> This way the statistics were biased toward samples where the queue is empty.
> 
> It is just one sequential read test with a single job, and there are lots of
> merges, so disk utilization shouldn't be close to 100%, should it?

Why not? It doesn't take long to complete a request and issue another.

Look, for single-thread fio iostat shows %util 100%, but avgqu-sz is 1 too;
it's estimated using total I/O time, which is counted in nanoseconds.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   83,30    0,00   333,20     0,00     8,00     1,00   11,98   11,98    0,00  12,00 100,00

Probably the old block stack was generally slower and delayed requests somewhere.
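For reference, the two numbers come from different diskstats fields: %util from io_ticks (field 10) and avgqu-sz from the weighted I/O time (field 11). A sketch of how an iostat-like tool derives them from two samples (illustrative names, not iostat's actual code):

```python
def iostat_metrics(prev, curr, interval_ms):
    """Compute %util and avgqu-sz from two diskstats samples taken
    interval_ms apart. Each sample is a dict with io_ticks (field 10,
    ms the device was busy) and weighted_ms (field 11, summed
    per-request ms)."""
    busy_ms = curr["io_ticks"] - prev["io_ticks"]
    queued_ms = curr["weighted_ms"] - prev["weighted_ms"]
    util = 100.0 * busy_ms / interval_ms     # fraction of wall time device was busy
    avgqu_sz = queued_ms / interval_ms       # mean number of requests in flight
    return util, avgqu_sz

# a disk that always has exactly one request in flight over a 1 s interval:
prev = {"io_ticks": 0, "weighted_ms": 0}
curr = {"io_ticks": 1000, "weighted_ms": 1000}
print(iostat_metrics(prev, curr, 1000))   # → (100.0, 1.0)
```

This is why %util 100% and avgqu-sz 1.00 are consistent for the single-thread case above: one request always in flight gives busy time equal to the interval and a mean queue depth of one.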

> 
> With your patch, it is now much easier to observe 100% utilization.
> 
> More ticks are counted in v5.6-rc with this patch than in the old kernel; the
> reason is that the new approach counts 1 tick for idle time (no I/O at all)
> in blk_account_io_start(). The old approach knows whether there is any in-flight
> I/O; if there isn't, the period since the last stamp is treated as I/O idle
> time, and that time won't be accounted.
> 
> However, the new approach doesn't know about the I/O idle period at all; it just
> adds 1 tick for this period.

There is I/O during the jiffy when the request starts =) So it's counted as 1.
This is a kind of rounding up for I/O shorter than a jiffy.

> 
> 
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks
  2020-03-25 13:02             ` Konstantin Khlebnikov
@ 2020-03-26  7:53               ` Ming Lei
  0 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2020-03-26  7:53 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-block, Jens Axboe, linux-kernel, Mikulas Patocka, Mike Snitzer

On Wed, Mar 25, 2020 at 04:02:25PM +0300, Konstantin Khlebnikov wrote:
> On 25/03/2020 11.54, Ming Lei wrote:
> > On Wed, Mar 25, 2020 at 11:02:39AM +0300, Konstantin Khlebnikov wrote:
> > > 
> > > 
> > > On 25/03/2020 09.28, Konstantin Khlebnikov wrote:
> > > > 
> > > > 
> > > > On 25/03/2020 06.40, Ming Lei wrote:
> > > > > On Tue, Mar 24, 2020 at 10:06:56PM +0800, Ming Lei wrote:
> > > > > > On Tue, Mar 24, 2020 at 09:39:40AM +0300, Konstantin Khlebnikov wrote:
> > > > > > > Currently io_ticks is approximated by adding one at each start and end of
> > > > > > > requests if jiffies counter has changed. This works perfectly for requests
> > > > > > > shorter than a jiffy or if one of requests starts/ends at each jiffy.
> > > > > > > 
> > > > > > > If disk executes just one request at a time and they are longer than two
> > > > > > > jiffies then only first and last jiffies will be accounted.
> > > > > > > 
> > > > > > > Fix is simple: at the end of request add up into io_ticks jiffies passed
> > > > > > > since last update rather than just one jiffy.
> > > > > > > 
> > > > > > > Example: common HDD executes random read 4k requests around 12ms.
> > > > > > > 
> > > > > > > fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
> > > > > > > iostat -x 10 sdb
> > > > > > > 
> > > > > > > Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
> > > > > > > 
> > > > > > > Before:
> > > > > > > 
> > > > > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > > > > > sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
> > > > > > > 
> > > > > > > After:
> > > > > > > 
> > > > > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > > > > > sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
> > > > > > > 
> > > > > > > For load estimation "%util" is not as useful as average queue length,
> > > > > > > but it clearly shows how often disk queue is completely empty.
> > > > > > > 
> > > > > > > Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
> > > > > > > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> > > > > > > ---
> > > > > > >    Documentation/admin-guide/iostats.rst |    5 ++++-
> > > > > > >    block/bio.c                           |    8 ++++----
> > > > > > >    block/blk-core.c                      |    4 ++--
> > > > > > >    include/linux/genhd.h                 |    2 +-
> > > > > > >    4 files changed, 11 insertions(+), 8 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
> > > > > > > index df5b8345c41d..9b14b0c2c9c4 100644
> > > > > > > --- a/Documentation/admin-guide/iostats.rst
> > > > > > > +++ b/Documentation/admin-guide/iostats.rst
> > > > > > > @@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
> > > > > > >        Since 5.0 this field counts jiffies when at least one request was
> > > > > > >        started or completed. If request runs more than 2 jiffies then some
> > > > > > > -    I/O time will not be accounted unless there are other requests.
> > > > > > > +    I/O time might be not accounted in case of concurrent requests.
> > > > > > >    Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
> > > > > > >        This field is incremented at each I/O start, I/O completion, I/O
> > > > > > > @@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are
> > > > > > >    summed to) and the result given to the user.  There is no convenient
> > > > > > >    user interface for accessing the per-CPU counters themselves.
> > > > > > > +Since 4.19 request times are measured with nanoseconds precision and
> > > > > > > +truncated to milliseconds before showing in this interface.
> > > > > > > +
> > > > > > >    Disks vs Partitions
> > > > > > >    -------------------
> > > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > > index 0985f3422556..b1053eb7af37 100644
> > > > > > > --- a/block/bio.c
> > > > > > > +++ b/block/bio.c
> > > > > > > @@ -1762,14 +1762,14 @@ void bio_check_pages_dirty(struct bio *bio)
> > > > > > >        schedule_work(&bio_dirty_work);
> > > > > > >    }
> > > > > > > -void update_io_ticks(struct hd_struct *part, unsigned long now)
> > > > > > > +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> > > > > > >    {
> > > > > > >        unsigned long stamp;
> > > > > > >    again:
> > > > > > >        stamp = READ_ONCE(part->stamp);
> > > > > > >        if (unlikely(stamp != now)) {
> > > > > > >            if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> > > > > > > -            __part_stat_add(part, io_ticks, 1);
> > > > > > > +            __part_stat_add(part, io_ticks, end ? now - stamp : 1);
> > > > > > >            }
> > > > > > >        }
> > > > > > >        if (part->partno) {
> > > > > > > @@ -1785,7 +1785,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
> > > > > > >        part_stat_lock();
> > > > > > > -    update_io_ticks(part, jiffies);
> > > > > > > +    update_io_ticks(part, jiffies, false);
> > > > > > >        part_stat_inc(part, ios[sgrp]);
> > > > > > >        part_stat_add(part, sectors[sgrp], sectors);
> > > > > > >        part_inc_in_flight(q, part, op_is_write(op));
> > > > > > > @@ -1803,7 +1803,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
> > > > > > >        part_stat_lock();
> > > > > > > -    update_io_ticks(part, now);
> > > > > > > +    update_io_ticks(part, now, true);
> > > > > > >        part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
> > > > > > >        part_stat_add(part, time_in_queue, duration);
> > > > > > >        part_dec_in_flight(q, part, op_is_write(req_op));
> > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > > > index abfdcf81a228..4401b30a1751 100644
> > > > > > > --- a/block/blk-core.c
> > > > > > > +++ b/block/blk-core.c
> > > > > > > @@ -1337,7 +1337,7 @@ void blk_account_io_done(struct request *req, u64 now)
> > > > > > >            part_stat_lock();
> > > > > > >            part = req->part;
> > > > > > > -        update_io_ticks(part, jiffies);
> > > > > > > +        update_io_ticks(part, jiffies, true);
> > > > > > >            part_stat_inc(part, ios[sgrp]);
> > > > > > >            part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
> > > > > > >            part_stat_add(part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
> > > > > > > @@ -1379,7 +1379,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
> > > > > > >            rq->part = part;
> > > > > > >        }
> > > > > > > -    update_io_ticks(part, jiffies);
> > > > > > > +    update_io_ticks(part, jiffies, false);
> > > > > > >        part_stat_unlock();
> > > > > > >    }
> > > > > > > diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> > > > > > > index d5c75df64bba..f1066f10b062 100644
> > > > > > > --- a/include/linux/genhd.h
> > > > > > > +++ b/include/linux/genhd.h
> > > > > > > @@ -467,7 +467,7 @@ static inline void free_part_info(struct hd_struct *part)
> > > > > > >        kfree(part->info);
> > > > > > >    }
> > > > > > > -void update_io_ticks(struct hd_struct *part, unsigned long now);
> > > > > > > +void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
> > > > > > >    /* block/genhd.c */
> > > > > > >    extern void device_add_disk(struct device *parent, struct gendisk *disk,
> > > > > > > 
> > > > > > 
> > > > > > Looks fine:
> > > > > > 
> > > > > > Reviewed-by: Ming Lei <ming.lei@redhat.com>
> > > > > 
> > > > > BTW, there is still some gap (65% vs. 99%) between this fix and the original
> > > > > accounting (before applying Mike/Mikulas's 7 patches), and it might be
> > > > > one thing to improve in the future.
> > > > > 
> > > > > 1) test, sda is single queue virtio-scsi, which is emulated by one HDD
> > > > > image
> > > > > 
> > > > > 2) fio test script:
> > > > > fio --direct=1 --size=128G --bsrange=4k-4k \
> > > > >          --runtime=20 --numjobs=1 \
> > > > >          --ioengine=libaio --iodepth=16 \
> > > > >          --iodepth_batch_submit=16 \
> > > > >          --iodepth_batch_complete_min=16 \
> > > > >          --group_reporting=1 --filename=/dev/sda \
> > > > >          --name=seq-test --rw=read
> > > > > 
> > > > > 3) result:
> > > > > - v5.6-rc with this patch
> > > > > Run status group 0 (all jobs):
> > > > >      READ: bw=79.4MiB/s (83.3MB/s), 79.4MiB/s-79.4MiB/s (83.3MB/s-83.3MB/s), io=155
> > > > > 88MiB (1665MB), run=20001-20001msec
> > > > > 
> > > > > Disk stats (read/write):
> > > > >     sda: ios=25039/0, merge=375596/0, ticks=18823/0, in_queue=4330, util=99.43%
> > > > > 
> > > > > 
> > > > > - commit 112f158f66cb (which is previous commit of 5b18b5a73760)
> > > > > Run status group 0 (all jobs):
> > > > >      READ: bw=81.4MiB/s (85.3MB/s), 81.4MiB/s-81.4MiB/s (85.3MB/s-85.3MB/s), io=166
> > > > > 28MiB (1707MB), run=20001-20001msec
> > > > > 
> > > > > Disk stats (read/write):
> > > > >     sda: ios=25749/0, merge=386236/0, ticks=17963/0, in_queue=12976, util=65.20%
> > > > > 
> > > 
> > > Oh, no. Your result is the opposite.
> > > 
> > > Well, 99.43% with this patch is actually the more correct result:
> > > every millisecond there is at least one request in the disk.
> > > 
> > > Old code sampled the in-flight count at the start and end of each request, not regularly every jiffy.
> > 
> > It doesn't matter whether it is regularly every jiffy, or at some other sample point.
> > 
> > It is perfectly fine to sample just at start and end; it isn't
> > necessary to do it on merge.
> > 
> > What matters is that only I/O time is accounted: when there isn't
> > any I/O in flight, that time shouldn't be accounted into io_ticks.
> > That is it; the new approach, however, can't do that at all.
> 
> Yeah, that should be right, but something is fishy in the old statistics anyway.
> 
> It looks like the timestamp (part->stamp) was sometimes updated non-atomically
> without the queue_lock.

part->stamp is defined as 'unsigned long', and it is updated atomically.

> 
> > 
> > And it accounted the whole jiffy as inactive if in-flight was currently zero.
> > This way the statistics were biased toward samples where the queue is empty.
> > 
> > It is just one sequential read test with a single job, and there are lots of
> > merges, so disk utilization shouldn't be close to 100%, should it?
> 
> Why not? It doesn't take long to complete a request and issue another.

It still takes a bit of time to complete a request, and there are lots of
things to do: dio completion, aio completion, waking up fio to reap
the aio, fio preparation & submission. In my VM this time is often < 1ms,
and completing one sequential IO often takes less than 1ms too, so the
ratio between the two is really visible. Adding one extra tick when
starting a request can cause bigger utilization, given that one tick is
1ms with 1000HZ.

I do have such an IO trace, and I can share it with you if you want to take
a look.

A similar result can be observed at a queue depth of one too; given that it
is sequential IO, plug merging will merge each batch submission
into one request.

I believe ~60% represents the correct util%, and the data can be
computed from userspace easily, following this approach:

1) attach two probes on blk_account_io_start/blk_account_io_complete,
done via bcc/bpftrace

2) use the timestamps collected by the above two probes to calculate %util;
the algorithm simply merges the per-request times into maximal I/O busy ranges

I have written a python script to verify the correct disk utilization,
and it shows that the old kernel's result is correct.
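The interval-merging step can be sketched like this (a hypothetical stand-in for the script mentioned above; it assumes the two probes yield one (start, end) timestamp pair per request):

```python
def busy_time(intervals):
    """Merge per-request (start, end) intervals and return the total
    time with at least one request in flight."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:   # gap: close previous busy range
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # overlap: extend current range
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy

# two overlapping requests plus one after an idle gap, over a 25-unit window:
ios = [(0, 10), (5, 12), (20, 25)]
print(100.0 * busy_time(ios) / 25)   # → 68.0 (%util)
```

By this definition, time where no request is in flight is never counted as busy, which is exactly the property Ming argues io_ticks should preserve.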

> 
> Look, for single-thread fio iostat shows %util 100%, but avgqu-sz is 1 too;
> it's estimated using total I/O time, which is counted in nanoseconds.
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb               0,00     0,00   83,30    0,00   333,20     0,00     8,00     1,00   11,98   11,98    0,00  12,00 100,00
> 
> Probably the old block stack was generally slower and delayed requests somewhere.

No, as you see, the throughput data is basically the same.

> 
> > 
> > With your patch, it is now much easier to observe 100% utilization.
> > 
> > More ticks are counted in v5.6-rc with this patch than in the old kernel; the
> > reason is that the new approach counts 1 tick for idle time (no I/O at all)
> > in blk_account_io_start(). The old approach knows whether there is any in-flight
> > I/O; if there isn't, the period since the last stamp is treated as I/O idle
> > time, and that time won't be accounted.
> > 
> > However, the new approach doesn't know about the I/O idle period at all; it just
> > adds 1 tick for this period.
> 
> There is I/O during the jiffy when the request starts =) So it's counted as 1.
> This is a kind of rounding up for I/O shorter than a jiffy.

First, if the current request is the first in-flight I/O, that 1 tick
shouldn't be counted.

Second, if the currently started request isn't the first in-flight I/O,
1 tick may be too much if the timestamp wasn't updated at the last I/O completion.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-03-26  7:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-24  6:39 [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Konstantin Khlebnikov
2020-03-24  6:39 ` [PATCH v3 1/3] block/diskstats: more accurate approximation of io_ticks for slow disks Konstantin Khlebnikov
2020-03-24 14:06   ` Ming Lei
2020-03-25  3:40     ` Ming Lei
2020-03-25  6:28       ` Konstantin Khlebnikov
2020-03-25  8:02         ` Konstantin Khlebnikov
2020-03-25  8:54           ` Ming Lei
2020-03-25 13:02             ` Konstantin Khlebnikov
2020-03-26  7:53               ` Ming Lei
2020-03-24  6:39 ` [PATCH v3 2/3] block/diskstats: accumulate all per-cpu counters in one pass Konstantin Khlebnikov
2020-03-24  6:39 ` [PATCH v3 3/3] block/diskstats: replace time_in_queue with sum of request times Konstantin Khlebnikov
2020-03-24 16:06 ` [PATCH v3 0/3] block/diskstats: more accurate io_ticks and optimization Jens Axboe
