linux-kernel.vger.kernel.org archive mirror
* [PATCH] block: small performance optimization
@ 2014-05-09  9:17 Matias Bjørling
  2014-05-09  9:17 ` [PATCH] block: per-cpu counters for in-flight IO accounting Matias Bjørling
  0 siblings, 1 reply; 14+ messages in thread
From: Matias Bjørling @ 2014-05-09  9:17 UTC (permalink / raw)
  To: axboe, sbradshaw; +Cc: linux-kernel, Matias Bjørling

Hi Jens,

The IO accounting stats contribute significant overhead on high-IOPS/multi-node
workloads. By moving the in_flight IO tracking to per-cpu counters, we can
alleviate some of the cache coherence overhead, at the cost of setup/teardown
and a percpu counter structure within the hd_struct structure.

Sam Bradshaw reported an IOPS increase of 20% on his fio workloads.

Thanks,
Matias

Matias Bjørling (1):
  blk: per-cpu counters for in-flight IO accounting

 block/partition-generic.c |  4 ++--
 include/linux/genhd.h     | 25 ++++++++++++++++++-------
 2 files changed, 20 insertions(+), 9 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-05-09  9:17 [PATCH] block: small performance optimization Matias Bjørling
@ 2014-05-09  9:17 ` Matias Bjørling
  2014-05-09 14:12   ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Matias Bjørling @ 2014-05-09  9:17 UTC (permalink / raw)
  To: axboe, sbradshaw; +Cc: linux-kernel, Matias Bjørling

With multi-million IOPS and multi-node workloads, the atomic_t in_flight
tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
counters to alleviate it.

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 block/partition-generic.c |  4 ++--
 include/linux/genhd.h     | 25 ++++++++++++++++++-------
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 789cdea..b21937a 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -140,8 +140,8 @@ ssize_t part_inflight_show(struct device *dev,
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
+	return sprintf(buf, "%8lld %8lld\n", (long long)percpu_counter_sum(&p->in_flight[0]),
+		(long long)percpu_counter_sum(&p->in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 9f3c275..8487d70 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -118,7 +118,6 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	atomic_t in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -126,6 +125,7 @@ struct hd_struct {
 #endif
 	atomic_t ref;
 	struct rcu_head rcu_head;
+	struct percpu_counter in_flight[2];
 };
 
 #define GENHD_FL_REMOVABLE			1
@@ -330,15 +330,26 @@ static inline void part_stat_set_all(struct hd_struct *part, int value)
 
 static inline int init_part_stats(struct hd_struct *part)
 {
+	if (percpu_counter_init(&part->in_flight[0], 0))
+		return 0;
+	if (percpu_counter_init(&part->in_flight[1], 0))
+		goto err_cnt;
 	part->dkstats = alloc_percpu(struct disk_stats);
 	if (!part->dkstats)
-		return 0;
+		goto err_stats;
 	return 1;
+err_stats:
+	percpu_counter_destroy(&part->in_flight[1]);
+err_cnt:
+	percpu_counter_destroy(&part->in_flight[0]);
+	return 0;
 }
 
 static inline void free_part_stats(struct hd_struct *part)
 {
 	free_percpu(part->dkstats);
+	percpu_counter_destroy(&part->in_flight[1]);
+	percpu_counter_destroy(&part->in_flight[0]);
 }
 
 #else /* !CONFIG_SMP */
@@ -382,21 +393,21 @@ static inline void free_part_stats(struct hd_struct *part)
 
 static inline void part_inc_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_inc(&part->in_flight[rw]);
+	__percpu_counter_add(&part->in_flight[rw], 1, 1000000);
 	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+		__percpu_counter_add(&part_to_disk(part)->part0.in_flight[rw], 1, 1000000);
 }
 
 static inline void part_dec_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_dec(&part->in_flight[rw]);
+	__percpu_counter_add(&part->in_flight[rw], -1, 1000000);
 	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+		__percpu_counter_add(&part_to_disk(part)->part0.in_flight[rw], -1, 1000000);
 }
 
 static inline int part_in_flight(struct hd_struct *part)
 {
-	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
+	return percpu_counter_sum(&part->in_flight[0]) + percpu_counter_sum(&part->in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
-- 
1.9.1



* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-05-09  9:17 ` [PATCH] block: per-cpu counters for in-flight IO accounting Matias Bjørling
@ 2014-05-09 14:12   ` Jens Axboe
  2014-05-09 16:41     ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2014-05-09 14:12 UTC (permalink / raw)
  To: Matias Bjørling, sbradshaw; +Cc: linux-kernel

On 05/09/2014 03:17 AM, Matias Bjørling wrote:
> With multi-million IOPS and multi-node workloads, the atomic_t in_flight
> tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
> counters to alleviate it.

The part stats are a pain in the butt; I've tried to come up with a
great fix for them too. But I don't think the percpu conversion is
necessarily the right one. The summing is part of the hotpath, so percpu
counters aren't necessarily the right way to go. I don't have a better
answer right now, otherwise it would have been fixed :-)

There's some low-hanging fruit (like doing part_in_flight() twice in
part_round_stats_single()), though.

So I'm not going to apply this one as-is; let's see if we can find a
better solution. Perhaps local_t would be a good fit.

-- 
Jens Axboe



* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-05-09 14:12   ` Jens Axboe
@ 2014-05-09 16:41     ` Jens Axboe
  2014-05-30 12:11       ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2014-05-09 16:41 UTC (permalink / raw)
  To: Matias Bjørling, sbradshaw; +Cc: linux-kernel

On 05/09/2014 08:12 AM, Jens Axboe wrote:
> On 05/09/2014 03:17 AM, Matias Bjørling wrote:
>> With multi-million IOPS and multi-node workloads, the atomic_t in_flight
>> tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
> counters to alleviate it.
> 
> The part stats are a pain in the butt, I've tried to come up with a
> great fix for them too. But I don't think the percpu conversion is
> necessarily the right one. The summing is part of the hotpath, so percpu
> counters aren't necessarily the right way to go. I don't have a better
> answer right now, otherwise it would have been fixed :-)

Actual data point - this slows my test down ~14% compared to the stock
kernel. Also, if you experiment with this, you need to watch for the
out-of-core users of the part stats (like DM).

-- 
Jens Axboe



* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-05-09 16:41     ` Jens Axboe
@ 2014-05-30 12:11       ` Shaohua Li
  2014-05-30 13:49         ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2014-05-30 12:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Matias Bjørling, sbradshaw, linux-kernel

On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
> On 05/09/2014 08:12 AM, Jens Axboe wrote:
> > On 05/09/2014 03:17 AM, Matias Bjørling wrote:
> >> With multi-million IOPS and multi-node workloads, the atomic_t in_flight
> >> tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
> >> counters to alleviate it.
> > 
> > The part stats are a pain in the butt, I've tried to come up with a
> > great fix for them too. But I don't think the percpu conversion is
> > necessarily the right one. The summing is part of the hotpath, so percpu
> > counters aren't necessarily the right way to go. I don't have a better
> > answer right now, otherwise it would have been fixed :-)
> 
> Actual data point - this slows my test down ~14% compared to the stock
> kernel. Also, if you experiment with this, you need to watch for the
> out-of-core users of the part stats (like DM).

I had a try with Matias's patch. Performance actually boosts significantly
(there are other cache line issues though, e.g. hd_struct_get). Jens, what did
you run? part_in_flight() has 3 usages. Two are for status output, which are
cold paths. part_round_stats_single() uses it too, but it's a cold path as well,
since we only sample the data every jiffy. Are you using HZ=1000? Maybe we
should sample the data every 10ms instead of every jiffy?

Thanks,
Shaohua


* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-05-30 12:11       ` Shaohua Li
@ 2014-05-30 13:49         ` Jens Axboe
  2014-06-04 10:39           ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2014-05-30 13:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Matias Bjørling, sbradshaw, linux-kernel

On 2014-05-30 06:11, Shaohua Li wrote:
> On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
>> On 05/09/2014 08:12 AM, Jens Axboe wrote:
>>> On 05/09/2014 03:17 AM, Matias Bjørling wrote:
>>>> With multi-million IOPS and multi-node workloads, the atomic_t in_flight
>>>> tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
>>>> counters to alleviate it.
>>>
>>> The part stats are a pain in the butt, I've tried to come up with a
>>> great fix for them too. But I don't think the percpu conversion is
>>> necessarily the right one. The summing is part of the hotpath, so percpu
>>> counters aren't necessarily the right way to go. I don't have a better
>>> answer right now, otherwise it would have been fixed :-)
>>
>> Actual data point - this slows my test down ~14% compared to the stock
>> kernel. Also, if you experiment with this, you need to watch for the
>> out-of-core users of the part stats (like DM).
>
> I had a try with Matias's patch. Performance actually boosts significantly
> (there are other cache line issues though, e.g. hd_struct_get). Jens, what did
> you run? part_in_flight() has 3 usages. Two are for status output, which are
> cold paths. part_round_stats_single() uses it too, but it's a cold path as well,
> since we only sample the data every jiffy. Are you using HZ=1000? Maybe we
> should sample the data every 10ms instead of every jiffy?

I ran peak and normal benchmarks on a p320, on a 4-socket box (64 
cores). The problem is the one hot path of part_in_flight(); summing 
percpu for that is too expensive. On bigger systems than mine, it'd be 
even worse.

But the stats are definitely an issue. The part references suck too; 
as you mention, those need fixing up as well. And changing it to be 
every 10ms regardless of HZ is probably a good idea; at least then the 
granularity is fixed.

-- 
Jens Axboe



* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-05-30 13:49         ` Jens Axboe
@ 2014-06-04 10:39           ` Shaohua Li
  2014-06-04 11:29             ` Matias Bjørling
  2014-06-04 14:29             ` Jens Axboe
  0 siblings, 2 replies; 14+ messages in thread
From: Shaohua Li @ 2014-06-04 10:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Matias Bjørling, sbradshaw, linux-kernel

On Fri, May 30, 2014 at 07:49:52AM -0600, Jens Axboe wrote:
> On 2014-05-30 06:11, Shaohua Li wrote:
> >On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
> >>On 05/09/2014 08:12 AM, Jens Axboe wrote:
> >>>On 05/09/2014 03:17 AM, Matias Bjørling wrote:
> >>>>With multi-million IOPS and multi-node workloads, the atomic_t in_flight
> >>>>tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
> >>>>counters to alleviate it.
> >>>
> >>>The part stats are a pain in the butt, I've tried to come up with a
> >>>great fix for them too. But I don't think the percpu conversion is
> >>>necessarily the right one. The summing is part of the hotpath, so percpu
> >>>counters aren't necessarily the right way to go. I don't have a better
> >>>answer right now, otherwise it would have been fixed :-)
> >>
> >>Actual data point - this slows my test down ~14% compared to the stock
> >>kernel. Also, if you experiment with this, you need to watch for the
> >>out-of-core users of the part stats (like DM).
> >
> >I had a try with Matias's patch. Performance actually boosts significantly
> >(there are other cache line issues though, e.g. hd_struct_get). Jens, what did
> >you run? part_in_flight() has 3 usages. Two are for status output, which are
> >cold paths. part_round_stats_single() uses it too, but it's a cold path as well,
> >since we only sample the data every jiffy. Are you using HZ=1000? Maybe we
> >should sample the data every 10ms instead of every jiffy?
> 
> I ran peak and normal benchmarks on a p320, on a 4-socket box (64
> cores). The problem is the one hot path of part_in_flight(); summing
> percpu for that is too expensive. On bigger systems than mine, it'd
> be even worse.

I ran a null_blk test on 4 sockets; Matias's patch shows an improvement. And I didn't
find part_in_flight() called in any hot path.

Thanks,
Shaohua


* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-06-04 10:39           ` Shaohua Li
@ 2014-06-04 11:29             ` Matias Bjørling
  2014-06-04 20:08               ` Jens Axboe
  2014-06-04 14:29             ` Jens Axboe
  1 sibling, 1 reply; 14+ messages in thread
From: Matias Bjørling @ 2014-06-04 11:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jens Axboe, Sam Bradshaw (sbradshaw), LKML

It's in

blk_account_io_start
  part_round_stats
    part_round_stats_single
      part_in_flight

I like the granularity idea.

Thanks,
Matias

On Wed, Jun 4, 2014 at 12:39 PM, Shaohua Li <shli@kernel.org> wrote:
> On Fri, May 30, 2014 at 07:49:52AM -0600, Jens Axboe wrote:
>> On 2014-05-30 06:11, Shaohua Li wrote:
>> >On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
>> >>On 05/09/2014 08:12 AM, Jens Axboe wrote:
>> >>>On 05/09/2014 03:17 AM, Matias Bjørling wrote:
>> >>>>With multi-million IOPS and multi-node workloads, the atomic_t in_flight
>> >>>>tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
>> >>>>counters to alleviate it.
>> >>>
>> >>>The part stats are a pain in the butt, I've tried to come up with a
>> >>>great fix for them too. But I don't think the percpu conversion is
>> >>>necessarily the right one. The summing is part of the hotpath, so percpu
>> >>>counters aren't necessarily the right way to go. I don't have a better
>> >>>answer right now, otherwise it would have been fixed :-)
>> >>
>> >>Actual data point - this slows my test down ~14% compared to the stock
>> >>kernel. Also, if you experiment with this, you need to watch for the
>> >>out-of-core users of the part stats (like DM).
>> >
>> >I had a try with Matias's patch. Performance actually boosts significantly
>> >(there are other cache line issues though, e.g. hd_struct_get). Jens, what did
>> >you run? part_in_flight() has 3 usages. Two are for status output, which are
>> >cold paths. part_round_stats_single() uses it too, but it's a cold path as well,
>> >since we only sample the data every jiffy. Are you using HZ=1000? Maybe we
>> >should sample the data every 10ms instead of every jiffy?
>>
>> I ran peak and normal benchmarks on a p320, on a 4-socket box (64
>> cores). The problem is the one hot path of part_in_flight(); summing
>> percpu for that is too expensive. On bigger systems than mine, it'd
>> be even worse.
>
> I ran a null_blk test on 4 sockets; Matias's patch shows an improvement. And I didn't
> find part_in_flight() called in any hot path.
>
> Thanks,
> Shaohua


* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-06-04 10:39           ` Shaohua Li
  2014-06-04 11:29             ` Matias Bjørling
@ 2014-06-04 14:29             ` Jens Axboe
  1 sibling, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2014-06-04 14:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Matias Bjørling, sbradshaw, linux-kernel

On 2014-06-04 04:39, Shaohua Li wrote:
> On Fri, May 30, 2014 at 07:49:52AM -0600, Jens Axboe wrote:
>> On 2014-05-30 06:11, Shaohua Li wrote:
>>> On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
>>>> On 05/09/2014 08:12 AM, Jens Axboe wrote:
>>>>> On 05/09/2014 03:17 AM, Matias Bjørling wrote:
>>>>>> With multi-million IOPS and multi-node workloads, the atomic_t in_flight
>>>>>> tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
>>>>>> counters to alleviate it.
>>>>>
>>>>> The part stats are a pain in the butt, I've tried to come up with a
>>>>> great fix for them too. But I don't think the percpu conversion is
>>>>> necessarily the right one. The summing is part of the hotpath, so percpu
>>>>> counters aren't necessarily the right way to go. I don't have a better
>>>>> answer right now, otherwise it would have been fixed :-)
>>>>
>>>> Actual data point - this slows my test down ~14% compared to the stock
>>>> kernel. Also, if you experiment with this, you need to watch for the
>>>> out-of-core users of the part stats (like DM).
>>>
>>> I had a try with Matias's patch. Performance actually boosts significantly
>>> (there are other cache line issues though, e.g. hd_struct_get). Jens, what did
>>> you run? part_in_flight() has 3 usages. Two are for status output, which are
>>> cold paths. part_round_stats_single() uses it too, but it's a cold path as well,
>>> since we only sample the data every jiffy. Are you using HZ=1000? Maybe we
>>> should sample the data every 10ms instead of every jiffy?
>>
>> I ran peak and normal benchmarks on a p320, on a 4-socket box (64
>> cores). The problem is the one hot path of part_in_flight(); summing
>> percpu for that is too expensive. On bigger systems than mine, it'd
>> be even worse.
>
> I ran a null_blk test on 4 sockets; Matias's patch shows an improvement. And I didn't
> find part_in_flight() called in any hot path.

It's done for every IO completion, which is (by definition) a hot path. I 
tested on two devices here, and it was definitely slower. And my system 
only had a modest NR_CPUS; I suspect it'd be much worse 
on bigger systems.

-- 
Jens Axboe



* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-06-04 11:29             ` Matias Bjørling
@ 2014-06-04 20:08               ` Jens Axboe
  2014-06-05  2:09                 ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2014-06-04 20:08 UTC (permalink / raw)
  To: Matias Bjørling, Shaohua Li; +Cc: Sam Bradshaw (sbradshaw), LKML

On 06/04/2014 05:29 AM, Matias Bjørling wrote:
> It's in
> 
> blk_account_io_start
>   part_round_stats
>     part_round_stats_single
>       part_in_flight
> 
> I like the granularity idea.

And similarly from blk_account_io_done() - which makes it even worse,
since it's at both ends of the IO chain.

-- 
Jens Axboe



* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-06-04 20:08               ` Jens Axboe
@ 2014-06-05  2:09                 ` Shaohua Li
  2014-06-05  2:16                   ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2014-06-05  2:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Matias Bjørling, Sam Bradshaw (sbradshaw), LKML

On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
> On 06/04/2014 05:29 AM, Matias Bjørling wrote:
> > It's in
> > 
> > blk_account_io_start
> >   part_round_stats
> >     part_round_stats_single
> >       part_in_flight
> > 
> > I like the granularity idea.
> 
> And similarly from blk_account_io_done() - which makes it even worse,
> since it's at both ends of the IO chain.

But part_round_stats_single() is supposed to call part_in_flight() only once
per jiffy. Maybe we need something like the below:
1. set part->stamp immediately
2. use a fixed granularity
Untested though.


diff --git a/block/blk-core.c b/block/blk-core.c
index 40d6548..5f0acaa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
 	int inflight;
+	unsigned long old_stamp;
 
-	if (now == part->stamp)
+	if (time_before(now, part->stamp + msecs_to_jiffies(10)))
 		return;
+	old_stamp = part->stamp;
+	part->stamp = now;
 
 	inflight = part_in_flight(part);
 	if (inflight) {
 		__part_stat_add(cpu, part, time_in_queue,
-				inflight * (now - part->stamp));
-		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
+				inflight * (now - old_stamp));
+		__part_stat_add(cpu, part, io_ticks, (now - old_stamp));
 	}
-	part->stamp = now;
 }
 
 /**


* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-06-05  2:09                 ` Shaohua Li
@ 2014-06-05  2:16                   ` Jens Axboe
  2014-06-05  2:33                     ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2014-06-05  2:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Matias Bjørling, Sam Bradshaw (sbradshaw), LKML

On 2014-06-04 20:09, Shaohua Li wrote:
> On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
>> On 06/04/2014 05:29 AM, Matias Bjørling wrote:
>>> It's in
>>>
>>> blk_account_io_start
>>>    part_round_stats
>>>      part_round_stats_single
>>>        part_in_flight
>>>
>>> I like the granularity idea.
>>
>> And similarly from blk_account_io_done() - which makes it even worse,
>> since it's at both ends of the IO chain.
>
> But part_round_stats_single() is supposed to call part_in_flight() only once
> per jiffy. Maybe we need something like the below:
> 1. set part->stamp immediately
> 2. use a fixed granularity
> Untested though.
>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 40d6548..5f0acaa 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
>   				    unsigned long now)
>   {
>   	int inflight;
> +	unsigned long old_stamp;
>
> -	if (now == part->stamp)
> +	if (time_before(now, part->stamp + msecs_to_jiffies(10)))
>   		return;
> +	old_stamp = part->stamp;
> +	part->stamp = now;
>
>   	inflight = part_in_flight(part);
>   	if (inflight) {
>   		__part_stat_add(cpu, part, time_in_queue,
> -				inflight * (now - part->stamp));
> -		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
> +				inflight * (now - old_stamp));
> +		__part_stat_add(cpu, part, io_ticks, (now - old_stamp));
>   	}
> -	part->stamp = now;
>   }
>
>   /**

It'd be a good improvement, and one we should be able to do without 
screwing anything up. It'd be identical to anyone running at HZ==100 
right now.

So the above we can easily do, and arguably should just do. We won't see 
real scaling in the IO stats path before we fix up the hd_struct 
referencing as well, however.

-- 
Jens Axboe



* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-06-05  2:16                   ` Jens Axboe
@ 2014-06-05  2:33                     ` Shaohua Li
  2014-06-05  2:42                       ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2014-06-05  2:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Matias Bjørling, Sam Bradshaw (sbradshaw), LKML

On Wed, Jun 04, 2014 at 08:16:32PM -0600, Jens Axboe wrote:
> On 2014-06-04 20:09, Shaohua Li wrote:
> >On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
> >>On 06/04/2014 05:29 AM, Matias Bjørling wrote:
> >>>It's in
> >>>
> >>>blk_account_io_start
> >>>   part_round_stats
> >>>     part_round_stats_single
> >>>       part_in_flight
> >>>
> >>>I like the granularity idea.
> >>
> >>And similarly from blk_account_io_done() - which makes it even worse,
> >>since it's at both ends of the IO chain.
> >
> >But part_round_stats_single() is supposed to call part_in_flight() only once
> >per jiffy. Maybe we need something like the below:
> >1. set part->stamp immediately
> >2. use a fixed granularity
> >Untested though.
> >
> >
> >diff --git a/block/blk-core.c b/block/blk-core.c
> >index 40d6548..5f0acaa 100644
> >--- a/block/blk-core.c
> >+++ b/block/blk-core.c
> >@@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
> >  				    unsigned long now)
> >  {
> >  	int inflight;
> >+	unsigned long old_stamp;
> >
> >-	if (now == part->stamp)
> >+	if (time_before(now, part->stamp + msecs_to_jiffies(10)))
> >  		return;
> >+	old_stamp = part->stamp;
> >+	part->stamp = now;
> >
> >  	inflight = part_in_flight(part);
> >  	if (inflight) {
> >  		__part_stat_add(cpu, part, time_in_queue,
> >-				inflight * (now - part->stamp));
> >-		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
> >+				inflight * (now - old_stamp));
> >+		__part_stat_add(cpu, part, io_ticks, (now - old_stamp));
> >  	}
> >-	part->stamp = now;
> >  }
> >
> >  /**
> 
> It'd be a good improvement, and one we should be able to do without
> screwing anything up. It'd be identical to anyone running at HZ==100
> right now.
> 
> So the above we can easily do, and arguably should just do. We won't
> see real scaling in the IO stats path before we fix up the hd_struct
> referencing as well, however.

That's true. Maybe a percpu_ref works here.

Thanks,
Shaohua


* Re: [PATCH] block: per-cpu counters for in-flight IO accounting
  2014-06-05  2:33                     ` Shaohua Li
@ 2014-06-05  2:42                       ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2014-06-05  2:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Matias Bjørling, Sam Bradshaw (sbradshaw), LKML

On 2014-06-04 20:33, Shaohua Li wrote:
> On Wed, Jun 04, 2014 at 08:16:32PM -0600, Jens Axboe wrote:
>> On 2014-06-04 20:09, Shaohua Li wrote:
>>> On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
>>>> On 06/04/2014 05:29 AM, Matias Bjørling wrote:
>>>>> It's in
>>>>>
>>>>> blk_account_io_start
>>>>>    part_round_stats
>>>>>      part_round_stats_single
>>>>>        part_in_flight
>>>>>
>>>>> I like the granularity idea.
>>>>
>>>> And similarly from blk_account_io_done() - which makes it even worse,
>>>> since it's at both ends of the IO chain.
>>>
>>> But part_round_stats_single() is supposed to call part_in_flight() only once
>>> per jiffy. Maybe we need something like the below:
>>> 1. set part->stamp immediately
>>> 2. use a fixed granularity
>>> Untested though.
>>>
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 40d6548..5f0acaa 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
>>>   				    unsigned long now)
>>>   {
>>>   	int inflight;
>>> +	unsigned long old_stamp;
>>>
>>> -	if (now == part->stamp)
>>> +	if (time_before(now, part->stamp + msecs_to_jiffies(10)))
>>>   		return;
>>> +	old_stamp = part->stamp;
>>> +	part->stamp = now;
>>>
>>>   	inflight = part_in_flight(part);
>>>   	if (inflight) {
>>>   		__part_stat_add(cpu, part, time_in_queue,
>>> -				inflight * (now - part->stamp));
>>> -		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
>>> +				inflight * (now - old_stamp));
>>> +		__part_stat_add(cpu, part, io_ticks, (now - old_stamp));
>>>   	}
>>> -	part->stamp = now;
>>>   }
>>>
>>>   /**
>>
>> It'd be a good improvement, and one we should be able to do without
>> screwing anything up. It'd be identical to anyone running at HZ==100
>> right now.
>>
>> So the above we can easily do, and arguably should just do. We won't
>> see real scaling in the IO stats path before we fix up the hd_struct
>> referencing as well, however.
>
> That's true. Maybe a percpu_ref works here.

Maybe, but it would require more than a direct replacement. The 
hd_struct stuff currently relies on things like atomic_inc_not_zero(), 
which would not be cheap to do. And this does happen for every new IO, 
so it can't be amortized over time like the part stats rounding.

-- 
Jens Axboe


