From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Subject: Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
To: Jens Axboe , Ming Lei 
Cc: linux-block , "open list:DEVICE-MAPPER (LVM)" , Alasdair Kergon , Mike Snitzer 
References: <20170628211010.4C8C9124035@b01ledav002.gho.pok.ibm.com>
 <7f0a852e-5f90-4c63-9a43-a4180557530c@kernel.dk>
 <07ba10a8-6369-c1bc-dc9a-b550d9394c22@kernel.dk>
 <8f4ff428-e158-0df5-cf54-ae3cdea7ad1f@kernel.dk>
From: Brian King 
Date: Fri, 30 Jun 2017 08:05:06 -0500
MIME-Version: 1.0
In-Reply-To: 
Content-Type: text/plain; charset=utf-8
Message-Id: <3729482c-fcca-2af5-4d05-7e44bcd71159@linux.vnet.ibm.com>
List-ID: 

On 06/29/2017 09:17 PM, Jens Axboe wrote:
> On 06/29/2017 07:20 PM, Ming Lei wrote:
>> On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe wrote:
>>> On 06/29/2017 10:00 AM, Jens Axboe wrote:
>>>> On 06/29/2017 09:58 AM, Jens Axboe wrote:
>>>>> On 06/29/2017 02:40 AM, Ming Lei wrote:
>>>>>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
>>>>>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>>>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>>>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>>>>>> of atomics from the hot path. When running this on a Power system, to
>>>>>>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>>>>>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>>>>>>>
>>>>>>> This has been done before, but I've never really liked it. The reason is
>>>>>>> that it means that reading the part stat inflight count now has to
>>>>>>> iterate over every possible CPU. Did you use partitions in your testing?
>>>>>>> How many CPUs were configured? When I last tested this a few years ago
>>>>>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>>>>>> latencies), it was a net loss.
>>>>>>
>>>>>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>>>>>> of non-RQF_IO_STAT on a dual socket ARM64 (each CPU has
>>>>>> 96 cores, and dual numa nodes) too. The performance can be
>>>>>> recovered basically if a per numa-node counter is introduced and
>>>>>> used in this case, but the patch was never posted out.
>>>>>> If anyone is interested in that, I can rebase the patch on current
>>>>>> block tree and post it out. I guess the performance issue might be
>>>>>> related with the system cache coherency implementation more or less.
>>>>>> This issue on ARM64 can be observed with the following userspace
>>>>>> atomic counting test too:
>>>>>>
>>>>>> http://kernel.ubuntu.com/~ming/test/cache/
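For a rough sense of the effect being described (this is not the linked
test; the thread count, iteration count, and 64-byte padding below are
made-up values), a minimal userspace comparison of one shared atomic
counter against per-thread, cache-line-padded counters could look like
the following, built with gcc -O2 -pthread:

/*
 * Illustrative only.  Compares one shared counter, bounced between all
 * threads, against per-thread counters padded to separate cache lines.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define ITERS    (10 * 1000 * 1000L)

static long shared_counter;

struct padded { long v; char pad[64 - sizeof(long)]; };
static struct padded private_counter[NTHREADS];

static void *bump_shared(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++)
		__atomic_fetch_add(&shared_counter, 1, __ATOMIC_RELAXED);
	return NULL;
}

static void *bump_private(void *arg)
{
	struct padded *c = arg;	/* each thread gets its own slot */
	for (long i = 0; i < ITERS; i++)
		__atomic_fetch_add(&c->v, 1, __ATOMIC_RELAXED);
	return NULL;
}

static double run(void *(*fn)(void *), int use_private)
{
	pthread_t tid[NTHREADS];
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, fn,
			       use_private ? (void *)&private_counter[i] : NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	printf("shared counter : %.2fs\n", run(bump_shared, 0));
	printf("padded per-CPU : %.2fs\n", run(bump_private, 1));
	return 0;
}

On a multi-socket machine the shared-counter run is typically several
times slower; that cache-line bouncing is the same cost the in_flight
atomics pay on the I/O hot path.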
>>>>> How well did the per-node thing work? Doesn't seem to me like it would
>>>>> go far enough. And per CPU is too much. One potential improvement would
>>>>> be to change the part_stat_read() to just loop online CPUs, instead of
>>>>> all possible CPUs. When CPUs go on/offline, use that as the slow path to
>>>>> ensure the stats are sane. Often there's a huge difference between
>>>>> NR_CPUS configured and what the system has. As Brian states, RH ships
>>>>> with 2048, while I doubt a lot of customers actually run that...
>>>>>
>>>>> Outside of coming up with a more clever data structure that is fully
>>>>> CPU topology aware, one thing that could work is just having X cache
>>>>> line separated read/write inflight counters per node, where X is some
>>>>> suitable value (like 4). That prevents us from having cross node
>>>>> traffic, and it also keeps the cross cpu traffic fairly low. That should
>>>>> provide a nice balance between the cost of incrementing the inflight
>>>>> counters, and the cost of looping for reading them.
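To make that layout concrete, here is a small userspace sketch of the
idea; NODES and SLOTS_PER_NODE are invented values, and a kernel
version would size them from the real topology and map each caller to
its own node:

/*
 * Sketch of "X cache-line separated read/write inflight counters per
 * node".  Slot selection by cpu % SLOTS_PER_NODE is just one choice.
 */
#include <stdio.h>

#define NODES          2
#define SLOTS_PER_NODE 4	/* the "X" above */

struct inflight_slot {
	long read;
	long write;
} __attribute__((aligned(64)));	/* keep slots on separate cache lines */

static struct inflight_slot slots[NODES][SLOTS_PER_NODE];

/* inc/dec only ever touch a slot on the caller's node */
static void inc_in_flight(int cpu, int node, int rw)
{
	long *c = rw ? &slots[node][cpu % SLOTS_PER_NODE].write
		     : &slots[node][cpu % SLOTS_PER_NODE].read;
	__atomic_fetch_add(c, 1, __ATOMIC_RELAXED);
}

static void dec_in_flight(int cpu, int node, int rw)
{
	long *c = rw ? &slots[node][cpu % SLOTS_PER_NODE].write
		     : &slots[node][cpu % SLOTS_PER_NODE].read;
	__atomic_fetch_sub(c, 1, __ATOMIC_RELAXED);
}

/* reading sums NODES * SLOTS_PER_NODE entries, not one per possible CPU */
static long in_flight(int rw)
{
	long sum = 0;

	for (int n = 0; n < NODES; n++)
		for (int s = 0; s < SLOTS_PER_NODE; s++)
			sum += rw ? slots[n][s].write : slots[n][s].read;
	return sum;
}

int main(void)
{
	inc_in_flight(0, 0, 1);
	inc_in_flight(9, 1, 1);
	dec_in_flight(3, 0, 1);	/* may hit a different slot; the sum stays right */
	printf("writes in flight: %ld\n", in_flight(1));
	return 0;
}

The update side never leaves its own node, and the read side walks
NODES * SLOTS_PER_NODE entries rather than one entry per possible CPU,
which is the balance described above.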
>>>>> And that brings me to the next part...
>>>>>
>>>>>>> I do agree that we should do something about it, and it's one of those
>>>>>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>>>>>> up. It's just not great as it currently stands, but I don't think per
>>>>>>> CPU counters is the right way to fix it, at least not for the inflight
>>>>>>> counter.
>>>>>>
>>>>>> Yeah, it won't be an issue for the non-mq path, and for the blk-mq path,
>>>>>> maybe we can use some blk-mq knowledge (tagset?) to figure out the
>>>>>> 'in_flight' counter. I thought about it before, but never got a
>>>>>> perfect solution, and it looks a bit hard, :-)
>>>>>
>>>>> The tags are already a bit spread out, so it's worth a shot. That would
>>>>> remove the need to do anything in the inc/dec path, as the tags already
>>>>> do that. The inflight count could be easily retrieved with
>>>>> sbitmap_weight(). The only issue here is that we need separate read and
>>>>> write counters, and the weight would obviously only get us the total
>>>>> count. But we can have a slower path for that, just iterate the tags and
>>>>> count them. The fast path only cares about total count.
>>>>>
>>>>> Let me try that out real quick.
>>>>
>>>> Well, that only works for whole disk stats, of course... There's no way
>>>> around iterating the tags and checking for this to truly work.
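The proof-of-concept diff that follows implements the per-partition
slow path with a tag iterator. As a self-contained toy model of the
fast/slow split sketched above (this is not the blk-mq code; the tag
count, bitmap, and request/partition types are invented for the
example):

/*
 * Toy model of deriving "in flight" from tag allocation state instead
 * of a dedicated counter.  blk-mq keeps this state in sbitmaps and in
 * struct request; here it is just an array of bitmap words.
 */
#include <stdio.h>
#include <stdint.h>

#define TAGS 128

struct request { int part; };			/* partition owning the request */

static uint64_t tag_bitmap[TAGS / 64];		/* bit set => tag allocated */
static struct request rqs[TAGS];		/* tag -> request mapping */

/* fast path: whole-device count is just the bitmap weight */
static unsigned int in_flight_total(void)
{
	unsigned int count = 0;

	for (int w = 0; w < TAGS / 64; w++)
		count += __builtin_popcountll(tag_bitmap[w]);
	return count;
}

/* slow path: a per-partition count has to walk the busy tags */
static unsigned int in_flight_part(int part)
{
	unsigned int count = 0;

	for (int tag = 0; tag < TAGS; tag++)
		if (((tag_bitmap[tag / 64] >> (tag % 64)) & 1) &&
		    rqs[tag].part == part)
			count++;
	return count;
}

static void start_request(int tag, int part)
{
	tag_bitmap[tag / 64] |= 1ULL << (tag % 64);
	rqs[tag].part = part;
}

int main(void)
{
	start_request(3, 0);
	start_request(70, 1);
	start_request(71, 1);
	printf("total=%u part1=%u\n", in_flight_total(), in_flight_part(1));
	return 0;
}

The attraction is that the submit and completion paths already maintain
the tag state, so nothing extra is written per I/O; all of the cost
moves to the reader.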
>>> Totally untested proof of concept for using the tags for this. I based
>>> this on top of Brian's patch, so it includes his patch plus the
>>> _double() stuff I did which is no longer really needed.
>>>
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index 9cf98b29588a..ec99d9ba0f33 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long sectors,
>>>  	part_round_stats(cpu, part);
>>>  	part_stat_inc(cpu, part, ios[rw]);
>>>  	part_stat_add(cpu, part, sectors[rw], sectors);
>>> -	part_inc_in_flight(part, rw);
>>> +	part_inc_in_flight(cpu, part, rw);
>>>
>>>  	part_stat_unlock();
>>>  }
>>> @@ -1751,7 +1751,7 @@ void generic_end_io_acct(int rw, struct hd_struct *part,
>>>
>>>  	part_stat_add(cpu, part, ticks[rw], duration);
>>>  	part_round_stats(cpu, part);
>>> -	part_dec_in_flight(part, rw);
>>> +	part_dec_in_flight(cpu, part, rw);
>>>
>>>  	part_stat_unlock();
>>>  }
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index af393d5a9680..6ab2efbe940b 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2434,8 +2434,13 @@ void blk_account_io_done(struct request *req)
>>>
>>>  		part_stat_inc(cpu, part, ios[rw]);
>>>  		part_stat_add(cpu, part, ticks[rw], duration);
>>> -		part_round_stats(cpu, part);
>>> -		part_dec_in_flight(part, rw);
>>> +
>>> +		if (req->q->mq_ops)
>>> +			part_round_stats_mq(req->q, cpu, part);
>>> +		else {
>>> +			part_round_stats(cpu, part);
>>> +			part_dec_in_flight(cpu, part, rw);
>>> +		}
>>>
>>>  		hd_struct_put(part);
>>>  		part_stat_unlock();
>>> @@ -2492,8 +2497,12 @@ void blk_account_io_start(struct request *rq, bool new_io)
>>>  			part = &rq->rq_disk->part0;
>>>  			hd_struct_get(part);
>>>  		}
>>> -		part_round_stats(cpu, part);
>>> -		part_inc_in_flight(part, rw);
>>> +		if (rq->q->mq_ops)
>>> +			part_round_stats_mq(rq->q, cpu, part);
>>> +		else {
>>> +			part_round_stats(cpu, part);
>>> +			part_inc_in_flight(cpu, part, rw);
>>> +		}
>>>  		rq->part = part;
>>>  	}
>>>
>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>> index 99038830fb42..3b5eb2d4b964 100644
>>> --- a/block/blk-merge.c
>>> +++ b/block/blk-merge.c
>>> @@ -634,7 +634,7 @@ static void blk_account_io_merge(struct request *req)
>>>  		part = req->part;
>>>
>>>  		part_round_stats(cpu, part);
>>> -		part_dec_in_flight(part, rq_data_dir(req));
>>> +		part_dec_in_flight(cpu, part, rq_data_dir(req));
>>>
>>>  		hd_struct_put(part);
>>>  		part_stat_unlock();
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index d0be72ccb091..a7b897740c47 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>>  		bitnr += tags->nr_reserved_tags;
>>>  	rq = tags->rqs[bitnr];
>>>
>>> -	if (rq->q == hctx->queue)
>>> +	if (rq && rq->q == hctx->queue)
>>>  		iter_data->fn(hctx, rq, iter_data->data, reserved);
>>>  	return true;
>>>  }
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 05dfa3f270ae..cad4d2c26285 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -43,6 +43,58 @@ static LIST_HEAD(all_q_list);
>>>  static void blk_mq_poll_stats_start(struct request_queue *q);
>>>  static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
>>>
>>> +struct mq_inflight {
>>> +	struct hd_struct *part;
>>> +	unsigned int inflight;
>>> +};
>>> +
>>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>>> +				  struct request *rq, void *priv,
>>> +				  bool reserved)
>>> +{
>>> +	struct mq_inflight *mi = priv;
>>> +
>>> +	if (rq->part == mi->part &&
>>> +	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>>> +		mi->inflight++;
>>> +}
>>> +
>>> +unsigned long part_in_flight_mq(struct request_queue *q,
>>> +				struct hd_struct *part)
>>> +{
>>> +	struct mq_inflight mi = { .part = part, .inflight = 0 };
>>> +
>>> +	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
>>> +	return mi.inflight;
>>> +}
>>
>> Compared with the totally percpu approach, this way might help 1:M or
>> N:M mapping, but won't help 1:1 mapping (NVMe), when each hctx is mapped
>> to one CPU (especially when there are huge hw queues on a big system), :-(
>
> Not disagreeing with that, without having some mechanism to only
> loop queues that have pending requests. That would be similar to the
> ctx_map for sw to hw queues. But I don't think that would be worthwhile
> doing, I like your pnode approach better. However, I'm still not fully
> convinced that one per node is enough to get the scalability we need.
>
> Would be great if Brian could re-test with your updated patch, so we
> know how it works for him at least.

I'll try running with both approaches today and see how they compare.

Thanks!!!

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center