* Re: [Query] increased latency observed in cpu hotplug path
       [not found]     ` <faa0cfa9-46e5-f580-654c-5ec7dcc8f977@codeaurora.org>
@ 2016-09-21 14:06       ` Khan, Imran
  2016-09-21 15:56         ` Akinobu Mita
  0 siblings, 1 reply; 8+ messages in thread
From: Khan, Imran @ 2016-09-21 14:06 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Jens Axboe, Christoph Hellwig, tsoni, kaushalk, linux-arm-msm

On 9/16/2016 11:12 AM, Khan, Imran wrote:
> On 9/15/2016 8:44 PM, Khan, Imran wrote:
>> On 8/23/2016 9:17 PM, Khan, Imran wrote:
>>> On 8/20/2016 8:14 AM, Akinobu Mita wrote:
>>>> 2016-08-18 1:47 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:
>>>>> On 8/16/2016 11:23 AM, Khan, Imran wrote:
>>>>>> On 8/5/2016 12:49 PM, Khan, Imran wrote:
>>>>>>> On 8/1/2016 2:58 PM, Khan, Imran wrote:
>>>>>>>> On 7/30/2016 7:54 AM, Akinobu Mita wrote:
>>>>>>>>> 2016-07-28 22:18 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Recently we have observed increased latency in the CPU hotplug
>>>>>>>>>> path, specifically when a CPU is brought online. For the online
>>>>>>>>>> latency we see that the block layer executes its notification
>>>>>>>>>> handler for the CPU_UP_PREPARE event, and this in turn waits for
>>>>>>>>>> an RCU grace period, resulting (sometimes) in an execution time
>>>>>>>>>> of 15-20 ms for this notification handler. This behaviour was not
>>>>>>>>>> present in the 3.18 kernel but is present in the 4.4 kernel and
>>>>>>>>>> was introduced by the following commit:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
>>>>>>>>>> Author: Akinobu Mita <akinobu.mita@gmail.com>
>>>>>>>>>> Date:   Sun Sep 27 02:09:23 2015 +0900
>>>>>>>>>>
>>>>>>>>>>     blk-mq: avoid inserting requests before establishing new mapping
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>> Upon reverting this commit I could see an improvement of 15-20 ms in
>>>>>>>>>> online latency. So I am looking for some help in analyzing the effects
>>>>>>>>>> of reverting this commit, or in deciding whether some other approach
>>>>>>>>>> should be taken to reduce the online latency.
>>>>>>>>>
>>>>>>>>> Can you observe the difference in online latency by removing the
>>>>>>>>> get_online_cpus() and put_online_cpus() pair in blk_mq_init_allocated_queue()
>>>>>>>>> instead of fully reverting the commit?
>>>>>>>>>
>>>>>>>> Hi Akinobu,
>>>>>>>> I tried your suggestion but could not achieve any improvement. The snippet that is actually causing the change in latency is the following one:
>>>>>>>>
>>>>>>>> list_for_each_entry(q, &all_q_list, all_q_node) {
>>>>>>>>                 blk_mq_freeze_queue_wait(q);
>>>>>>>>
>>>>>>>>                 /*
>>>>>>>>                  * timeout handler can't touch hw queue during the
>>>>>>>>                  * reinitialization
>>>>>>>>                  */
>>>>>>>>                 del_timer_sync(&q->timeout);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> I understand that this is now getting executed for CPU_UP_PREPARE as well, resulting in
>>>>>>>> increased latency in the CPU online path. I am trying to reduce this latency while keeping the
>>>>>>>> purpose of this commit intact. I would welcome further suggestions/feedback in this regard.
>>>>>>>>
>>>>>>> Hi Akinobu,
>>>>>>>
>>>>>>> I am not able to reduce the CPU online latency with this patch. Could you please let me know
>>>>>>> what functionality will break if we leave this patch out of our kernel? Also, if you have other
>>>>>>> suggestions for improving this patch, please let me know.
>>>>>>>
>>>>>> After moving the remapping of queues to the block layer's kworker, I see that the online latency
>>>>>> has improved while the offline latency remains the same. As the freezing of queues happens in the
>>>>>> context of the block layer's worker, I think it would be better to do the remapping in the same
>>>>>> context and then go ahead with the freezing. To that end I have made the following change:
>>>>>>
>>>>>> commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624
>>>>>> Author: Imran Khan <kimran@codeaurora.org>
>>>>>> Date:   Fri Aug 12 19:59:47 2016 +0530
>>>>>>
>>>>>>     blk-mq: Move block queue remapping from cpu hotplug path
>>>>>>
>>>>>>     During a cpu hotplug, the hardware and software context mappings
>>>>>>     need to be updated in order to take into account requests
>>>>>>     submitted for the hotadded CPU. But if this remapping is done
>>>>>>     in the hotplug notifier, it increases the hotplug latency.
>>>>>>     So move the block queue remapping to the block layer worker, which
>>>>>>     results in significant improvements in hotplug latency.
>>>>>>
>>>>>>     Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4
>>>>>>     Signed-off-by: Imran Khan <kimran@codeaurora.org>
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index 6d6f8fe..06fcf89 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -22,7 +22,9 @@
>>>>>>  #include <linux/sched/sysctl.h>
>>>>>>  #include <linux/delay.h>
>>>>>>  #include <linux/crash_dump.h>
>>>>>> -
>>>>>> +#include <linux/kthread.h>
>>>>>> +#include <linux/mutex.h>
>>>>>> +#include <linux/sched.h>
>>>>>>  #include <trace/events/block.h>
>>>>>>
>>>>>>  #include <linux/blk-mq.h>
>>>>>> @@ -32,10 +36,18 @@
>>>>>>
>>>>>>  static DEFINE_MUTEX(all_q_mutex);
>>>>>>  static LIST_HEAD(all_q_list);
>>>>>> -
>>>>>>  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx);
>>>>>>
>>>>>>  /*
>>>>>> + * New online cpumask which is going to be set in this hotplug event.
>>>>>> + * Declare this cpumasks as global as cpu-hotplug operation is invoked
>>>>>> + * one-by-one and dynamically allocating this could result in a failure.
>>>>>> + */
>>>>>> +static struct cpumask online_new;
>>>>>> +
>>>>>> +static struct work_struct blk_mq_remap_work;
>>>>>> +
>>>>>> +/*
>>>>>>   * Check if any of the ctx's have pending work in this hardware queue
>>>>>>   */
>>>>>>  static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
>>>>>> @@ -2125,14 +2137,7 @@ static void blk_mq_queue_reinit(struct request_queue *q,
>>>>>>  static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>>>>>                                       unsigned long action, void *hcpu)
>>>>>>  {
>>>>>> -       struct request_queue *q;
>>>>>>         int cpu = (unsigned long)hcpu;
>>>>>> -       /*
>>>>>> -        * New online cpumask which is going to be set in this hotplug event.
>>>>>> -        * Declare this cpumasks as global as cpu-hotplug operation is invoked
>>>>>> -        * one-by-one and dynamically allocating this could result in a failure.
>>>>>> -        */
>>>>>> -       static struct cpumask online_new;
>>>>>>
>>>>>>         /*
>>>>>>          * Before hotadded cpu starts handling requests, new mappings must
>>>>>> @@ -2155,43 +2160,17 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>>>>>         case CPU_DEAD:
>>>>>>         case CPU_UP_CANCELED:
>>>>>>                 cpumask_copy(&online_new, cpu_online_mask);
>>>>>> +               kblockd_schedule_work(&blk_mq_remap_work);
>>>>>>                 break;
>>>>>>         case CPU_UP_PREPARE:
>>>>>>                 cpumask_copy(&online_new, cpu_online_mask);
>>>>>>                 cpumask_set_cpu(cpu, &online_new);
>>>>>> +               kblockd_schedule_work(&blk_mq_remap_work);
>>>>>>                 break;
>>>>>>         default:
>>>>>>                 return NOTIFY_OK;
>>>>>>         }
>>>>>>
>>>>>> -       mutex_lock(&all_q_mutex);
>>>>>> -
>>>>>> -       /*
>>>>>> -        * We need to freeze and reinit all existing queues.  Freezing
>>>>>> -        * involves synchronous wait for an RCU grace period and doing it
>>>>>> -        * one by one may take a long time.  Start freezing all queues in
>>>>>> -        * one swoop and then wait for the completions so that freezing can
>>>>>> -        * take place in parallel.
>>>>>> -        */
>>>>>> -       list_for_each_entry(q, &all_q_list, all_q_node)
>>>>>> -               blk_mq_freeze_queue_start(q);
>>>>>> -       list_for_each_entry(q, &all_q_list, all_q_node) {
>>>>>> -               blk_mq_freeze_queue_wait(q);
>>>>>> -
>>>>>> -               /*
>>>>>> -                * timeout handler can't touch hw queue during the
>>>>>> -                * reinitialization
>>>>>> -                */
>>>>>> -               del_timer_sync(&q->timeout);
>>>>>> -       }
>>>>>> -
>>>>>> -       list_for_each_entry(q, &all_q_list, all_q_node)
>>>>>> -               blk_mq_queue_reinit(q, &online_new);
>>>>>> -
>>>>>> -       list_for_each_entry(q, &all_q_list, all_q_node)
>>>>>> -               blk_mq_unfreeze_queue(q);
>>>>>> -
>>>>>> -       mutex_unlock(&all_q_mutex);
>>>>>>         return NOTIFY_OK;
>>>>>>  }
>>>>>>
>>>>>> @@ -2357,11 +2336,48 @@ void blk_mq_enable_hotplug(void)
>>>>>>         mutex_unlock(&all_q_mutex);
>>>>>>  }
>>>>>>
>>>>>> +static void blk_remap_work(struct work_struct *work)
>>>>>> +{
>>>>>> +       struct request_queue *q;
>>>>>> +
>>>>>> +       /* Start the task of queue remapping */
>>>>>> +       mutex_lock(&all_q_mutex);
>>>>>> +
>>>>>> +       /*
>>>>>> +        * We need to freeze and reinit all existing queues.  Freezing
>>>>>> +        * involves synchronous wait for an RCU grace period and doing it
>>>>>> +        * one by one may take a long time.  Start freezing all queues in
>>>>>> +        * one swoop and then wait for the completions so that freezing can
>>>>>> +        * take place in parallel.
>>>>>> +        */
>>>>>> +       list_for_each_entry(q, &all_q_list, all_q_node)
>>>>>> +               blk_mq_freeze_queue_start(q);
>>>>>> +       list_for_each_entry(q, &all_q_list, all_q_node) {
>>>>>> +               blk_mq_freeze_queue_wait(q);
>>>>>> +
>>>>>> +               /*
>>>>>> +                * timeout handler can't touch hw queue during the
>>>>>> +                * reinitialization
>>>>>> +                */
>>>>>> +               del_timer_sync(&q->timeout);
>>>>>> +       }
>>>>>> +
>>>>>> +       list_for_each_entry(q, &all_q_list, all_q_node)
>>>>>> +               blk_mq_queue_reinit(q, &online_new);
>>>>>> +
>>>>>> +       list_for_each_entry(q, &all_q_list, all_q_node)
>>>>>> +               blk_mq_unfreeze_queue(q);
>>>>>> +
>>>>>> +       mutex_unlock(&all_q_mutex);
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>>  static int __init blk_mq_init(void)
>>>>>>  {
>>>>>>         blk_mq_cpu_init();
>>>>>>
>>>>>>         hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
>>>>>> +       INIT_WORK(&blk_mq_remap_work, blk_remap_work);
>>>>>>
>>>>>>         return 0;
>>>>>>  }
>>>>>>
>>>>>> I have run overnight tests of fsstress and multiple dd commands along with cpu hotplugging.
>>>>>> Could you please review this change and let me know if you see any issues with it, or whether
>>>>>> I need to test some specific corner cases to validate it?
>>>>>>
>>>>> Could someone kindly review this change
>>>>> and provide feedback?
>>>>
>>>> There is no flush_work() call for blk_remap_work.  But I suppose that
>>>> this work must be finished before the newly onlined CPU actually starts
>>>> running.
>>>>
>>> I will make this correction and test again. Regarding the second point, do we
>>> need to ensure that this work finishes before the newly onlined CPU starts running, or
>>> is it enough that we finish the pending work before entertaining any new requests for the ctx
>>> corresponding to the newly onlined CPU? If it is the latter, do you see some problem
>>> with the current approach? Also, could you kindly suggest some test cases that would help
>>> in finding the corner cases.
>>>
>> I tried to make the flush_work() for blk_remap_work complete before the newly onlined CPU
>> starts running, but ensuring the completion of flush_work results in similar latency
>> issues. As the part that contributes most to the latency, i.e. the freezing of the queues,
>> happens because we need to perform a remapping, I think we can avoid freezing for CPU
>> hotplug events if we can make a static mapping work. The one problem I see with this
>> solution is that a h/w ctx may be mapped to a particular CPU which may go down
>> and never come up, making that h/w ctx useless, but we can keep this persistent
>> mapping feature limited to specific use cases, depending on some kernel config flag.
>> I have done some tests with the below patch on a null-blk device, with different numbers of
>> h/w ctxs and possible/online CPUs. Could someone please review the patch and provide some
>> feedback?
>>
>> commit 084ee793ec1ff4e2171b481642bfbdaa2e10664c
>> Author: Imran Khan <kimran@codeaurora.org>
>> Date:   Thu Sep 15 16:44:02 2016 +0530
>>
>>     blk-mq: use static mapping
>>     
>>     blk-mq layer performs a remapping between s/w and h/w contexts and also
>>     between h/w contexts and CPUs, whenever a CPU hotplug event happens.
>>     This remapping has to wait for queue freezing which may take tens of
>>     milliseconds, resulting in high latency in the CPU hotplug path.
>>     This patch makes the above-mentioned mappings static so that we can
>>     avoid remapping when a CPU hotplug event happens, and this results in
>>     improved CPU hotplug latency.
>>     
>>     Change-Id: Idf38cb6c4e78c91fda3c86608c6d0441f01ab435
>>     Signed-off-by: Imran Khan <kimran@codeaurora.org>
>>
>> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
>> index 8764c24..85fdb88 100644
>> --- a/block/blk-mq-cpumap.c
>> +++ b/block/blk-mq-cpumap.c
>> @@ -52,18 +52,13 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
>>  
>>  	queue = 0;
>>  	for_each_possible_cpu(i) {
>> -		if (!cpumask_test_cpu(i, online_mask)) {
>> -			map[i] = 0;
>> -			continue;
>> -		}
>> -
>>  		/*
>>  		 * Easy case - we have equal or more hardware queues. Or
>>  		 * there are no thread siblings to take into account. Do
>>  		 * 1:1 if enough, or sequential mapping if less.
>>  		 */
>> -		if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
>> -			map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
>> +		if (nr_queues >= nr_cpu_ids) {
>> +			map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues, queue);
>>  			queue++;
>>  			continue;
>>  		}
>> @@ -75,7 +70,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
>>  		 */
>>  		first_sibling = get_first_sibling(i);
>>  		if (first_sibling == i) {
>> -			map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
>> +			map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues,
>>  							queue);
>>  			queue++;
>>  		} else
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 6d6f8fe..0753fdf 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1783,10 +1783,6 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>>  		INIT_LIST_HEAD(&__ctx->rq_list);
>>  		__ctx->queue = q;
>>  
>> -		/* If the cpu isn't online, the cpu is mapped to first hctx */
>> -		if (!cpu_online(i))
>> -			continue;
>> -
>>  		hctx = q->mq_ops->map_queue(q, i);
>>  
>>  		/*
>> @@ -1820,12 +1816,9 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>>  	 * Map software to hardware queues
>>  	 */
>>  	queue_for_each_ctx(q, ctx, i) {
>> -		/* If the cpu isn't online, the cpu is mapped to first hctx */
>> -		if (!cpumask_test_cpu(i, online_mask))
>> -			continue;
>> -
>>  		hctx = q->mq_ops->map_queue(q, i);
>> -		cpumask_set_cpu(i, hctx->cpumask);
>> +		if (cpumask_test_cpu(i, online_mask))
>> +			cpumask_set_cpu(i, hctx->cpumask);
>>  		ctx->index_hw = hctx->nr_ctx;
>>  		hctx->ctxs[hctx->nr_ctx++] = ctx;
>>  	}
>> @@ -1863,17 +1856,22 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>>  
>>  		/*
>>  		 * Initialize batch roundrobin counts
>> +		 * Set next_cpu only for those hctxs that have an online CPU
>> +		 * in their cpumask field. For hctxs that span a mix of online
>> +		 * and offline CPUs, this will always pick one of the online
>> +		 * ones. For hctxs whose CPUs are all offline, the cpumask
>> +		 * will be updated in reinit_notify.
>>  		 */
>> -		hctx->next_cpu = cpumask_first(hctx->cpumask);
>> -		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>> +		if (cpumask_first(hctx->cpumask) < nr_cpu_ids) {
>> +			hctx->next_cpu = cpumask_first(hctx->cpumask);
>> +			hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>> +		}
>>  	}
>>  
>>  	queue_for_each_ctx(q, ctx, i) {
>> -		if (!cpumask_test_cpu(i, online_mask))
>> -			continue;
>> -
>>  		hctx = q->mq_ops->map_queue(q, i);
>> -		cpumask_set_cpu(i, hctx->tags->cpumask);
>> +		if (cpumask_test_cpu(i, online_mask))
>> +			cpumask_set_cpu(i, hctx->tags->cpumask);
>>  	}
>>  }
>>  
>> @@ -2101,31 +2099,12 @@ void blk_mq_free_queue(struct request_queue *q)
>>  	blk_mq_free_hw_queues(q, set);
>>  }
>>  
>> -/* Basically redo blk_mq_init_queue with queue frozen */
>> -static void blk_mq_queue_reinit(struct request_queue *q,
>> -				const struct cpumask *online_mask)
>> -{
>> -	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
>> -
>> -	blk_mq_sysfs_unregister(q);
>> -
>> -	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
>> -
>> -	/*
>> -	 * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
>> -	 * we should change hctx numa_node according to new topology (this
>> -	 * involves free and re-allocate memory, worthy doing?)
>> -	 */
>> -
>> -	blk_mq_map_swqueue(q, online_mask);
>> -
>> -	blk_mq_sysfs_register(q);
>> -}
>> -
>>  static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>  				      unsigned long action, void *hcpu)
>>  {
>>  	struct request_queue *q;
>> +	struct blk_mq_hw_ctx *hctx;
>> +	int i;
>>  	int cpu = (unsigned long)hcpu;
>>  	/*
>>  	 * New online cpumask which is going to be set in this hotplug event.
>> @@ -2155,43 +2134,29 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>  	case CPU_DEAD:
>>  	case CPU_UP_CANCELED:
>>  		cpumask_copy(&online_new, cpu_online_mask);
>> +		list_for_each_entry(q, &all_q_list, all_q_node) {
>> +			queue_for_each_hw_ctx(q, hctx, i) {
>> +				cpumask_clear_cpu(cpu, hctx->cpumask);
>> +				cpumask_clear_cpu(cpu, hctx->tags->cpumask);
>> +			}
>> +		}
>>  		break;
>>  	case CPU_UP_PREPARE:
>>  		cpumask_copy(&online_new, cpu_online_mask);
>>  		cpumask_set_cpu(cpu, &online_new);
>> +		/* Update hctx->cpumask for newly onlined CPUs */
>> +		list_for_each_entry(q, &all_q_list, all_q_node) {
>> +			queue_for_each_hw_ctx(q, hctx, i) {
>> +				cpumask_set_cpu(cpu, hctx->cpumask);
>> +				hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>> +				cpumask_set_cpu(cpu, hctx->tags->cpumask);
>> +			}
>> +		}
>>  		break;
>>  	default:
>>  		return NOTIFY_OK;
>>  	}
>>  
>> -	mutex_lock(&all_q_mutex);
>> -
>> -	/*
>> -	 * We need to freeze and reinit all existing queues.  Freezing
>> -	 * involves synchronous wait for an RCU grace period and doing it
>> -	 * one by one may take a long time.  Start freezing all queues in
>> -	 * one swoop and then wait for the completions so that freezing can
>> -	 * take place in parallel.
>> -	 */
>> -	list_for_each_entry(q, &all_q_list, all_q_node)
>> -		blk_mq_freeze_queue_start(q);
>> -	list_for_each_entry(q, &all_q_list, all_q_node) {
>> -		blk_mq_freeze_queue_wait(q);
>> -
>> -		/*
>> -		 * timeout handler can't touch hw queue during the
>> -		 * reinitialization
>> -		 */
>> -		del_timer_sync(&q->timeout);
>> -	}
>> -
>> -	list_for_each_entry(q, &all_q_list, all_q_node)
>> -		blk_mq_queue_reinit(q, &online_new);
>> -
>> -	list_for_each_entry(q, &all_q_list, all_q_node)
>> -		blk_mq_unfreeze_queue(q);
>> -
>> -	mutex_unlock(&all_q_mutex);
>>  	return NOTIFY_OK;
>>  }
>>  
>>
>>
>>
> With the patch I have collected some numbers, running some android monkey tests along with
> an app that does random hotplugging of CPUs. The numbers obtained are as below:
> 
> CPU UP time (in microseconds):
>          default blk-mq: 58629 54207 61792 58458 59286    
> blk-mq with above patch: 2755  2588  3412  3239  3010 
> 
> CPU DOWN time (in microseconds):
>          default blk-mq: 95990  103379 103686 93214  128274   
> blk-mq with above patch: 67178  49876  70027  61261  59656
>  
Could someone please review the patch and provide some feedback, even if the
feedback is only about the approach we are trying here, or suggestions on how
to move ahead in that direction?
>>>> Did you try Tejun's percpu_ref patchset that I mentioned in the last email?
>>>> (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg984823.html)
>>>>
>>>>
>>> I have tried the suggested patch but did not see any improvements in CPU
>>> hotplug latencies.
>>>
>>
>>
> 
> 


-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [Query] increased latency observed in cpu hotplug path
  2016-09-21 14:06       ` [Query] increased latency observed in cpu hotplug path Khan, Imran
@ 2016-09-21 15:56         ` Akinobu Mita
  2016-09-22 10:06           ` Khan, Imran
  0 siblings, 1 reply; 8+ messages in thread
From: Akinobu Mita @ 2016-09-21 15:56 UTC (permalink / raw)
  To: Khan, Imran; +Cc: Jens Axboe, Christoph Hellwig, tsoni, kaushalk, linux-arm-msm

2016-09-21 23:06 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:

>>> commit 084ee793ec1ff4e2171b481642bfbdaa2e10664c
>>> Author: Imran Khan <kimran@codeaurora.org>
>>> Date:   Thu Sep 15 16:44:02 2016 +0530
>>>
>>>     blk-mq: use static mapping
>>>
>>>     blk-mq layer performs a remapping between s/w and h/w contexts and also
>>>     between h/w contexts and CPUs, whenever a CPU hotplug event happens.
>>>     This remapping has to wait for queue freezing which may take tens of
>>>     milliseconds, resulting in high latency in the CPU hotplug path.
>>>     This patch makes the above-mentioned mappings static so that we can
>>>     avoid remapping when a CPU hotplug event happens, and this results in
>>>     improved CPU hotplug latency.
>>>
>>>     Change-Id: Idf38cb6c4e78c91fda3c86608c6d0441f01ab435
>>>     Signed-off-by: Imran Khan <kimran@codeaurora.org>
>>>
>>> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
>>> index 8764c24..85fdb88 100644
>>> --- a/block/blk-mq-cpumap.c
>>> +++ b/block/blk-mq-cpumap.c
>>> @@ -52,18 +52,13 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
>>>
>>>      queue = 0;
>>>      for_each_possible_cpu(i) {
>>> -            if (!cpumask_test_cpu(i, online_mask)) {
>>> -                    map[i] = 0;
>>> -                    continue;
>>> -            }
>>> -
>>>              /*
>>>               * Easy case - we have equal or more hardware queues. Or
>>>               * there are no thread siblings to take into account. Do
>>>               * 1:1 if enough, or sequential mapping if less.
>>>               */
>>> -            if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
>>> -                    map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
>>> +            if (nr_queues >= nr_cpu_ids) {
>>> +                    map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues, queue);
>>>                      queue++;
>>>                      continue;
>>>              }
>>> @@ -75,7 +70,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
>>>               */
>>>              first_sibling = get_first_sibling(i);
>>>              if (first_sibling == i) {
>>> -                    map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
>>> +                    map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues,
>>>                                                      queue);
>>>                      queue++;
>>>              } else
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 6d6f8fe..0753fdf 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1783,10 +1783,6 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>>>              INIT_LIST_HEAD(&__ctx->rq_list);
>>>              __ctx->queue = q;
>>>
>>> -            /* If the cpu isn't online, the cpu is mapped to first hctx */
>>> -            if (!cpu_online(i))
>>> -                    continue;
>>> -
>>>              hctx = q->mq_ops->map_queue(q, i);
>>>
>>>              /*
>>> @@ -1820,12 +1816,9 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>>>       * Map software to hardware queues
>>>       */
>>>      queue_for_each_ctx(q, ctx, i) {
>>> -            /* If the cpu isn't online, the cpu is mapped to first hctx */
>>> -            if (!cpumask_test_cpu(i, online_mask))
>>> -                    continue;
>>> -
>>>              hctx = q->mq_ops->map_queue(q, i);
>>> -            cpumask_set_cpu(i, hctx->cpumask);
>>> +            if (cpumask_test_cpu(i, online_mask))
>>> +                    cpumask_set_cpu(i, hctx->cpumask);
>>>              ctx->index_hw = hctx->nr_ctx;
>>>              hctx->ctxs[hctx->nr_ctx++] = ctx;
>>>      }
>>> @@ -1863,17 +1856,22 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>>>
>>>              /*
>>>               * Initialize batch roundrobin counts
>>> +             * Set next_cpu only for those hctxs that have an online CPU
>>> +             * in their cpumask field. For hctxs that span a mix of online
>>> +             * and offline CPUs, this will always pick one of the online
>>> +             * ones. For hctxs whose CPUs are all offline, the cpumask
>>> +             * will be updated in reinit_notify.
>>>               */
>>> -            hctx->next_cpu = cpumask_first(hctx->cpumask);
>>> -            hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>>> +            if (cpumask_first(hctx->cpumask) < nr_cpu_ids) {
>>> +                    hctx->next_cpu = cpumask_first(hctx->cpumask);
>>> +                    hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>>> +            }
>>>      }
>>>
>>>      queue_for_each_ctx(q, ctx, i) {
>>> -            if (!cpumask_test_cpu(i, online_mask))
>>> -                    continue;
>>> -
>>>              hctx = q->mq_ops->map_queue(q, i);
>>> -            cpumask_set_cpu(i, hctx->tags->cpumask);
>>> +            if (cpumask_test_cpu(i, online_mask))
>>> +                    cpumask_set_cpu(i, hctx->tags->cpumask);
>>>      }
>>>  }
>>>
>>> @@ -2101,31 +2099,12 @@ void blk_mq_free_queue(struct request_queue *q)
>>>      blk_mq_free_hw_queues(q, set);
>>>  }
>>>
>>> -/* Basically redo blk_mq_init_queue with queue frozen */
>>> -static void blk_mq_queue_reinit(struct request_queue *q,
>>> -                            const struct cpumask *online_mask)
>>> -{
>>> -    WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
>>> -
>>> -    blk_mq_sysfs_unregister(q);
>>> -
>>> -    blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);

Now blk_mq_update_queue_map() is only used in block/blk-mq-cpumap.c,
so you can make blk_mq_update_queue_map() a static function.

>>> -
>>> -    /*
>>> -     * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
>>> -     * we should change hctx numa_node according to new topology (this
>>> -     * involves free and re-allocate memory, worthy doing?)
>>> -     */
>>> -
>>> -    blk_mq_map_swqueue(q, online_mask);
>>> -
>>> -    blk_mq_sysfs_register(q);
>>> -}
>>> -
>>>  static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>>                                    unsigned long action, void *hcpu)
>>>  {
>>>      struct request_queue *q;
>>> +    struct blk_mq_hw_ctx *hctx;
>>> +    int i;
>>>      int cpu = (unsigned long)hcpu;
>>>      /*
>>>       * New online cpumask which is going to be set in this hotplug event.
>>> @@ -2155,43 +2134,29 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>>      case CPU_DEAD:
>>>      case CPU_UP_CANCELED:
>>>              cpumask_copy(&online_new, cpu_online_mask);
>>> +            list_for_each_entry(q, &all_q_list, all_q_node) {
>>> +                    queue_for_each_hw_ctx(q, hctx, i) {
>>> +                            cpumask_clear_cpu(cpu, hctx->cpumask);
>>> +                            cpumask_clear_cpu(cpu, hctx->tags->cpumask);
>>> +                    }
>>> +            }

Do we need to hold all_q_mutex while iterating through all_q_list?

>>>              break;
>>>      case CPU_UP_PREPARE:
>>>              cpumask_copy(&online_new, cpu_online_mask);
>>>              cpumask_set_cpu(cpu, &online_new);
>>> +            /* Update hctx->cpumask for newly onlined CPUs */
>>> +            list_for_each_entry(q, &all_q_list, all_q_node) {
>>> +                    queue_for_each_hw_ctx(q, hctx, i) {
>>> +                            cpumask_set_cpu(cpu, hctx->cpumask);
>>> +                            hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>>> +                            cpumask_set_cpu(cpu, hctx->tags->cpumask);
>>> +                    }
>>> +            }
>>>              break;
>>>      default:
>>>              return NOTIFY_OK;
>>>      }
>>>
>>> -    mutex_lock(&all_q_mutex);
>>> -
>>> -    /*
>>> -     * We need to freeze and reinit all existing queues.  Freezing
>>> -     * involves synchronous wait for an RCU grace period and doing it
>>> -     * one by one may take a long time.  Start freezing all queues in
>>> -     * one swoop and then wait for the completions so that freezing can
>>> -     * take place in parallel.
>>> -     */
>>> -    list_for_each_entry(q, &all_q_list, all_q_node)
>>> -            blk_mq_freeze_queue_start(q);
>>> -    list_for_each_entry(q, &all_q_list, all_q_node) {
>>> -            blk_mq_freeze_queue_wait(q);
>>> -
>>> -            /*
>>> -             * timeout handler can't touch hw queue during the
>>> -             * reinitialization
>>> -             */
>>> -            del_timer_sync(&q->timeout);
>>> -    }
>>> -
>>> -    list_for_each_entry(q, &all_q_list, all_q_node)
>>> -            blk_mq_queue_reinit(q, &online_new);
>>> -
>>> -    list_for_each_entry(q, &all_q_list, all_q_node)
>>> -            blk_mq_unfreeze_queue(q);
>>> -
>>> -    mutex_unlock(&all_q_mutex);

This change makes the 'online_new' cpumask useless in blk_mq_queue_reinit_notify(),
so you can remove it completely.

>>>      return NOTIFY_OK;
>>>  }
>>>
>>>
>>>
>>>
>> With the patch I have collected some numbers, running some android monkey tests along with
>> an app that does random hotplugging of CPUs. The numbers obtained are as below:
>>
>> CPU UP time (in microseconds):
>>          default blk-mq: 58629 54207 61792 58458 59286
>> blk-mq with above patch: 2755  2588  3412  3239  3010
>>
>> CPU DOWN time (in microseconds):
>>          default blk-mq: 95990  103379 103686 93214  128274
>> blk-mq with above patch: 67178  49876  70027  61261  59656
>>
> Could someone please review the patch and provide some feedback, even if the
> feedback is only about the approach we are trying here, or suggestions on how
> to move ahead in that direction?
>>>>> Did you try Tejun's percpu_ref patchset that I mentioned in the last email?
>>>>> (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg984823.html)
>>>>>
>>>>>
>>>> I have tried the suggested patch but did not see any improvements in CPU
>>>> hotplug latencies.
>>>>
>>>
>>>
>>
>>
>
>
> --
> Imran Khan
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [Query] increased latency observed in cpu hotplug path
  2016-09-21 15:56         ` Akinobu Mita
@ 2016-09-22 10:06           ` Khan, Imran
  0 siblings, 0 replies; 8+ messages in thread
From: Khan, Imran @ 2016-09-22 10:06 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Jens Axboe, Christoph Hellwig, tsoni, kaushalk, linux-arm-msm,
	linux-block, linux-kernel

On 9/21/2016 9:26 PM, Akinobu Mita wrote:
> 2016-09-21 23:06 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:
> 
>>>> commit 084ee793ec1ff4e2171b481642bfbdaa2e10664c
>>>> Author: Imran Khan <kimran@codeaurora.org>
>>>> Date:   Thu Sep 15 16:44:02 2016 +0530
>>>>
>>>>     blk-mq: use static mapping
>>>>
>>>>     blk-mq layer performs a remapping between s/w and h/w contexts and also
>>>>     between h/w contexts and CPUs, whenever a CPU hotplug event happens.
>>>>     This remapping has to wait for queue freezing, which may take tens of
>>>>     milliseconds, resulting in high latency in the CPU hotplug path.
>>>>     This patch makes the above mentioned mappings static so that we can
>>>>     avoid remapping when a CPU hotplug event happens, and this results in
>>>>     improved CPU hotplug latency.
>>>>
>>>>     Change-Id: Idf38cb6c4e78c91fda3c86608c6d0441f01ab435
>>>>     Signed-off-by: Imran Khan <kimran@codeaurora.org>
>>>>
>>>> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
>>>> index 8764c24..85fdb88 100644
>>>> --- a/block/blk-mq-cpumap.c
>>>> +++ b/block/blk-mq-cpumap.c
>>>> @@ -52,18 +52,13 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
>>>>
>>>>      queue = 0;
>>>>      for_each_possible_cpu(i) {
>>>> -            if (!cpumask_test_cpu(i, online_mask)) {
>>>> -                    map[i] = 0;
>>>> -                    continue;
>>>> -            }
>>>> -
>>>>              /*
>>>>               * Easy case - we have equal or more hardware queues. Or
>>>>               * there are no thread siblings to take into account. Do
>>>>               * 1:1 if enough, or sequential mapping if less.
>>>>               */
>>>> -            if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
>>>> -                    map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
>>>> +            if (nr_queues >= nr_cpu_ids) {
>>>> +                    map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues, queue);
>>>>                      queue++;
>>>>                      continue;
>>>>              }
>>>> @@ -75,7 +70,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
>>>>               */
>>>>              first_sibling = get_first_sibling(i);
>>>>              if (first_sibling == i) {
>>>> -                    map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
>>>> +                    map[i] = cpu_to_queue_index(nr_cpu_ids, nr_queues,
>>>>                                                      queue);
>>>>                      queue++;
>>>>              } else
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 6d6f8fe..0753fdf 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -1783,10 +1783,6 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>>>>              INIT_LIST_HEAD(&__ctx->rq_list);
>>>>              __ctx->queue = q;
>>>>
>>>> -            /* If the cpu isn't online, the cpu is mapped to first hctx */
>>>> -            if (!cpu_online(i))
>>>> -                    continue;
>>>> -
>>>>              hctx = q->mq_ops->map_queue(q, i);
>>>>
>>>>              /*
>>>> @@ -1820,12 +1816,9 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>>>>       * Map software to hardware queues
>>>>       */
>>>>      queue_for_each_ctx(q, ctx, i) {
>>>> -            /* If the cpu isn't online, the cpu is mapped to first hctx */
>>>> -            if (!cpumask_test_cpu(i, online_mask))
>>>> -                    continue;
>>>> -
>>>>              hctx = q->mq_ops->map_queue(q, i);
>>>> -            cpumask_set_cpu(i, hctx->cpumask);
>>>> +            if (cpumask_test_cpu(i, online_mask))
>>>> +                    cpumask_set_cpu(i, hctx->cpumask);
>>>>              ctx->index_hw = hctx->nr_ctx;
>>>>              hctx->ctxs[hctx->nr_ctx++] = ctx;
>>>>      }
>>>> @@ -1863,17 +1856,22 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>>>>
>>>>              /*
>>>>               * Initialize batch roundrobin counts
>>>> +             * Set next_cpu only for those hctxs that have an online CPU
>>>> +             * in their cpumask. For an hctx whose cpumask spans both
>>>> +             * online and offline CPUs, this always picks an online one.
>>>> +             * For an hctx whose CPUs are all offline, the cpumask will
>>>> +             * be updated in the reinit notifier.
>>>>               */
>>>> -            hctx->next_cpu = cpumask_first(hctx->cpumask);
>>>> -            hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>>>> +            if (cpumask_first(hctx->cpumask) < nr_cpu_ids) {
>>>> +                    hctx->next_cpu = cpumask_first(hctx->cpumask);
>>>> +                    hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>>>> +            }
>>>>      }
>>>>
>>>>      queue_for_each_ctx(q, ctx, i) {
>>>> -            if (!cpumask_test_cpu(i, online_mask))
>>>> -                    continue;
>>>> -
>>>>              hctx = q->mq_ops->map_queue(q, i);
>>>> -            cpumask_set_cpu(i, hctx->tags->cpumask);
>>>> +            if (cpumask_test_cpu(i, online_mask))
>>>> +                    cpumask_set_cpu(i, hctx->tags->cpumask);
>>>>      }
>>>>  }
>>>>
>>>> @@ -2101,31 +2099,12 @@ void blk_mq_free_queue(struct request_queue *q)
>>>>      blk_mq_free_hw_queues(q, set);
>>>>  }
>>>>
>>>> -/* Basically redo blk_mq_init_queue with queue frozen */
>>>> -static void blk_mq_queue_reinit(struct request_queue *q,
>>>> -                            const struct cpumask *online_mask)
>>>> -{
>>>> -    WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
>>>> -
>>>> -    blk_mq_sysfs_unregister(q);
>>>> -
>>>> -    blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
> 
> Now blk_mq_update_queue_map() is only used in block/blk-mq-cpumap.c,
> so you can make blk_mq_update_queue_map() static function.
> 
>>>> -
>>>> -    /*
>>>> -     * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
>>>> -     * we should change hctx numa_node according to new topology (this
>>>> -     * involves free and re-allocate memory, worthy doing?)
>>>> -     */
>>>> -
>>>> -    blk_mq_map_swqueue(q, online_mask);
>>>> -
>>>> -    blk_mq_sysfs_register(q);
>>>> -}
>>>> -
>>>>  static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>>>                                    unsigned long action, void *hcpu)
>>>>  {
>>>>      struct request_queue *q;
>>>> +    struct blk_mq_hw_ctx *hctx;
>>>> +    int i;
>>>>      int cpu = (unsigned long)hcpu;
>>>>      /*
>>>>       * New online cpumask which is going to be set in this hotplug event.
>>>> @@ -2155,43 +2134,29 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>>>>      case CPU_DEAD:
>>>>      case CPU_UP_CANCELED:
>>>>              cpumask_copy(&online_new, cpu_online_mask);
>>>> +            list_for_each_entry(q, &all_q_list, all_q_node) {
>>>> +                    queue_for_each_hw_ctx(q, hctx, i) {
>>>> +                            cpumask_clear_cpu(cpu, hctx->cpumask);
>>>> +                            cpumask_clear_cpu(cpu, hctx->tags->cpumask);
>>>> +                    }
>>>> +            }
> 
> Do we need to hold all_q_mutex while iterating through all_q_list?
> 
Yes, we need to hold all_q_mutex here.
>>>>              break;
>>>>      case CPU_UP_PREPARE:
>>>>              cpumask_copy(&online_new, cpu_online_mask);
>>>>              cpumask_set_cpu(cpu, &online_new);
>>>> +            /* Update hctx->cpumask for newly onlined CPUs */
>>>> +            list_for_each_entry(q, &all_q_list, all_q_node) {
>>>> +                    queue_for_each_hw_ctx(q, hctx, i) {
>>>> +                            cpumask_set_cpu(cpu, hctx->cpumask);
>>>> +                            hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>>>> +                            cpumask_set_cpu(cpu, hctx->tags->cpumask);
>>>> +                    }
>>>> +            }
>>>>              break;
>>>>      default:
>>>>              return NOTIFY_OK;
>>>>      }
>>>>
>>>> -    mutex_lock(&all_q_mutex);
>>>> -
>>>> -    /*
>>>> -     * We need to freeze and reinit all existing queues.  Freezing
>>>> -     * involves synchronous wait for an RCU grace period and doing it
>>>> -     * one by one may take a long time.  Start freezing all queues in
>>>> -     * one swoop and then wait for the completions so that freezing can
>>>> -     * take place in parallel.
>>>> -     */
>>>> -    list_for_each_entry(q, &all_q_list, all_q_node)
>>>> -            blk_mq_freeze_queue_start(q);
>>>> -    list_for_each_entry(q, &all_q_list, all_q_node) {
>>>> -            blk_mq_freeze_queue_wait(q);
>>>> -
>>>> -            /*
>>>> -             * timeout handler can't touch hw queue during the
>>>> -             * reinitialization
>>>> -             */
>>>> -            del_timer_sync(&q->timeout);
>>>> -    }
>>>> -
>>>> -    list_for_each_entry(q, &all_q_list, all_q_node)
>>>> -            blk_mq_queue_reinit(q, &online_new);
>>>> -
>>>> -    list_for_each_entry(q, &all_q_list, all_q_node)
>>>> -            blk_mq_unfreeze_queue(q);
>>>> -
>>>> -    mutex_unlock(&all_q_mutex);
> 
> This change makes 'online_new' cpumask useless in blk_mq_queue_reinit_notify(),
> so you can remove it completely.
> 

Will remove this in the new patch set.
Thanks for the review comments; I will take them into account and
send a new patch set.

>>>>      return NOTIFY_OK;
>>>>  }
>>>>
>>>>
>>>>
>>>>
>>> With the patch I have collected some numbers, running some android monkey tests along with
>>> an app that does random hotplugging of CPUs. The numbers obtained are as below:
>>>
>>> CPU UP time (in microseconds):
>>>          default blk-mq: 58629 54207 61792 58458 59286
>>> blk-mq with above patch: 2755  2588  3412  3239  3010
>>>
>>> CPU DOWN time (in microseconds):
>>>          default blk-mq: 95990  103379 103686 93214  128274
>>> blk-mq with above patch: 67178  49876  70027  61261  59656
>>>
>> Could someone please review the patch and provide some feedback?
>> Even feedback about the approach we are trying here, or suggestions
>> on how to move ahead in that direction, would be welcome.
>>>>>> Did you try Tejun's percpu_ref patchset that I mentioned in the last email?
>>>>>> (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg984823.html)
>>>>>>
>>>>>>
>>>>> I have tried the suggested patch but did not see any improvements in cpu
>>>>> hotplug latencies.
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Imran Khan
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation


-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [Query] increased latency observed in cpu hotplug path
  2016-08-16  5:53       ` Khan, Imran
@ 2016-08-17 16:47         ` Khan, Imran
  0 siblings, 0 replies; 8+ messages in thread
From: Khan, Imran @ 2016-08-17 16:47 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: axboe, axboe, hch, tom.leiming, linux-block, linux-kernel, tsoni,
	kaushalk

On 8/16/2016 11:23 AM, Khan, Imran wrote:
> On 8/5/2016 12:49 PM, Khan, Imran wrote:
>> On 8/1/2016 2:58 PM, Khan, Imran wrote:
>>> On 7/30/2016 7:54 AM, Akinobu Mita wrote:
>>>> 2016-07-28 22:18 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Recently we have observed some increased latency in CPU hotplug
>>>>> event in CPU online path. For online latency we see that block
>>>>> layer is executing notification handler for CPU_UP_PREPARE event
>>>>> and this in turn waits for RCU grace period resulting (sometimes)
>>>>> in an execution time of 15-20 ms for this notification handler.
>>>>> This change was not there in 3.18 kernel but is present in 4.4
>>>>> kernel and was introduced by following commit:
>>>>>
>>>>>
>>>>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
>>>>> Author: Akinobu Mita <akinobu.mita@gmail.com>
>>>>> Date:   Sun Sep 27 02:09:23 2015 +0900
>>>>>
>>>>>     blk-mq: avoid inserting requests before establishing new mapping
>>>>
>>>> ...
>>>>
>>>>> Upon reverting this commit I could see an improvement of 15-20 ms in
>>>>> online latency. So I am looking for some help in analyzing the effects
>>>>> of reverting this or should some other approach to reduce the online
>>>>> latency must be taken.
>>>>
>>>> Can you observe the difference in online latency by removing
>>>> get_online_cpus() and put_online_cpus() pair in blk_mq_init_allocated_queue()
>>>> instead of fully reverting the commit?
>>>>
>>> Hi Akinobu,
>>> I tried your suggestion but could not achieve any improvement. Actually, the snippet causing the increased latency is the following:
>>>
>>> list_for_each_entry(q, &all_q_list, all_q_node) {
>>>                 blk_mq_freeze_queue_wait(q);
>>>
>>>                 /*
>>>                  * timeout handler can't touch hw queue during the
>>>                  * reinitialization
>>>                  */
>>>                 del_timer_sync(&q->timeout);
>>>  }
>>>
>>> I understand that this is now getting executed for CPU_UP_PREPARE as well, resulting in
>>> increased latency in the cpu online path. I am trying to reduce this latency while keeping the
>>> purpose of this commit intact. I would welcome further suggestions/feedback in this regard.
>>>
>> Hi Akinobu,
>>
>> I am not able to reduce the cpu online latency with this patch. Could you please let me know what
>> functionality would break if we dropped this patch from our kernel? Also, if you have other
>> suggestions for improving this patch, please let me know.
>>
> After moving the queue remapping to the block layer's kworker, I see that online latency has
> improved while offline latency remains the same. As the freezing of queues happens in the context
> of the block layer's worker, I think it is better to do the remapping in the same context and then
> proceed with the freezing. In this regard I have made the following change:
> 
> commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624
> Author: Imran Khan <kimran@codeaurora.org>
> Date:   Fri Aug 12 19:59:47 2016 +0530
> 
>     blk-mq: Move block queue remapping from cpu hotplug path
> 
>     During a cpu hotplug, the mappings between hardware and software
>     contexts need to be updated in order to take into account requests
>     submitted on the hotadded CPU. But if this remapping is done in the
>     hotplug notifier, it deteriorates the hotplug latency. So move the
>     block queue remapping to the block layer worker, which results in
>     significant improvements in hotplug latency.
> 
>     Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4
>     Signed-off-by: Imran Khan <kimran@codeaurora.org>
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6d6f8fe..06fcf89 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -22,7 +22,9 @@
>  #include <linux/sched/sysctl.h>
>  #include <linux/delay.h>
>  #include <linux/crash_dump.h>
> -
> +#include <linux/kthread.h>
> +#include <linux/mutex.h>
> +#include <linux/sched.h>
>  #include <trace/events/block.h>
> 
>  #include <linux/blk-mq.h>
> @@ -32,10 +36,18 @@
> 
>  static DEFINE_MUTEX(all_q_mutex);
>  static LIST_HEAD(all_q_list);
> -
>  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx);
> 
>  /*
> + * New online cpumask which is going to be set in this hotplug event.
> + * Declare this cpumasks as global as cpu-hotplug operation is invoked
> + * one-by-one and dynamically allocating this could result in a failure.
> + */
> +static struct cpumask online_new;
> +
> +static struct work_struct blk_mq_remap_work;
> +
> +/*
>   * Check if any of the ctx's have pending work in this hardware queue
>   */
>  static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
> @@ -2125,14 +2137,7 @@ static void blk_mq_queue_reinit(struct request_queue *q,
>  static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>                                       unsigned long action, void *hcpu)
>  {
> -       struct request_queue *q;
>         int cpu = (unsigned long)hcpu;
> -       /*
> -        * New online cpumask which is going to be set in this hotplug event.
> -        * Declare this cpumasks as global as cpu-hotplug operation is invoked
> -        * one-by-one and dynamically allocating this could result in a failure.
> -        */
> -       static struct cpumask online_new;
> 
>         /*
>          * Before hotadded cpu starts handling requests, new mappings must
> @@ -2155,43 +2160,17 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>         case CPU_DEAD:
>         case CPU_UP_CANCELED:
>                 cpumask_copy(&online_new, cpu_online_mask);
> +               kblockd_schedule_work(&blk_mq_remap_work);
>                 break;
>         case CPU_UP_PREPARE:
>                 cpumask_copy(&online_new, cpu_online_mask);
>                 cpumask_set_cpu(cpu, &online_new);
> +               kblockd_schedule_work(&blk_mq_remap_work);
>                 break;
>         default:
>                 return NOTIFY_OK;
>         }
> 
> -       mutex_lock(&all_q_mutex);
> -
> -       /*
> -        * We need to freeze and reinit all existing queues.  Freezing
> -        * involves synchronous wait for an RCU grace period and doing it
> -        * one by one may take a long time.  Start freezing all queues in
> -        * one swoop and then wait for the completions so that freezing can
> -        * take place in parallel.
> -        */
> -       list_for_each_entry(q, &all_q_list, all_q_node)
> -               blk_mq_freeze_queue_start(q);
> -       list_for_each_entry(q, &all_q_list, all_q_node) {
> -               blk_mq_freeze_queue_wait(q);
> -
> -               /*
> -                * timeout handler can't touch hw queue during the
> -                * reinitialization
> -                */
> -               del_timer_sync(&q->timeout);
> -       }
> -
> -       list_for_each_entry(q, &all_q_list, all_q_node)
> -               blk_mq_queue_reinit(q, &online_new);
> -
> -       list_for_each_entry(q, &all_q_list, all_q_node)
> -               blk_mq_unfreeze_queue(q);
> -
> -       mutex_unlock(&all_q_mutex);
>         return NOTIFY_OK;
>  }
> 
> @@ -2357,11 +2336,48 @@ void blk_mq_enable_hotplug(void)
>         mutex_unlock(&all_q_mutex);
>  }
> 
> +static void blk_remap_work(struct work_struct *work)
> +{
> +       struct request_queue *q;
> +
> +       /* Start the task of queue remapping */
> +       mutex_lock(&all_q_mutex);
> +
> +       /*
> +        * We need to freeze and reinit all existing queues.  Freezing
> +        * involves synchronous wait for an RCU grace period and doing it
> +        * one by one may take a long time.  Start freezing all queues in
> +        * one swoop and then wait for the completions so that freezing can
> +        * take place in parallel.
> +        */
> +       list_for_each_entry(q, &all_q_list, all_q_node)
> +               blk_mq_freeze_queue_start(q);
> +       list_for_each_entry(q, &all_q_list, all_q_node) {
> +               blk_mq_freeze_queue_wait(q);
> +
> +               /*
> +                * timeout handler can't touch hw queue during the
> +                * reinitialization
> +                */
> +               del_timer_sync(&q->timeout);
> +       }
> +
> +       list_for_each_entry(q, &all_q_list, all_q_node)
> +               blk_mq_queue_reinit(q, &online_new);
> +
> +       list_for_each_entry(q, &all_q_list, all_q_node)
> +               blk_mq_unfreeze_queue(q);
> +
> +       mutex_unlock(&all_q_mutex);
> +}
> +
> +
>  static int __init blk_mq_init(void)
>  {
>         blk_mq_cpu_init();
> 
>         hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
> +       INIT_WORK(&blk_mq_remap_work, blk_remap_work);
> 
>         return 0;
>  }
> 
> I have run overnight tests of fsstress and multiple dd commands along with cpu hotplugging.
> Could you please review this change and let me know if you see any issues with it, or if I need
> to test specific corner cases to validate it.
> 
Could someone kindly review this change and provide feedback?

-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [Query] increased latency observed in cpu hotplug path
  2016-08-05  7:19     ` Khan, Imran
@ 2016-08-16  5:53       ` Khan, Imran
  2016-08-17 16:47         ` Khan, Imran
  0 siblings, 1 reply; 8+ messages in thread
From: Khan, Imran @ 2016-08-16  5:53 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: axboe, axboe, hch, tom.leiming, linux-block, linux-kernel, tsoni,
	kaushalk

On 8/5/2016 12:49 PM, Khan, Imran wrote:
> On 8/1/2016 2:58 PM, Khan, Imran wrote:
>> On 7/30/2016 7:54 AM, Akinobu Mita wrote:
>>> 2016-07-28 22:18 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:
>>>>
>>>> Hi,
>>>>
>>>> Recently we have observed some increased latency in CPU hotplug
>>>> event in CPU online path. For online latency we see that block
>>>> layer is executing notification handler for CPU_UP_PREPARE event
>>>> and this in turn waits for RCU grace period resulting (sometimes)
>>>> in an execution time of 15-20 ms for this notification handler.
>>>> This change was not there in 3.18 kernel but is present in 4.4
>>>> kernel and was introduced by following commit:
>>>>
>>>>
>>>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
>>>> Author: Akinobu Mita <akinobu.mita@gmail.com>
>>>> Date:   Sun Sep 27 02:09:23 2015 +0900
>>>>
>>>>     blk-mq: avoid inserting requests before establishing new mapping
>>>
>>> ...
>>>
>>>> Upon reverting this commit I could see an improvement of 15-20 ms in
>>>> online latency. So I am looking for some help in analyzing the effects
>>>> of reverting this or should some other approach to reduce the online
>>>> latency must be taken.
>>>
>>> Can you observe the difference in online latency by removing
>>> get_online_cpus() and put_online_cpus() pair in blk_mq_init_allocated_queue()
>>> instead of fully reverting the commit?
>>>
>> Hi Akinobu,
>> I tried your suggestion but could not achieve any improvement. Actually, the snippet causing the increased latency is the following:
>>
>> list_for_each_entry(q, &all_q_list, all_q_node) {
>>                 blk_mq_freeze_queue_wait(q);
>>
>>                 /*
>>                  * timeout handler can't touch hw queue during the
>>                  * reinitialization
>>                  */
>>                 del_timer_sync(&q->timeout);
>>  }
>>
>> I understand that this is now getting executed for CPU_UP_PREPARE as well, resulting in
>> increased latency in the cpu online path. I am trying to reduce this latency while keeping the
>> purpose of this commit intact. I would welcome further suggestions/feedback in this regard.
>>
> Hi Akinobu,
> 
> I am not able to reduce the cpu online latency with this patch. Could you please let me know what
> functionality would break if we dropped this patch from our kernel? Also, if you have other
> suggestions for improving this patch, please let me know.
> 
After moving the queue remapping to the block layer's kworker, I see that online latency has
improved while offline latency remains the same. As the freezing of queues happens in the context
of the block layer's worker, I think it is better to do the remapping in the same context and then
proceed with the freezing. In this regard I have made the following change:

commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624
Author: Imran Khan <kimran@codeaurora.org>
Date:   Fri Aug 12 19:59:47 2016 +0530

    blk-mq: Move block queue remapping from cpu hotplug path

    During a cpu hotplug, the mappings between hardware and software
    contexts need to be updated in order to take into account requests
    submitted on the hotadded CPU. But if this remapping is done in the
    hotplug notifier, it deteriorates the hotplug latency. So move the
    block queue remapping to the block layer worker, which results in
    significant improvements in hotplug latency.

    Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4
    Signed-off-by: Imran Khan <kimran@codeaurora.org>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..06fcf89 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,7 +22,9 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
-
+#include <linux/kthread.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
 #include <trace/events/block.h>

 #include <linux/blk-mq.h>
@@ -32,10 +36,18 @@

 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
-
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx);

 /*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+static struct cpumask online_new;
+
+static struct work_struct blk_mq_remap_work;
+
+/*
  * Check if any of the ctx's have pending work in this hardware queue
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
@@ -2125,14 +2137,7 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
                                      unsigned long action, void *hcpu)
 {
-       struct request_queue *q;
        int cpu = (unsigned long)hcpu;
-       /*
-        * New online cpumask which is going to be set in this hotplug event.
-        * Declare this cpumasks as global as cpu-hotplug operation is invoked
-        * one-by-one and dynamically allocating this could result in a failure.
-        */
-       static struct cpumask online_new;

        /*
         * Before hotadded cpu starts handling requests, new mappings must
@@ -2155,43 +2160,17 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
        case CPU_DEAD:
        case CPU_UP_CANCELED:
                cpumask_copy(&online_new, cpu_online_mask);
+               kblockd_schedule_work(&blk_mq_remap_work);
                break;
        case CPU_UP_PREPARE:
                cpumask_copy(&online_new, cpu_online_mask);
                cpumask_set_cpu(cpu, &online_new);
+               kblockd_schedule_work(&blk_mq_remap_work);
                break;
        default:
                return NOTIFY_OK;
        }

-       mutex_lock(&all_q_mutex);
-
-       /*
-        * We need to freeze and reinit all existing queues.  Freezing
-        * involves synchronous wait for an RCU grace period and doing it
-        * one by one may take a long time.  Start freezing all queues in
-        * one swoop and then wait for the completions so that freezing can
-        * take place in parallel.
-        */
-       list_for_each_entry(q, &all_q_list, all_q_node)
-               blk_mq_freeze_queue_start(q);
-       list_for_each_entry(q, &all_q_list, all_q_node) {
-               blk_mq_freeze_queue_wait(q);
-
-               /*
-                * timeout handler can't touch hw queue during the
-                * reinitialization
-                */
-               del_timer_sync(&q->timeout);
-       }
-
-       list_for_each_entry(q, &all_q_list, all_q_node)
-               blk_mq_queue_reinit(q, &online_new);
-
-       list_for_each_entry(q, &all_q_list, all_q_node)
-               blk_mq_unfreeze_queue(q);
-
-       mutex_unlock(&all_q_mutex);
        return NOTIFY_OK;
 }

@@ -2357,11 +2336,48 @@ void blk_mq_enable_hotplug(void)
        mutex_unlock(&all_q_mutex);
 }

+static void blk_remap_work(struct work_struct *work)
+{
+       struct request_queue *q;
+
+       /* Start the task of queue remapping */
+       mutex_lock(&all_q_mutex);
+
+       /*
+        * We need to freeze and reinit all existing queues.  Freezing
+        * involves synchronous wait for an RCU grace period and doing it
+        * one by one may take a long time.  Start freezing all queues in
+        * one swoop and then wait for the completions so that freezing can
+        * take place in parallel.
+        */
+       list_for_each_entry(q, &all_q_list, all_q_node)
+               blk_mq_freeze_queue_start(q);
+       list_for_each_entry(q, &all_q_list, all_q_node) {
+               blk_mq_freeze_queue_wait(q);
+
+               /*
+                * timeout handler can't touch hw queue during the
+                * reinitialization
+                */
+               del_timer_sync(&q->timeout);
+       }
+
+       list_for_each_entry(q, &all_q_list, all_q_node)
+               blk_mq_queue_reinit(q, &online_new);
+
+       list_for_each_entry(q, &all_q_list, all_q_node)
+               blk_mq_unfreeze_queue(q);
+
+       mutex_unlock(&all_q_mutex);
+}
+
+
 static int __init blk_mq_init(void)
 {
        blk_mq_cpu_init();

        hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
+       INIT_WORK(&blk_mq_remap_work, blk_remap_work);

        return 0;
 }

I have run overnight tests of fsstress and multiple dd commands along with cpu hotplugging.
Could you please review this change and let me know if you see any issues with it, or if I need
to test specific corner cases to validate it.

-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [Query] increased latency observed in cpu hotplug path
  2016-08-01  9:28   ` Khan, Imran
@ 2016-08-05  7:19     ` Khan, Imran
  2016-08-16  5:53       ` Khan, Imran
  0 siblings, 1 reply; 8+ messages in thread
From: Khan, Imran @ 2016-08-05  7:19 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: axboe, axboe, hch, tom.leiming, linux-block, linux-kernel, tsoni

On 8/1/2016 2:58 PM, Khan, Imran wrote:
> On 7/30/2016 7:54 AM, Akinobu Mita wrote:
>> 2016-07-28 22:18 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:
>>>
>>> Hi,
>>>
>>> Recently we have observed some increased latency in CPU hotplug
>>> event in CPU online path. For online latency we see that block
>>> layer is executing notification handler for CPU_UP_PREPARE event
>>> and this in turn waits for RCU grace period resulting (sometimes)
>>> in an execution time of 15-20 ms for this notification handler.
>>> This change was not there in 3.18 kernel but is present in 4.4
>>> kernel and was introduced by following commit:
>>>
>>>
>>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
>>> Author: Akinobu Mita <akinobu.mita@gmail.com>
>>> Date:   Sun Sep 27 02:09:23 2015 +0900
>>>
>>>     blk-mq: avoid inserting requests before establishing new mapping
>>
>> ...
>>
>>> Upon reverting this commit I could see an improvement of 15-20 ms in
>>> online latency. So I am looking for some help in analyzing the effects
>>> of reverting this or should some other approach to reduce the online
>>> latency must be taken.
>>
>> Can you observe the difference in online latency by removing
>> get_online_cpus() and put_online_cpus() pair in blk_mq_init_allocated_queue()
>> instead of fully reverting the commit?
>>
> Hi Akinobu,
> I tried your suggestion but could not achieve any improvement. Actually, the snippet causing the increased latency is the following:
> 
> list_for_each_entry(q, &all_q_list, all_q_node) {
>                 blk_mq_freeze_queue_wait(q);
> 
>                 /*
>                  * timeout handler can't touch hw queue during the
>                  * reinitialization
>                  */
>                 del_timer_sync(&q->timeout);
>  }
> 
> I understand that this is now getting executed for CPU_UP_PREPARE as well, resulting in
> increased latency in the cpu online path. I am trying to reduce this latency while keeping the
> purpose of this commit intact. I would welcome further suggestions/feedback in this regard.
> 
Hi Akinobu,

I am not able to reduce the cpu online latency with this patch. Could you please let me know what
functionality would break if we dropped this patch from our kernel? Also, if you have other
suggestions for improving this patch, please let me know.

-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Query] increased latency observed in cpu hotplug path
       [not found] ` <CAC5umyi46+iiKS=924xOdVj4++qPgfXkjM-hqg+e68LPDM2izQ@mail.gmail.com>
@ 2016-08-01  9:28   ` Khan, Imran
  2016-08-05  7:19     ` Khan, Imran
  0 siblings, 1 reply; 8+ messages in thread
From: Khan, Imran @ 2016-08-01  9:28 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: axboe, akinobu.mita, axboe, hch, tom.leiming,
	linux-block, linux-kernel

On 7/30/2016 7:54 AM, Akinobu Mita wrote:
> 2016-07-28 22:18 GMT+09:00 Khan, Imran <kimran@codeaurora.org>:
>>
>> Hi,
>>
>> Recently we have observed some increased latency in CPU hotplug
>> events in the CPU online path. For online latency we see that the block
>> layer executes the notification handler for the CPU_UP_PREPARE event,
>> and this in turn waits for an RCU grace period, sometimes resulting
>> in an execution time of 15-20 ms for this notification handler.
>> This behavior was not present in the 3.18 kernel but is present in the
>> 4.4 kernel and was introduced by the following commit:
>>
>>
>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
>> Author: Akinobu Mita <akinobu.mita@gmail.com>
>> Date:   Sun Sep 27 02:09:23 2015 +0900
>>
>>     blk-mq: avoid inserting requests before establishing new mapping
> 
> ...
> 
>> Upon reverting this commit I could see an improvement of 15-20 ms in
>> online latency. So I am looking for some help in analyzing the effects
>> of reverting this commit, or in deciding whether some other approach
>> should be taken to reduce the online latency.
> 
> Can you observe the difference in online latency by removing
> get_online_cpus() and put_online_cpus() pair in blk_mq_init_allocated_queue()
> instead of full reverting the commit?
> 
Hi Akinobu,
I tried your suggestion but could not achieve any improvement. The snippet that is actually causing the change in latency is the following one:

list_for_each_entry(q, &all_q_list, all_q_node) {
        blk_mq_freeze_queue_wait(q);

        /*
         * timeout handler can't touch hw queue during the
         * reinitialization
         */
        del_timer_sync(&q->timeout);
}

I understand that this loop is now also executed for the CPU_UP_PREPARE event, resulting in
increased latency in the CPU online path. I am trying to reduce this latency while keeping the
purpose of this commit intact, and would welcome further suggestions and feedback in this regard.

-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Query] increased latency observed in cpu hotplug path
@ 2016-07-28 13:18 Khan, Imran
       [not found] ` <CAC5umyi46+iiKS=924xOdVj4++qPgfXkjM-hqg+e68LPDM2izQ@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Khan, Imran @ 2016-07-28 13:18 UTC (permalink / raw)
  To: axboe@kernel.dk, akinobu.mita@gmail.com,
	axboe@fb.com, hch@lst.de, tom.leiming
  Cc: linux-block@vger.kernel.org, linux-kernel


Hi,

Recently we have observed some increased latency in CPU hotplug
events in the CPU online path. For online latency we see that the block
layer executes the notification handler for the CPU_UP_PREPARE event,
and this in turn waits for an RCU grace period, sometimes resulting
in an execution time of 15-20 ms for this notification handler.
This behavior was not present in the 3.18 kernel but is present in the
4.4 kernel and was introduced by the following commit:

 
commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
Author: Akinobu Mita <akinobu.mita@gmail.com>
Date:   Sun Sep 27 02:09:23 2015 +0900
 
    blk-mq: avoid inserting requests before establishing new mapping

    Notifier callbacks for CPU_ONLINE action can be run on the other CPU
    than the CPU which was just onlined.  So it is possible for the
    process running on the just onlined CPU to insert request and run
    hw queue before establishing new mapping which is done by
    blk_mq_queue_reinit_notify().

    This can cause a problem when the CPU has just been onlined first time
    since the request queue was initialized.  At this time ctx->index_hw
    for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
    zero before blk_mq_queue_reinit_notify() is called by notifier
    callbacks for CPU_ONLINE action.

    For example, there is a single hw queue (hctx) and two CPU queues
    (ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined and
    a request is inserted into ctx1->rq_list and set bit0 in pending
    bitmap as ctx1->index_hw is still zero.

    And then while running hw queue, flush_busy_ctxs() finds bit0 is set
    in pending bitmap and tries to retrieve requests in
    hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so the
    request in ctx1->rq_list is ignored.

    Fix it by ensuring that new mapping is established before onlined cpu
    starts running.

    Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
    Reviewed-by: Ming Lei <tom.leiming@gmail.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Ming Lei <tom.leiming@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

 
Upon reverting this commit I could see an improvement of 15-20 ms in
online latency. So I am looking for some help in analyzing the effects
of reverting this commit, or in deciding whether some other approach
should be taken to reduce the online latency.

Can you please provide some feedback in this regard?

-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-09-22 10:06 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAC5umyj04kkoufz3me17Xqk_b-eqxPqocJHmRAKr+UkBawV+xQ@mail.gmail.com>
     [not found] ` <34042993-30f9-81e0-622d-9cd4283624b4@codeaurora.org>
     [not found]   ` <65516437-cf7a-4728-50d1-6d3c4b5a9adc@codeaurora.org>
     [not found]     ` <faa0cfa9-46e5-f580-654c-5ec7dcc8f977@codeaurora.org>
2016-09-21 14:06       ` [Query] increased latency observed in cpu hotplug path Khan, Imran
2016-09-21 15:56         ` Akinobu Mita
2016-09-22 10:06           ` Khan, Imran
2016-07-28 13:18 Khan, Imran
     [not found] ` <CAC5umyi46+iiKS=924xOdVj4++qPgfXkjM-hqg+e68LPDM2izQ@mail.gmail.com>
2016-08-01  9:28   ` Khan, Imran
2016-08-05  7:19     ` Khan, Imran
2016-08-16  5:53       ` Khan, Imran
2016-08-17 16:47         ` Khan, Imran
