From: Jan Kiszka <jan.kiszka@siemens.com>
To: Philippe Gerum <rpm@xenomai.org>
Cc: Xenomai <xenomai@xenomai.org>
Subject: Re: [PATCH] irq_pipeline: Prevent returning to user space with pending inband IRQs
Date: Fri, 11 Jun 2021 09:15:34 +0200
Message-ID: <99135b48-1452-4f47-eb93-1bfae2e5e0a2@siemens.com>
In-Reply-To: <87wnr37eht.fsf@xenomai.org>

On 09.06.21 08:12, Philippe Gerum wrote:
> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> On 08.06.21 19:39, Philippe Gerum wrote:
>>>
>>> Philippe Gerum <rpm@xenomai.org> writes:
>>>
>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>
>>>>> On 07.06.21 18:19, Philippe Gerum wrote:
>>>>>>
>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>
>>>>>>> On 07.06.21 17:04, Philippe Gerum wrote:
>>>>>>>>
>>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>>
>>>>>>>>> On 07.06.21 15:48, Jan Kiszka wrote:
>>>>>>>>>> On 07.06.21 15:44, Jan Kiszka wrote:
>>>>>>>>>>> On 07.06.21 15:03, Philippe Gerum wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 07.06.21 09:17, Philippe Gerum wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 06.06.21 16:43, Philippe Gerum via Xenomai wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jan Kiszka <jan.kiszka@web.de> writes:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The correct order is hard IRQs off first, then stalling inband, see also
>>>>>>>>>>>>>>>>> I-pipe. Otherwise, we risk taking an interrupt for the inband stage but
>>>>>>>>>>>>>>>>> not injecting it before returning to user space.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Fixes the RCU stalls Florian reported, at least for me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm afraid this wasn't the last regression of translating I-pipe into
>>>>>>>>>>>>>>>>> dovetail..
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  include/linux/entry-common.h | 3 ++-
>>>>>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>>>>>>>>>>>>>>>>> index 0fb45b2d6094..00540110985e 100644
>>>>>>>>>>>>>>>>> --- a/include/linux/entry-common.h
>>>>>>>>>>>>>>>>> +++ b/include/linux/entry-common.h
>>>>>>>>>>>>>>>>> @@ -203,7 +203,8 @@ static inline void local_irq_disable_exit_to_user(void);
>>>>>>>>>>>>>>>>>  #ifndef local_irq_disable_exit_to_user
>>>>>>>>>>>>>>>>>  static inline void local_irq_disable_exit_to_user(void)
>>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>> -	local_irq_disable_full();
>>>>>>>>>>>>>>>>> +	hard_cond_local_irq_disable();
>>>>>>>>>>>>>>>>> +	local_irq_disable();
>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>  #endif
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Good spot, a real issue here, just like the one the CPU_IDLE code deals
>>>>>>>>>>>>>>>> with. The fix looks good, but maybe we could do even better.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The way local_irq_disable_full() works tends to introduce this issue in
>>>>>>>>>>>>>>>> a sneaky way, when the code path does not synchronize the interrupt log
>>>>>>>>>>>>>>>> (exit_to_user_mode_loop may be at fault in this case, which does not
>>>>>>>>>>>>>>>> traverse synchronize_pipeline()). Lack of synchronization would still
>>>>>>>>>>>>>>>> hit us if some trap handler turns hard irqs off -> on -> off the same
>>>>>>>>>>>>>>>> way, and we don't leave through the user exit path.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Inverting the order of the actions in local_irq_disable_full() may be an
>>>>>>>>>>>>>>>> option (inband_irq_disable would allow that); making sure we cannot be
>>>>>>>>>>>>>>>> caught by the same issue when returning to kernel mode is another
>>>>>>>>>>>>>>>> way. This needs more thought, I think.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, always this way?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> local_irq_disable_full:
>>>>>>>>>>>>>>> 	hard_local_irq_disable	
>>>>>>>>>>>>>>> 	local_irq_disable
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> local_irq_enable_full:
>>>>>>>>>>>>>>> 	hard_local_irq_enable
>>>>>>>>>>>>>>> 	local_irq_enable
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or even flip local_irq_enable_full as well?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not this one, hard irqs must be on before local_irq_enable() is issued
>>>>>>>>>>>>>> otherwise we would trigger a debug assertion. The reason for that
>>>>>>>>>>>>>> assertion is that the I-pipe version of local_irq_enable() forcibly
>>>>>>>>>>>>>> enables hard irqs, something we want to depart from because whether
>>>>>>>>>>>>>> or not the caller expects such a side effect is error-prone. For
>>>>>>>>>>>>>> this reason, inband_irq_enable() expects hard irqs on, which should be
>>>>>>>>>>>>>> the current hard irq state for the caller under normal circumstances.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Was there ever code that required ordering local_irq_disable_full the
>>>>>>>>>>>>>>> other way around?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After review, I don't think so. The current order for
>>>>>>>>>>>>>> local_irq_disable_full() was rather determined by applying a reverse
>>>>>>>>>>>>>> sequence compared to local_irq_enable_full(). So this looks good. I need
>>>>>>>>>>>>>> to test this on the armv[7/8] side here first before merging.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a commit somewhere that I can drop already?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 1. These are fine with me in their latest iteration:
>>>>>>>>>>>>
>>>>>>>>>>>>   irq_pipeline: Clean up stage_info field and users
>>>>>>>>>>>>   irq_pipeline: Account for stage migration across faults
>>>>>>>>>>>>   irq_pipeline: Warn when calling irqentry_enter with oob stalled
>>>>>>>>>>>>   x86: dovetail: Fix TS flag reservation
>>>>>>>>>>>>
>>>>>>>>>>>> 2. This one should be replaced by a fix in local_irq_disable_full(),
>>>>>>>>>>>> pending some tests ongoing here:
>>>>>>>>>>>>
>>>>>>>>>>>> irq_pipeline: Prevent returning to user space with pending inband IRQs
>>>>>>>>>>>>
>>>>>>>>>>>> However, as I mentioned earlier, this may not be sufficient per se; we
>>>>>>>>>>>> may have to synchronize the in-band log when unstalling the stage on the
>>>>>>>>>>>> kernel exit path, i.e. the exit point fixed up by the "Account for stage
>>>>>>>>>>>> migration across faults" patch. This would address the following
>>>>>>>>>>>> scenario:
>>>>>>>>>>>>
>>>>>>>>>>>>         irqentry_enter on kernel context:
>>>>>>>>>>>>                        stall_inband();
>>>>>>>>>>>>                                 trap_handler();
>>>>>>>>>>>>                                         hard_local_irq_enable()
>>>>>>>>>>>>                                         hard_local_irq_disable()
>>>>>>>>>>>>                        if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>>>>>>>>>>>                            state.stage_info == IRQENTRY_OOB) {
>>>>>>>>>>>>                              unstall_inband();
>>>>>>>>>>>>                        	     synchronize_pipeline();
>>>>>>>>>>>
>>>>>>>>>>> Oh, unstall_inband does not synchronize? Because of
>>>>>>>>>>> hard_local_irq_disable or in general?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ah, found it - unstall != inband_irq_enable. Why not use the latter
>>>>>>>>>> directly?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ...because it's instrumented to detect hard_irqs_disabled.
>>>>>>>>>
>>>>>>>>> Can we safely sync at that point?
>>>>>>>>
>>>>>>>> Yes we may do this, because we restore the original -unstalled- state of
>>>>>>>> the in-band stage. So if we have pending IRQs, we may play them at this
>>>>>>>> point. What we obviously may not do is create synchronization points
>>>>>>>> where the kernel does not expect irqs.
>>>>>>>>
>>>>>>>
>>>>>>> Again: When will any reschedules that might be requested by those
>>>>>>> replayed IRQs be handled then?
>>>>>>>
>>>>>>
>>>>>> Nowhere unless we provide a log synchronization routine dedicated to the
>>>>>> in-band kernel exit path, which would fold in the rescheduling call via
>>>>>> irqentry_exit_cond_resched(). I'll write that one.
>>>>>>
>>>>>
>>>>> How about simply moving the replay up, before the kernel checks for
>>>>> pending reschedules anyway? We can then stall inband again, to please
>>>>> lockdep & Co., as long as there is no hard-irq-on section before the interrupt return.
>>>>>
>>>>
>>>> Yes, this is the right way to do this. I'm on it.
>>>
>>> Please test this (heavily) with cobalt. This survived a 2-hour-long kvm
>>> test here, running loops of the evl test suite + hackbench over 64 vcpus,
>>> so far so good. No reason for any difference in behavior between cobalt
>>> and evl in this area, this is merely a dovetail business.
>>>
>>> commit 30808d958277a525baad33f3c68ccb7910dc3176 (HEAD -> rebase/v5.10.y-evl)
>>> Author: Philippe Gerum <rpm@xenomai.org>
>>>
>>>     genirq: irq_pipeline: synchronize log on irq exit to kernel
>>>     
>>>     We must make sure to play any IRQ which might be pending in the
>>>     in-band log before leaving an interrupt frame for a preempted kernel
>>>     context.
>>>     
>>>     This completes "irq_pipeline: Account for stage migration across
>>>     faults", so that we synchronize the log once the in-band stage is
>>>     unstalled. In addition, we also care to do this before
>>>     preempt_schedule_irq() runs, so that we won't miss any rescheduling
>>>     request which might have been triggered by some IRQ we just played.
>>>     
>>>     Signed-off-by: Philippe Gerum <rpm@xenomai.org>
>>>     Suggested-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>> index 4e81c0c03e5726a..99512307537561b 100644
>>> --- a/kernel/entry/common.c
>>> +++ b/kernel/entry/common.c
>>> @@ -477,8 +477,57 @@ bool irqexit_may_preempt_schedule(irqentry_state_t state,
>>>  
>>>  #endif
>>>  
>>> +#ifdef CONFIG_IRQ_PIPELINE
>>> +
>>> +static bool irqentry_syncstage(irqentry_state_t state) /* hard irqs off */
>>> +{
>>> +	/*
>>> +	 * If pipelining interrupts, enable in-band IRQs then
>>> +	 * synchronize the interrupt log on exit if:
>>> +	 *
>>> +	 * - irqentry_enter() stalled the stage in order to mirror the
>>> +	 * hardware state.
>>> +	 *
>>> +	 * - we were coming from oob, thus went through a stage migration
>>> +	 * that was caused by taking a CPU exception, e.g., a fault.
>>> +	 *
>>> +	 * We run before preempt_schedule_irq() may be called later on
>>> +	 * by preemptible kernels, so that any rescheduling request
>>> +	 * triggered by in-band IRQ handlers is considered.
>>> +	 */
>>> +	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>> +		state.stage_info == IRQENTRY_OOB) {
>>> +		unstall_inband_nocheck();
>>> +		synchronize_pipeline_on_irq();
>>> +		stall_inband_nocheck();
>>> +		return true;
>>> +	}
>>> +
>>> +	return false;
>>> +}
>>> +
>>> +static void irqentry_unstall(void)
>>> +{
>>> +	unstall_inband_nocheck();
>>> +}
>>> +
>>> +#else
>>> +
>>> +static bool irqentry_syncstage(irqentry_state_t state)
>>> +{
>>> +	return false;
>>> +}
>>> +
>>> +static void irqentry_unstall(void)
>>> +{
>>> +}
>>> +
>>> +#endif
>>> +
>>>  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>  {
>>> +	bool synchronized = false;
>>> +
>>>  	if (running_oob())
>>>  		return;
>>>  
>>> @@ -488,7 +537,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>  	if (user_mode(regs)) {
>>>  		irqentry_exit_to_user_mode(regs);
>>>  		return;
>>> -	} else if (irqexit_may_preempt_schedule(state, regs)) {
>>> +	}
>>> +
>>> +	synchronized = irqentry_syncstage(state);
>>> +
>>> +	if (irqexit_may_preempt_schedule(state, regs)) {
>>>  		/*
>>>  		 * If RCU was not watching on entry this needs to be done
>>>  		 * carefully and needs the same ordering of lockdep/tracing
>>> @@ -521,17 +574,9 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>  	}
>>>  
>>>  out:
>>> -#ifdef CONFIG_IRQ_PIPELINE
>>> -	/*
>>> -	 * If pipelining interrupts, clear the in-band stall bit if
>>> -	 * irqentry_enter() raised it in order to mirror the hardware
>>> -	 * state. Also clear it when we where coming from oob, thus went
>>> -	 * through a migration that was caused by taking, e.g., a fault.
>>> -	 */
>>> -	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>> -	    state.stage_info == IRQENTRY_OOB)
>>> -		unstall_inband();
>>> -#endif
>>> +	if (synchronized)
>>> +		irqentry_unstall();
>>> +
>>>  	return;
>>>  }
>>>  
>>>
>>
>> I've put that into a staging/v5.10.y-dovetail branch, pointed
>> xenomai-images to that and started a run in our lab. That should give us
>> some confidence that things work under cobalt across all archs, and we
>> can make that commit official.
>>
>> However, heavy stressing would require triggering migration-on-fault in
>> tighter loops while running something like stress-ng in the background.
>> Maybe it would make sense to add such a pattern to the switchtest (would
>> also require a way to silence the related kernel warnings during
>> runtime, not only build-time)?
>>
> 
> switchtest aims at exercising the core scheduling logic including the
> transition between execution stages, so migration-on-fault would fit
> there indeed.
> 

Do we need this testcase first? My feeling from review and testing is
that it is safe to move forward with the patch already.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


