Hi,

On 1/4/22 11:22, Hans de Goede wrote:
> Hi,
> 
> On 1/4/22 10:43, Jarkko Nikula wrote:
>> Hi
>>
>> On 1/3/22 18:40, Hans de Goede wrote:
>>> So we do eventually get an IRQ request for a pin using the GPIO controller
>>> internal interrupt-line 0. But the IRQ triggers at least once before then and
>>> even though we haven't set a handler yet, calling generic_handle_irq for the
>>> GPIO-chips irqdomain, offset 0 IRQ does manage to silence the interrupt.
>>>
>>> I've been tracing this through the kernel code and AFAICT we end up in:
>>>
>>> arch/x86/kernel/irq.c: ack_bad_irq() in this case:
>>>
>>> Which means that this message should show up in dmesg:
>>>
>>>          if (printk_ratelimit())
>>>                  pr_err("unexpected IRQ trap at vector %02x\n", irq);
>>>
>>> Can you confirm this? Also can you share the full dmesg output of this
>>> device (with both patches, with dyndbg option) ?
>>>
>> Hmm.. don't see it. Attached dmesg_20220103 is with both yesterday's patches.
> 
> I guess I must be misreading the code somehow then, the pinctrl-cherryview code
> sets the default low-level IRQ handler to handle_bad_irq and it does not
> get changed to handle_level_irq / handle_edge_irq until we hit a code
> path which also sets intr_lines[intsel] to the pin for which the IRQ is
> being activated.
> 
> And handle_bad_irq() is:
> 
> void handle_bad_irq(struct irq_desc *desc)
> {
>         unsigned int irq = irq_desc_get_irq(desc);
> 
>         print_irq_desc(irq, desc);
>         kstat_incr_irqs_this_cpu(desc);
>         ack_bad_irq(irq);
> }
> 
> So we should end up in arch/x86/kernel/irq.c: ack_bad_irq() AFAICT, but
> I guess I'm missing something somewhere.
> 
> Anyways since my hack to enable the spurious IRQ workaround works and
> a spurious IRQ is the root-cause here lets focus on that.
> 
> 
>>> Note what we are seeing here is basically a spurious IRQ. Except on a few
>>> known broken devices the cherryview pinctrl code relies on the BIOS having
>>> configured things so that there are no spurious IRQs. I've attached a
>>> quick hack which activates the workaround for known broken devices
>>> unconditionally. This replace my previous 2 patches. I expect this to
>>> fix things too.
>>>
>>> If you can make some time to give this one a test too that would be
>>> great, then we have some options on how to fix this :)
>>>
>> Hack patch booted too. Attached dmesg_20220104-hack is from this test.
> 
> Thanks.
> 
> Andy, Mika, why don't we just mask out all IRQs at boot unconditionally:
> 
> diff --git a/drivers/pinctrl/intel/pinctrl-cherryview.c b/drivers/pinctrl/intel/pinctrl-cherryview.c
> index 683b95e9639a..8981755a5d83 100644
> --- a/drivers/pinctrl/intel/pinctrl-cherryview.c
> +++ b/drivers/pinctrl/intel/pinctrl-cherryview.c
> @@ -1552,19 +1552,10 @@ static int chv_gpio_irq_init_hw(struct gpio_chip *chip)
>  	const struct intel_community *community = &pctrl->communities[0];
>  
>  	/*
> -	 * The same set of machines in chv_no_valid_mask[] have incorrectly
> -	 * configured GPIOs that generate spurious interrupts so we use
> -	 * this same list to apply another quirk for them.
> -	 *
> -	 * See also https://bugzilla.kernel.org/show_bug.cgi?id=197953.
> +	 * Start with all normal interrupts in the community masked,
> +	 * but leave the ones that can only generate GPEs unmasked.
>  	 */
> -	if (!pctrl->chip.irq.init_valid_mask) {
> -		/*
> -		 * Mask all interrupts the community is able to generate
> -		 * but leave the ones that can only generate GPEs unmasked.
> -		 */
> -		chv_pctrl_writel(pctrl, CHV_INTMASK, GENMASK(31, community->nirqs));
> -	}
> +	chv_pctrl_writel(pctrl, CHV_INTMASK, GENMASK(31, community->nirqs));
>  
>  	/* Clear all interrupts */
>  	chv_pctrl_writel(pctrl, CHV_INTSTAT, 0xffff);
> 
> ?
> 
> I never really understood why the code only did this on models which are
> known to generate spurious IRQs, leaving the IRQs unmasked before any driver
> has requested them seems not useful? And as this shows can actually be
> harmful at times ?

So although masking all regular (non GPE) interrupt-lines at boot fixes things,
I was thinking that if we somehow still manage to hit the new if() which prints
the "interrupt on unused interrupt line %u" message, we should still do
something to avoid the interrupt storm.

So I've written another patch, which I believe is something which we will want
regardless of the question if we should mask interrupts at boot or not.

I've attached this patch here. Jarkko, can you test a linux-next kernel with
just this patch added?

This should still lead to the "interrupt on unused interrupt line %u" message
getting printed, but hopefully the system will actually boot despite this,
since the code path printing the msg now acks the interrupt.

Thinking more about this I believe that this is likely the correct fix for
the caused regression, because the spurious IRQ was always there already.

Fixing the spurious IRQ is still good to do but is a somewhat separate issue
really.

Regards,

Hans