Subject: Re: Interrupt for port 19, but apparently not enabled; per-user 000000004af23acc
To: Juergen Gross
Cc: "xen-devel@lists.xenproject.org", linux-kernel@vger.kernel.org, mheyne@amazon.de
References: <6552fc66-ba19-2c77-7928-b0272d3e1622@xen.org> <4d8a7ba7-a9f6-2999-8750-bfe2b85f064e@suse.com>
From: Julien Grall
Message-ID: <9a08bbf2-ba6a-6e49-3bcb-bfe2beb32b99@xen.org>
Date: Tue, 22 Jun 2021 14:21:18 +0200
In-Reply-To: <4d8a7ba7-a9f6-2999-8750-bfe2b85f064e@suse.com>

Hi Juergen,

On 22/06/2021 13:04, Juergen Gross wrote:
> On 22.06.21 12:24, Julien Grall wrote:
>> Hi Juergen,
>>
>> As discussed on IRC yesterday, we noticed a couple of splats in 5.13-rc6
>> (and stable 5.4) in the evtchn
>> driver:
>>
>> [    7.581000] ------------[ cut here ]------------
>> [    7.581899] Interrupt for port 19, but apparently not enabled;
>> per-user 000000004af23acc
>> [    7.583401] WARNING: CPU: 0 PID: 467 at
>> /home/ANT.AMAZON.COM/jgrall/works/oss/linux/drivers/xen/evtchn.c:169
>> evtchn_interrupt+0xd5/0x100
>> [    7.585583] Modules linked in:
>> [    7.586188] CPU: 0 PID: 467 Comm: xenstore-read Not tainted
>> 5.13.0-rc6 #240
>> [    7.587462] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> [    7.589462] RIP: e030:evtchn_interrupt+0xd5/0x100
>> [    7.590361] Code: 48 8d bb d8 01 00 00 ba 01 00 00 00 be 1d 00 00 00
>> e8 5f 72 c4 ff eb b2 8b 75 20 48 89 da 48 c7 c7 a8 03 5f 82 e8 6b 2d 96
>> ff <0f> 0b e9 4d ff ff ff 41 0f b6 f4 48 c7 c7 80 da a2 82 e8 f0
>> [    7.593662] RSP: e02b:ffffc90040003e60 EFLAGS: 00010082
>> [    7.594636] RAX: 0000000000000000 RBX: ffff888102328c00 RCX:
>> 0000000000000027
>> [    7.595924] RDX: 0000000000000000 RSI: ffff88817fe18ad0 RDI:
>> ffff88817fe18ad8
>> [    7.597216] RBP: ffff888108ef8140 R08: 0000000000000000 R09:
>> 0000000000000001
>> [    7.598522] R10: 0000000000000000 R11: 7075727265746e49 R12:
>> 0000000000000000
>> [    7.599810] R13: ffffc90040003ec4 R14: ffff8881001b8000 R15:
>> ffff888109b36f80
>> [    7.601113] FS:  0000000000000000(0000) GS:ffff88817fe00000(0000)
>> knlGS:0000000000000000
>> [    7.602570] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    7.603700] CR2: 00007f15b390e368 CR3: 000000010bb04000 CR4:
>> 0000000000050660
>> [    7.604993] Call Trace:
>> [    7.605501]
>> [    7.605929]  __handle_irq_event_percpu+0x4c/0x330
>> [    7.606817]  handle_irq_event_percpu+0x32/0xa0
>> [    7.607670]  handle_irq_event+0x3a/0x60
>> [    7.608416]  handle_edge_irq+0x9b/0x1f0
>> [    7.609154]  generic_handle_irq+0x4f/0x60
>> [    7.609918]  __evtchn_fifo_handle_events+0x195/0x3a0
>> [    7.610864]  __xen_evtchn_do_upcall+0x66/0xb0
>> [    7.611693]  __xen_pv_evtchn_do_upcall+0x1d/0x30
>> [    7.612582]  xen_pv_evtchn_do_upcall+0x9d/0xc0
>> [    7.613439]
>> [    7.613882]  exc_xen_hypervisor_callback+0x8/0x10
>>
>> This is quite similar to the problem I reported a few months ago (see
>> [1]) but this time this is happening with fifo rather than 2L.
>>
>> I haven't been able to reproduce it reliably so far. But looking at
>> the code, I think I have found another potential race after commit
>>
>> commit b6622798bc50b625a1e62f82c7190df40c1f5b21
>> Author: Juergen Gross
>> Date:   Sat Mar 6 17:18:33 2021 +0100
>>     xen/events: avoid handling the same event on two cpus at the same time
>>     When changing the cpu affinity of an event it can happen today that
>>     (with some unlucky timing) the same event will be handled on the old
>>     and the new cpu at the same time.
>>     Avoid that by adding an "event active" flag to the per-event data and
>>     call the handler only if this flag isn't set.
>>     Cc: stable@vger.kernel.org
>>     Reported-by: Julien Grall
>>     Signed-off-by: Juergen Gross
>>     Reviewed-by: Julien Grall
>>     Link: https://lore.kernel.org/r/20210306161833.4552-4-jgross@suse.com
>>     Signed-off-by: Boris Ostrovsky
>>
>> The evtchn driver will use the lateeoi handlers. So the code to ack
>> looks like:
>>
>> do_mask(..., EVT_MASK_REASON_EOI_PENDING)
>> smp_store_release(&info->is_active, 0);
>> clear_evtchn(info->evtchn);
>>
>> The code to handle an interrupt looks like:
>>
>> clear_link(...)
>> if ( evtchn_fifo_is_pending(port) && !evtchn_fifo_is_mask()) {
>>    if (xchg_acquire(&info->is_active, 1))
>>      return;
>>    generic_handle_irq();
>> }
>>
>> After changing the affinity, an interrupt may be received once on the
>> previous vCPU.
>> So, I think the following can happen:
>>
>> vCPU0                      | vCPU1
>>                            |
>>   Receive event            |
>>                            | change affinity to vCPU1
>>   clear_link()             |
>>                            |
>>            /* The interrupt is re-raised */
>>                            | receive event
>>                            |
>>                            | /* The interrupt is not masked */
>>   info->is_active = 1      |
>>   do_mask(...)             |
>>   info->is_active = 0      |
>>                            | info->is_active = 1
>>   clear_evtchn(...)        |
>>                            | do_mask(...)
>>                            | info->is_active = 0
>>                            | clear_evtchn(...)
>>
>> Does this look plausible to you?
>
> Yes, it does.
>
> Thanks for the analysis.
>
> So I guess for lateeoi events we need to clear is_active only in
> xen_irq_lateeoi()? At a first glance this should fix the issue.

It should work and would be quite neat. But I believe clear_evtchn()
would have to stay in the ack helper to avoid losing interrupts.

Cheers,

-- 
Julien Grall
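
As an illustration only, the interleaving in the diagram above can be replayed step by step in a small user-space C model. Everything here is hypothetical: struct event_model, deliverable(), claim() and replay_race() are made-up names, the model is single-threaded and simply executes the steps in the order the diagram lists them, and it assumes that each vCPU which wins the is_active exchange goes on to call the handler once. It is not kernel code and says nothing about the memory-ordering details.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one event channel (not the kernel's irq_info). */
struct event_model {
    bool pending;   /* pending bit of the event channel */
    bool masked;    /* set by do_mask(..., EVT_MASK_REASON_EOI_PENDING) */
    int  is_active; /* the "event active" flag */
    int  handled;   /* number of handler invocations */
};

/* Models the check "pending && !masked" done before delivery. */
static bool deliverable(const struct event_model *ev)
{
    return ev->pending && !ev->masked;
}

/* Models xchg_acquire(&info->is_active, 1); true if the flag was clear. */
static bool claim(struct event_model *ev)
{
    int old = ev->is_active;

    ev->is_active = 1;
    return old == 0;
}

/*
 * Replays the diagram in order. If clear_only_in_lateeoi is true, the ack
 * step leaves is_active set (assumed to be cleared later from
 * xen_irq_lateeoi(), which this replay never reaches).
 */
int replay_race(bool clear_only_in_lateeoi)
{
    struct event_model ev = { .pending = true };

    /* Both vCPUs see the event before anyone has masked it. */
    bool vcpu0_sees = deliverable(&ev);
    bool vcpu1_sees = deliverable(&ev);
    assert(vcpu0_sees && vcpu1_sees);

    /* vCPU0: is_active = 1, do_mask(), is_active = 0 (current ack path). */
    bool vcpu0_claimed = claim(&ev);
    ev.masked = true;
    if (!clear_only_in_lateeoi)
        ev.is_active = 0;

    /* vCPU1: is_active = 1, using its earlier "not masked" observation. */
    bool vcpu1_claimed = claim(&ev);

    /* vCPU0: clear_evtchn(), then the handler runs. */
    ev.pending = false;
    if (vcpu0_claimed)
        ev.handled++;

    /* vCPU1: only continues if it won the exchange. */
    if (vcpu1_claimed) {
        ev.masked = true;
        if (!clear_only_in_lateeoi)
            ev.is_active = 0;
        ev.pending = false;
        ev.handled++;
    }

    return ev.handled;
}
```

Under these assumptions, replay_race(false) models the current ack path and counts two handler invocations for a single event, while replay_race(true) models clearing is_active only from xen_irq_lateeoi() and counts one. clear_evtchn() stays in the ack step in both cases.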