Message-Id: <25ba1e1f-c05b-4b67-b547-6b5dbc958a2f@www.fastmail.com>
In-Reply-To: <87tui162am.ffs@tglx>
References: <20210913200132.3396598-1-sohil.mehta@intel.com>
 <20210913200132.3396598-12-sohil.mehta@intel.com>
 <877dex7tgj.ffs@tglx>
 <87tui162am.ffs@tglx>
Date: Thu, 30 Sep 2021 21:41:52 -0700
From: "Andy Lutomirski"
To: "Thomas Gleixner", "Sohil Mehta", "the arch/x86 maintainers"
Cc: "Tony Luck", "Dave Hansen", "Ingo Molnar", "Borislav Petkov",
 "H. Peter Anvin", "Jens Axboe", "Christian Brauner", "Peter Zijlstra (Intel)",
 "Shuah Khan", "Arnd Bergmann", "Jonathan Corbet", "Raj Ashok", "Jacob Pan",
 "Gayatri Kammela", "Zeng Guang", "Williams, Dan J", "Randy E Witt",
 "Shankar, Ravi V", "Ramesh Thomas", "Linux API", linux-arch@vger.kernel.org,
 "Linux Kernel Mailing List", linux-kselftest@vger.kernel.org
Subject: Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 15:01, Andy Lutomirski wrote:
>> On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:
>>>
>>> But even with that we still need to keep track of the armed ones per CPU
>>> so we can handle CPU hotunplug correctly. Sigh...
>>
>> I don't think any real work is needed. We will only ever have armed
>> UPIDs (with notification interrupts enabled) for running tasks, and
>> hot-unplugged CPUs don't have running tasks.
>
> That's not the problem. The problem is the wait for uintr case where the
> task is obviously not running:
>
> CPU 1
>       upid = T1->upid;
>       upid->vector = UINTR_WAIT_VECTOR;
>       upid->ndst = local_apic_id();
>       ...
>       do {
>           ...
>           schedule();
>       }
>
> CPU 0
>     unplug CPU 1
>
>     SENDUPI(index)
>       // Hardware does:
>       tblentry = &ttable[index];
>       upid = tblentry->upid;
>       upid->pir |= tblentry->uv;
>       send_IPI(upid->vector, upid->ndst);
>
> So SENDUPI will send the IPI to the APIC ID provided by T1->upid.ndst
> which points to the offlined CPU 1 and therefore is obviously going to
> /dev/null. IOW, lost wakeup...

Yes, but I don't think this is how we should structure this.

CPU 1
    upid->vector = UINV;
    upid->ndst = local_apic_id();
    exit to usermode;
    return from usermode;
    ...

    schedule();
    fpu__save_crap [see below]:
        if (this task is waiting for a uintr) {
            upid->resv0 = 1;  /* arm #GP */
        } else {
            upid->sn = 1;
        }

>> We do need a way to drain pending IPIs before we offline a CPU, but
>> that's a separate problem and may be unsolvable for all I know. Is
>> there a magic APIC operation to wait until all initiated IPIs
>> targeting the local CPU arrive? I guess we can also just mask the
>> notification vector so that it won't crash us if we get a stale IPI
>> after going offline.
>
> All of this is solved already, otherwise CPU hot unplug would explode in
> your face every time. The software IPI send side is carefully
> synchronized vs. hotplug (at least in theory). May I ask you politely to
> make yourself familiar with all that before touting "We do need..." based
> on random assumptions?

I'm aware that the software send IPI side is synchronized against hotplug.
But SENDUIPI is not, unless we're going to have the CPU offline code IPI
every other CPU to make sure that their SENDUIPIs have completed -- we
don't control the SENDUIPI code.

After reading the ISE docs again, I think it might be possible to use the
ON bit to synchronize. In the schedule-out path, if we discover that
ON = 1, then there is an IPI in flight to us. In theory, we could wait for
it, although actually doing so could be a mess. That's why I'm asking
whether there's a way to tell the APIC to literally wait for all IPIs that
are *already sent* to be delivered.
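For concreteness, here is a rough C sketch of the schedule-out logic above. The
UPID layout is simplified and the names (uintr_sched_out, UPID_ARM_GP,
blocked_in_uintr_wait) are invented for illustration; none of this is code from
the RFC series, and real code would need atomic (locked) updates because
SENDUIPI-ing CPUs touch the same UPID concurrently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified UPID: notification control word followed by the posted requests. */
struct uintr_upid {
	struct {
		uint8_t  status;	/* bit 0: ON, bit 1: SN, others reserved */
		uint8_t  reserved;
		uint8_t  nv;		/* notification vector */
		uint8_t  pad;
		uint32_t ndst;		/* notification destination (APIC ID) */
	} nc;
	uint64_t puir;			/* posted user interrupt requests (PIR) */
};

#define UPID_ON		(1u << 0)	/* notification outstanding */
#define UPID_SN		(1u << 1)	/* suppress notification */
#define UPID_ARM_GP	(1u << 2)	/* a reserved bit: later SENDUIPI will #GP */

/* Called on the schedule-out path, while the task still owns this UPID. */
void uintr_sched_out(struct uintr_upid *upid, bool blocked_in_uintr_wait)
{
	/*
	 * If ON is already set at this point, a notification IPI may be in
	 * flight to this CPU -- the case where we would ideally wait for
	 * delivery before the CPU can go offline.
	 */

	if (blocked_in_uintr_wait) {
		/*
		 * Poison the UPID so a later SENDUIPI faults with #GP instead
		 * of sending an IPI to a possibly-offline APIC ID; the #GP
		 * handler can then emulate the notification in the kernel.
		 */
		upid->nc.status |= UPID_ARM_GP;
	} else {
		/* Ordinary context switch: just suppress notifications. */
		upid->nc.status |= UPID_SN;
	}
}
```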
> The above SENDUIPI vs. CPU hotplug scenario is the same problem as we
> have with regular device interrupts which are targeted at an outgoing
> CPU. We have magic mechanisms in place to handle that to the extent
> possible, but due to the insanity of X86 interrupt handling mechanics
> that still leaves a very tiny hole which might cause a lost and
> subsequently stale interrupt. Nothing we can fix in software.
>
> So on CPU offline the hotplug code walks through all device interrupts
> and checks whether they are targeted at the outgoing CPU. If so they are
> rerouted to an online CPU with lots of care to make the possible race
> window as small as it gets. That's nowadays only a problem on systems
> where interrupt remapping is not available or disabled via commandline.
>
> For tasks which just have the user interrupt armed there is no problem
> because SENDUPI modifies UPID->PIR which is reevaluated when the task
> which got migrated to an online CPU is going back to user space.
>
> The uintr_wait() syscall creates the very same problem as we have with
> device interrupts. Which means we need to make that wait thing:
>
>       upid = T1->upid;
>       upid->vector = UINTR_WAIT_VECTOR;

This is exactly what I'm suggesting we *don't* do. Instead we set a reserved
bit, we decode SENDUIPI in the #GP handler, and we emulate, in-kernel, the
notification process for non-running tasks.

Now that I read the docs some more, I'm seriously concerned about this XSAVE
design. XSAVES with UINTR is destructive -- it clears UINV. If we actually use
this, then the whole last_cpu "preserve the state in registers" optimization
goes out the window. So does anything that happens to assume that merely
saving the state doesn't destroy it on respectable modern CPUs. XRSTORS will
#GP if you XRSTORS twice, which makes me nervous and would need a serious
audit of our XRSTORS paths. This is gross.

--Andy
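[Editor's sketch, for illustration: a rough outline of the "#GP fixup" path
described in the message above, i.e. emulating SENDUIPI in the kernel when it
faults on a poisoned UPID. Every helper and type name below is a hypothetical
stand-in, not an API from the RFC patches or the kernel.]

```c
#include <stdbool.h>
#include <stdint.h>

struct uintr_upid;		/* as in the earlier sketch */
struct task;			/* stand-in for task_struct */

/* Invented stand-ins for what a real implementation would need. */
extern bool decode_senduipi(const void *fault_ip, unsigned long *uitt_index);
extern struct uintr_upid *sender_uitt_upid(struct task *sender, unsigned long index);
extern uint8_t sender_uitt_uv(struct task *sender, unsigned long index);
extern void upid_post_vector(struct uintr_upid *upid, uint8_t uv); /* set bit uv in PIR */
extern struct task *upid_waiter(struct uintr_upid *upid);	/* task blocked in uintr_wait() */
extern void wake_up_task(struct task *t);

/* Returns true if the #GP was a SENDUIPI that we chose to emulate. */
bool fixup_senduipi_gp(struct task *sender, const void *fault_ip)
{
	unsigned long index;
	struct uintr_upid *upid;
	struct task *waiter;

	/* 1. Was the faulting instruction SENDUIPI <reg>? */
	if (!decode_senduipi(fault_ip, &index))
		return false;

	/* 2. Do what the hardware would have done: post the UITT's UV into the PIR. */
	upid = sender_uitt_upid(sender, index);
	if (!upid)
		return false;			/* deliver the #GP normally */
	upid_post_vector(upid, sender_uitt_uv(sender, index));

	/* 3. Instead of an IPI to a possibly-offline APIC ID, wake the waiter. */
	waiter = upid_waiter(upid);
	if (waiter)
		wake_up_task(waiter);

	/* Caller advances RIP past the emulated SENDUIPI. */
	return true;
}
```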