From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86AB5C433ED for ; Tue, 13 Apr 2021 22:59:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 61B9461242 for ; Tue, 13 Apr 2021 22:59:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347853AbhDMW7d (ORCPT ); Tue, 13 Apr 2021 18:59:33 -0400 Received: from mail.kernel.org ([198.145.29.99]:38114 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231787AbhDMW7c (ORCPT ); Tue, 13 Apr 2021 18:59:32 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id 115FD61246 for ; Tue, 13 Apr 2021 22:59:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1618354752; bh=2W54etGrZFjK3LvTvpFBh3mSYBYxMTnNbzR4OPQA1FU=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=ScIRVEJKFsCPTP5/oxbdUJWk4rJKO0dforaKk5k1b0NZGZL5TCsJ2ElcNiwGxAKHa gxGZ1R3E/zYzyAXUQpVRwJnEbx5169juPGVFJCcvEzuunYFL3LrQNIM9EYp6rdTdyj +vjKLv8ZNT1J1J2I4MT7OmEoGHgQjmFj9sPhrvyuJFMQCb31tBCAf3lMFB1FRRbqKd T/8mER3ddgN4Ozc64QaGimsdVsiN/UxaEYhTUeFDgUG+239saI3rbZkShm/Wwez+YS 7Exl7qNZYICzaLWx9iRYMeZHI8qVIm9krfy274wYXUhONmenYlExRseWwyN4RkwjYc JqjxVJSpDGP3g== Received: by mail-ed1-f48.google.com with SMTP id g17so20623020edm.6 for ; Tue, 13 Apr 2021 15:59:11 -0700 (PDT) X-Gm-Message-State: AOAM530ppebeQxc9NHFx8GB+karQ26PkboqkAGdh+LDM+OoIaw0H/6NG N5E9mLhMQkHBqP/bwZmC1H6bzh+bf7ro8CXl3cpRJw== X-Google-Smtp-Source: ABdhPJzJkPw21zTgf71eMpCcBJ3V1F0sHGsjyBMesqTdaShDitlG+tfS7Km1wCffDoYAPcz8pg7cyeKVQJPirZVPmj8= X-Received: by 2002:aa7:d353:: with SMTP id m19mr34103472edr.172.1618354749851; Tue, 13 Apr 2021 15:59:09 -0700 (PDT) MIME-Version: 1.0 References: <87lf9nk2ku.fsf@oldenburg.str.redhat.com> In-Reply-To: From: Andy Lutomirski Date: Tue, 13 Apr 2021 15:58:58 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: Candidate Linux ABI for Intel AMX and hypothetical new related features To: Len Brown Cc: Andy Lutomirski , Willy Tarreau , Florian Weimer , "Bae, Chang Seok" , Dave Hansen , X86 ML , LKML , linux-abi@vger.kernel.org, "libc-alpha@sourceware.org" , Rich Felker , Kyle Huey , Keno Fischer Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 13, 2021 at 3:47 PM Len Brown wrote: > > On Tue, Apr 13, 2021 at 4:16 PM Andy Lutomirski wrote: > > > > On Mon, Apr 12, 2021 at 4:46 PM Len Brown wrote: > > > > > > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski wrote: > > > > > > > AMX: Multiplying a 4x4 matrix probably looks *great* in a > > > > microbenchmark. Do it once and you permanently allocate 8kB (is that > > > > even a constant? can it grow in newer parts?), potentially hurts all > > > > future context switches, and does who-knows-what to Turbo licenses and > > > > such. > > > > > > Intel expects that AMX will be extremely valuable to key workloads. > > > It is true that you may never run that kind of workload on the machine > > > in front of you, > > > and so you have every right to be doubtful about the value of AMX. > > > > I fully believe that AMX will be amazing when used for the right > > workload. The problem is that a library may have no way to tell > > whether a workload is the type of computationally intensive workload > > for which it makes sense. Imagine you have a little function: > > > > int matrix_times_vector(int dim, float *out, const float *matrix, > > const float *vector); > > > > A clever library might use AMX for this. If dim == 4 and the caller > > is planning to call it in a long, tight loop, maybe this even makes > > sense. If dim == 4 and it's being called once, AMX is probably a > > losing proposition. With previous technologies, at least the impact > > was limited to the function itself and maybe once per call to the > > caller. But now, with AMX, the program that invoked this takes a > > performance and memory hit *forever* if it uses AMX once. > > Again... > > As this is a "clever" library, built with a clever toolchain, and the > result is that > TILERELEASE was properly issued at the end of computation. > Thus the hardware knows that the (volatile) AMX registers are no longer live. My argument has *nothing* to do with TILERELEASE. Let me try again. Suppose I write some user code an call into a library that uses AMX because the library authors benchmarked it and determined that using AMX is faster when called in a loop. But I don't call it in a loop. Then I take the transition penalty into and out of AMX code (I'll believe there is no penalty when I see it -- we've had a penalty with VEX and with AVX-512) and my program runs *slower*. And, to top it off, I've just permanently allocated 8kB of extra FPU state buffer, *and* I'm taking either an XCR0 or an XFD write penalty on every future context switch. Someone or something needs to make a decision as to whether AMX should actually be used for a given algorithm. The user library community has swept this under the rug by declaring that libraries should use the best-in-a-tight-loop code for the entire existence of extensions beyond XMM, and the cost keeps getting higher. > > Beyond that, we have the signal handling issue. > > I'm unaware of any unresolved feedback on the signal handling series > other than a wistful "wouldn't a new SIGFAIL be more clear (for future apps) > than the existing SIGSEGV?" I agree with this sentiment, but I don't > think we should hold up a patch to prevent corrupting user data > because a new signal number to describe the scenario doesn't exit. > Particularly since the new code that knows about the new SIGFAIL > will also be new code that has been compiled with the new glibc > that for most cases will prevent this scenario in the first place... > > > One solution, going > > off of what WIlly mentioned, is: > > > > bool amx_begin(void *signal_save_buffer); > > void amx_end(); > > > > In the amx_begin() region, if you get a signal, the AMX state is saved > > in the buffer. Outside the region, if you get a signal and AMX is in > > use, the kernel will either unceremoniously kill the task or will > > deliver SIGYOUBLEWIT. [0] > > I think it is clear that if a new signal ABI is going to be invented, > that it should be opt-in on state, so that it can run fast on machines > far into the future by not choosing to opt-in on anything. > > It isn't clear that changing the signal save state around critical regions > (in multiple threads) so that a single (per process definition) of a signal > handler gets a different result at different times is going to make that > (new) signal handler author especially happy. More likely they > either always want the state, or they do not. Perhaps some form of decision should be reached before AMX lands? Landing AMX in its current form is a decision, and we should make a credible effort to decide if it's the right one. --Andy