From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.4 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6DF0BC43465 for ; Sun, 20 Sep 2020 17:03:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2522220EDD for ; Sun, 20 Sep 2020 17:03:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1600621406; bh=kyo1St4l8qKBQKzJx541WBwShE6xclRqeg2l4fcEcNk=; h=References:In-Reply-To:From:Date:Subject:To:Cc:List-ID:From; b=sZAdSjKDaW2FRER/CeLNLNYWd/mp+SvK/THbYmxnQ2DNSa9XAwNk00VJ2d993BEpB v4uy52JCXV1vu+YRexleaqfWx6RuLj+O6DgRTdS37vdt7x58wc7qW44pjyTzZg8e8e vUdjOPKJ/yJzrLbro1KDWZOpCcN774AvooWrL2uA= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726381AbgITRDZ (ORCPT ); Sun, 20 Sep 2020 13:03:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41628 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726267AbgITRDY (ORCPT ); Sun, 20 Sep 2020 13:03:24 -0400 Received: from mail-ej1-x644.google.com (mail-ej1-x644.google.com [IPv6:2a00:1450:4864:20::644]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 54F29C061755 for ; Sun, 20 Sep 2020 10:03:24 -0700 (PDT) Received: by mail-ej1-x644.google.com with SMTP id gr14so14622165ejb.1 for ; Sun, 20 Sep 2020 10:03:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=W4HUdF5y8IL46li/aLWtqF1XJO6kSUNyNg6lQQlq/lHa09QYfyRuIdGDfLmuGoashc 83+sSg8r4ohTOPsGgHOTDX6qHwsouAY/v6fpdZCCgYQJkl9j9MXkxTiUMxbLIJvpzCCu 42aLHOyMwVjBWHWNqUM6rhcX9wijwR1WFrrsPR++y+L1mW//WOYh/0HdhLuSkDVsydnB L6zl7lUqy2UyzOKbv1xrByqblI8VXR5bRkA88pP3c6bQKqbSB7M8eJ4fRC3Tn+cnylaH HWdYPYqoqyYFn0Q+t40G3z/KbXwGjH0SOD0X4qhCB47AoB/9A4e37EY6QdcBbtUWNm+T HiWQ== X-Gm-Message-State: AOAM5320R4eEFICC6JlHBJizV7idZPF0jgBhuIzRGIwNmSsF1+/H91bn pUqw42ZbhY33Xu+4YUfrY2J30dXcISIdbA== X-Google-Smtp-Source: ABdhPJw8aePyNIBJCqaSdn8XsSIQuoAqvzqCXu5gVZJ8sjQeyd+h0cYIGmgawxVkFcL+HkZKYT7Wig== X-Received: by 2002:a17:906:2cd2:: with SMTP id r18mr48010495ejr.371.1600621402484; Sun, 20 Sep 2020 10:03:22 -0700 (PDT) Received: from mail-ej1-f41.google.com (mail-ej1-f41.google.com. [209.85.218.41]) by smtp.gmail.com with ESMTPSA id lo25sm6938175ejb.53.2020.09.20.10.03.22 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 10:03:22 -0700 (PDT) Received: by mail-ej1-f41.google.com with SMTP id j11so14628391ejk.0 for ; Sun, 20 Sep 2020 10:03:22 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends To: Thomas Gleixner Cc: LKML , linux-arch , Paul McKenney , "the arch/x86 maintainers" , Sebastian Andrzej Siewior , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Andrew Morton , Linux-MM , Russell King , Linux ARM , Chris Zankel , Max Filippov , linux-xtensa@linux-xtensa.org, Jani Nikula , Joonas Lahtinen , Rodrigo Vivi , David Airlie , Daniel Vetter , intel-gfx , dri-devel , Ard Biesheuvel , Herbert Xu , Vineet Gupta , "open list:SYNOPSYS ARC ARCHITECTURE" , Arnd Bergmann , Guo Ren , linux-csky@vger.kernel.org, Michal Simek , Thomas Bogendoerfer , linux-mips@vger.kernel.org, Nick Hu , Greentime Hu , Vincent Chen , Michael Ellerman , Benjamin Herrenschmidt , Paul Mackerras , linuxppc-dev , "David S. Miller" , linux-sparc Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 From: Linus Torvalds Date: Sun, 20 Sep 2020 16:57:40 +0000 Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends Message-Id: List-Id: References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Thomas Gleixner Cc: Juri Lelli , Peter Zijlstra , Sebastian Andrzej Siewior , dri-devel , linux-mips@vger.kernel.org, Ben Segall , Max Filippov , Guo Ren , linux-sparc , Vincent Chen , Will Deacon , Ard Biesheuvel , linux-arch , Vincent Guittot , Herbert Xu , Michael Ellerman , the arch/x86 maintainers , Russell King , linux-csky@vger.kernel.org, David Airlie , Mel Gorman , "open list:SYNOPSYS ARC ARCHITECTURE" , linux-xtensa@linux-xtensa.org, Paul McKenney , intel-gfx , linuxppc-dev , Steven Rostedt , Rodrigo Vivi , Dietmar Eggemann , Linux ARM , Chris Zankel , Michal Simek , Thomas Bogendoerfer , Nick Hu , Linux-MM , Vineet Gupta , LKML , Arnd Bergmann , Paul Mackerras , Andrew Morton , Daniel Bristot de Oliveira , "David S. Miller" , Greentime Hu On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.9 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 265E5C43463 for ; Sun, 20 Sep 2020 17:04:28 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 92E17214F1 for ; Sun, 20 Sep 2020 17:04:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="O7YDY5EZ" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 92E17214F1 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 13259900005; Sun, 20 Sep 2020 13:04:27 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 10A0C900003; Sun, 20 Sep 2020 13:04:27 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F3B0D900005; Sun, 20 Sep 2020 13:04:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0097.hostedemail.com [216.40.44.97]) by kanga.kvack.org (Postfix) with ESMTP id D69D5900003 for ; Sun, 20 Sep 2020 13:04:26 -0400 (EDT) Received: from smtpin07.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 9738E181AEF0B for ; Sun, 20 Sep 2020 17:04:26 +0000 (UTC) X-FDA: 77284063332.07.bag95_5001d5e2713e Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin07.hostedemail.com (Postfix) with ESMTP id 7A9371803F9A7 for ; Sun, 20 Sep 2020 17:04:26 +0000 (UTC) X-HE-Tag: bag95_5001d5e2713e X-Filterd-Recvd-Size: 9372 Received: from mail-ed1-f66.google.com (mail-ed1-f66.google.com [209.85.208.66]) by imf40.hostedemail.com (Postfix) with ESMTP for ; Sun, 20 Sep 2020 17:04:25 +0000 (UTC) Received: by mail-ed1-f66.google.com with SMTP id ay8so10586505edb.8 for ; Sun, 20 Sep 2020 10:04:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=jqnr2279gpG1J4WQ6ljAG2tH8DMVRhSH5TkqaeGDjSZqIjgFME1FABQuKpomJxVtqx GoXx4Jk8wRaSXNJ3gRA38zKrc/FRZjfd0kdT5J9RR5Hg84Se0Xe+QeV9PA83u0TDaBbc 1nC9tnyFOmvKIe7mXOclskntF4BIjxIO3XXaCQJAUhZBOfijNGDezKHr1DGSZrC6/EuI lSU81j+3tdU1NdG/H8weJ5qju19dqrzoE32OaCl3Y5ZsQz/N0biptxgkUIuWSAiM8Ssg pT2BKL1/O5NpSK4zTpfqRL1owF+Rv6yUWRLDyCCsF0BsSzQvtbt8cBrcpJZ9qh//9P/u G7pQ== X-Gm-Message-State: AOAM531/3wttup1rhND9n7uw4UqXGYycaLnzjxgJvqsJuULypzH/gDN3 xEJV0/Y1rTtM8zQxiBOzk5v6vk/HbjBrYQ== X-Google-Smtp-Source: ABdhPJxHcREMZBPBiLFEoyZQwZcbyrg/YuvAd4b4TY/+l7g4WghHVA3X2Lsx+vZ4K+tuN1cKlLG9Lw== X-Received: by 2002:aa7:c885:: with SMTP id p5mr47884269eds.127.1600621464175; Sun, 20 Sep 2020 10:04:24 -0700 (PDT) Received: from mail-wr1-f47.google.com (mail-wr1-f47.google.com. [209.85.221.47]) by smtp.gmail.com with ESMTPSA id a15sm6857511eje.16.2020.09.20.10.04.23 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 10:04:23 -0700 (PDT) Received: by mail-wr1-f47.google.com with SMTP id o5so10341269wrn.13 for ; Sun, 20 Sep 2020 10:04:23 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends To: Thomas Gleixner Cc: LKML , linux-arch , Paul McKenney , "the arch/x86 maintainers" , Sebastian Andrzej Siewior , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Andrew Morton , Linux-MM , Russell King , Linux ARM , Chris Zankel , Max Filippov , linux-xtensa@linux-xtensa.org, Jani Nikula , Joonas Lahtinen , Rodrigo Vivi , David Airlie , Daniel Vetter , intel-gfx , dri-devel , Ard Biesheuvel , Herbert Xu , Vineet Gupta , "open list:SYNOPSYS ARC ARCHITECTURE" , Arnd Bergmann , Guo Ren , linux-csky@vger.kernel.org, Michal Simek , Thomas Bogendoerfer , linux-mips@vger.kernel.org, Nick Hu , Greentime Hu , Vincent Chen , Michael Ellerman , Benjamin Herrenschmidt , Paul Mackerras , linuxppc-dev , "David S. Miller" , linux-sparc Content-Type: text/plain; charset="UTF-8" X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.5 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 68927C43465 for ; Sun, 20 Sep 2020 17:08:10 +0000 (UTC) Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 50A0D2065D for ; Sun, 20 Sep 2020 17:08:09 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="O7YDY5EZ" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 50A0D2065D Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Received: from bilbo.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 4BvYtF4z52zDqXC for ; Mon, 21 Sep 2020 03:08:05 +1000 (AEST) Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=linuxfoundation.org (client-ip=2a00:1450:4864:20::541; helo=mail-ed1-x541.google.com; envelope-from=torvalds@linuxfoundation.org; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: lists.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=linux-foundation.org header.i=@linux-foundation.org header.a=rsa-sha256 header.s=google header.b=O7YDY5EZ; dkim-atps=neutral Received: from mail-ed1-x541.google.com (mail-ed1-x541.google.com [IPv6:2a00:1450:4864:20::541]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4BvYqK5rMczDqC6 for ; Mon, 21 Sep 2020 03:05:31 +1000 (AEST) Received: by mail-ed1-x541.google.com with SMTP id n13so10553198edo.10 for ; Sun, 20 Sep 2020 10:05:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=X1pkD06gdXBpL3PxrwShyaYMqoKSRvsW0OcEhik+njNtWF1qaeRKF37lwKhvRNaDEj 0qiuxGg7Jljyd4/I17X1tgE4/fOndoZtM92NR9UIQcfADAIeach+1YO7WmU+LdjivYPI sPepGTgF/Us6BtuIxWmbWccGfmL6CJkGvDYlM49ppbnGVn0UjfdNVHt0ZbR5ojPamt15 Z++SBLjGhutnM/XOaIKJKCvKm3jm9MYG0TMBIdvIt5CvwNyuRiCIh6nJt9B/TmWMmdN9 FSSMV/T2eosAx+xyIeQ97ZJYcUCw/6mO9pj98n9pCzLARhbuq7mS7hEc3KEZyzE3zBun dZ6A== X-Gm-Message-State: AOAM532vmsUV5xpFP9EYrd7ll+zdQpNt5tnr2T5Y7h5+cQcQ9yOSCo5B dd3QCdzM+toASdIWD/9Gcw8h9nGust1OOA== X-Google-Smtp-Source: ABdhPJySC1iym3Zo0uvNdG5J40HH/Ig/duFf9Gm2x/hVkI6vq/f+vTjIAB0ntZ1eagPkkf3UzoJ2og== X-Received: by 2002:a05:6402:1717:: with SMTP id y23mr50731057edu.112.1600621527071; Sun, 20 Sep 2020 10:05:27 -0700 (PDT) Received: from mail-wr1-f43.google.com (mail-wr1-f43.google.com. [209.85.221.43]) by smtp.gmail.com with ESMTPSA id b13sm6822160edf.89.2020.09.20.10.05.26 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 10:05:26 -0700 (PDT) Received: by mail-wr1-f43.google.com with SMTP id j2so10361938wrx.7 for ; Sun, 20 Sep 2020 10:05:26 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends To: Thomas Gleixner Content-Type: text/plain; charset="UTF-8" X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , Peter Zijlstra , Sebastian Andrzej Siewior , Joonas Lahtinen , dri-devel , linux-mips@vger.kernel.org, Ben Segall , Max Filippov , Guo Ren , linux-sparc , Vincent Chen , Will Deacon , Ard Biesheuvel , linux-arch , Vincent Guittot , Herbert Xu , the arch/x86 maintainers , Russell King , linux-csky@vger.kernel.org, David Airlie , Mel Gorman , "open list:SYNOPSYS ARC ARCHITECTURE" , linux-xtensa@linux-xtensa.org, Paul McKenney , intel-gfx , linuxppc-dev , Steven Rostedt , Jani Nikula , Rodrigo Vivi , Dietmar Eggemann , Linux ARM , Chris Zankel , Michal Simek , Thomas Bogendoerfer , Nick Hu , Linux-MM , Vineet Gupta , LKML , Arnd Bergmann , Daniel Vetter , Paul Mackerras , Andrew Morton , Daniel Bristot de Oliveira , "David S. Miller" , Greentime Hu Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Sender: "Linuxppc-dev" On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FADEC43465 for ; Sun, 20 Sep 2020 16:58:08 +0000 (UTC) Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 8EB9020EDD for ; Sun, 20 Sep 2020 16:58:07 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=lists.infradead.org header.i=@lists.infradead.org header.b="DhsDKbLR"; dkim=fail reason="signature verification failed" (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="O7YDY5EZ" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8EB9020EDD Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-snps-arc-bounces+linux-snps-arc=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=merlin.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:Cc:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:To:Subject:Message-ID:Date:From:In-Reply-To: References:MIME-Version:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=JH0Tum3fBMc+lZT2tw2vx7fc1F7xGF9sS9NAqvBqYmY=; b=DhsDKbLRKg/V/YUqOD5eVq48q wkwV7takrOXmJIy29OkIlaw+Bh8Hr5p0Rc6cRTnZ1YPBbfp5+bhebNyp0fmF6u9AXxsBuq2hI4ia1 hMHBbY1fhpaDxonmz5T6ktmUum/zzv8rAxYNub+p1dGK46WoqvH8yEX3+zVW8Vc8ndEAP4cwzWH0B HN0KSO4+3FACT5Bvvr8HZ+vWX/04Q4lyO9V62ow/iceugC+6283gHLq0E9PO/a2j5K9QFr2F+RZ+1 ywAVVj38cYOA/KeChexvZAMS+vgKtHcIiIeq611QiOhqs0sb90yFZAJ/ksbINpwVwdZe4f0QzcmkB MDcQa37qA==; Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.92.3 #3 (Red Hat Linux)) id 1kK2f8-0005G1-Af; Sun, 20 Sep 2020 16:58:06 +0000 Received: from mail-lj1-x243.google.com ([2a00:1450:4864:20::243]) by merlin.infradead.org with esmtps (Exim 4.92.3 #3 (Red Hat Linux)) id 1kK2f5-0005Ei-N5 for linux-snps-arc@lists.infradead.org; Sun, 20 Sep 2020 16:58:05 +0000 Received: by mail-lj1-x243.google.com with SMTP id k25so9158621ljk.0 for ; Sun, 20 Sep 2020 09:58:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=h04+0x1Z+2YfuzPFUZvVp+xxbQhAI89FpDU3YW+WVpdKMorADJDxM6hlTvCzM1H4Yl k3QVCv4eaHtFq833frLciuj/DX1tjL9n8H7YXQG2r3Io0rbe4Q4Q7brpIpP+yOLdNB3g W6z1bxHJ/LwIbQ9LvFaoc3MR2XWYnhMbJbB0kHikp+fpsZgxeGamU23Mtuj8UHLX+7hr dChoNwbtlTLijSzPwy38Rv4e/94GmBwHC9nOzaM8gMQpmExnaJJaWuaKhUUCz69nh7lu QFxcQ/It/8V3u8sYbqfi9hpxgamfXyZrkQktULAni1ZWVjuJaFmjEcH9yyQZpPzjEMmv bUZw== X-Gm-Message-State: AOAM531NR4CiKh8zh7HBs1pMAV0Wj0QqWHmnlw8P38uMaDR2HOWw/6Tl gNOP9aCn0bW6sroDZRdUg5KFzThq6P0z2g== X-Google-Smtp-Source: ABdhPJwfrZeu60AN4uwl9fy9NHpCM9m1ZCcPOosz+p5rfE6zeWgRMGVx3KzmlWND2sv/FGsucengUA== X-Received: by 2002:a05:651c:124b:: with SMTP id h11mr15636888ljh.172.1600621080282; Sun, 20 Sep 2020 09:58:00 -0700 (PDT) Received: from mail-lf1-f53.google.com (mail-lf1-f53.google.com. [209.85.167.53]) by smtp.gmail.com with ESMTPSA id q21sm1172658lfo.219.2020.09.20.09.57.58 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 09:57:59 -0700 (PDT) Received: by mail-lf1-f53.google.com with SMTP id z17so11402485lfi.12 for ; Sun, 20 Sep 2020 09:57:58 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends To: Thomas Gleixner X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20200920_125803_829035_CFE9DA92 X-CRM114-Status: GOOD ( 24.85 ) X-BeenThere: linux-snps-arc@lists.infradead.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on Synopsys ARC Processors List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , Peter Zijlstra , Benjamin Herrenschmidt , Sebastian Andrzej Siewior , Joonas Lahtinen , dri-devel , linux-mips@vger.kernel.org, Ben Segall , Max Filippov , Guo Ren , linux-sparc , Vincent Chen , Will Deacon , Ard Biesheuvel , linux-arch , Vincent Guittot , Herbert Xu , Michael Ellerman , the arch/x86 maintainers , Russell King , linux-csky@vger.kernel.org, David Airlie , Mel Gorman , "open list:SYNOPSYS ARC ARCHITECTURE" , linux-xtensa@linux-xtensa.org, Paul McKenney , intel-gfx , linuxppc-dev , Steven Rostedt , Jani Nikula , Rodrigo Vivi , Dietmar Eggemann , Linux ARM , Chris Zankel , Michal Simek , Thomas Bogendoerfer , Nick Hu , Linux-MM , Vineet Gupta , LKML , Arnd Bergmann , Daniel Vetter , Paul Mackerras , Andrew Morton , Daniel Bristot de Oliveira , "David S. Miller" , Greentime Hu Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-snps-arc" Errors-To: linux-snps-arc-bounces+linux-snps-arc=archiver.kernel.org@lists.infradead.org On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus _______________________________________________ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7B788C43465 for ; Sun, 20 Sep 2020 16:59:51 +0000 (UTC) Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 20BFF2158C for ; Sun, 20 Sep 2020 16:59:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=lists.infradead.org header.i=@lists.infradead.org header.b="1EAR+AYz"; dkim=fail reason="signature verification failed" (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="O7YDY5EZ" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 20BFF2158C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=merlin.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:Cc:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:To:Subject:Message-ID:Date:From:In-Reply-To: References:MIME-Version:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=oamAxioNct4J9SAkbV6xqxVejZ8hTexL/HsUlmMUKG4=; b=1EAR+AYzweLfSouL9dbpQB2v6 ICqerkN0vKlPPG/ogiMDjzzMbLjfadDzH5Iwr031mW9i3Uct7mzM080rowBIOpXfx6fIBKr9EzhZi UlaCJkX7G9gtZ2NlqCN2tehDrq+swWwr9htBxYWSjmlqkWtfyxVb/CsJeVUIGpO2xFoEFscPciRAD 56kigEEN3kakVO/ugCq6WyQAy0J7gs6fsCsJxd7/jf8KmQ4ynItkYaLB6eeZmrAChTRVqzEDk/8sW OLbnfHhjEKBrqS1ovbM+MPjL3p2V5Ibddv7c8nGNeOAFX18uA57GJhhHVx/CNFX4deMXPI2vDP+2M IQza7vScw==; Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.92.3 #3 (Red Hat Linux)) id 1kK2f8-0005G6-OP; Sun, 20 Sep 2020 16:58:06 +0000 Received: from mail-lf1-x141.google.com ([2a00:1450:4864:20::141]) by merlin.infradead.org with esmtps (Exim 4.92.3 #3 (Red Hat Linux)) id 1kK2f5-0005Eh-M0 for linux-arm-kernel@lists.infradead.org; Sun, 20 Sep 2020 16:58:05 +0000 Received: by mail-lf1-x141.google.com with SMTP id q8so11442750lfb.6 for ; Sun, 20 Sep 2020 09:58:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=Ejm8UjbfTUIrqoV5t+nkBQjsqmM0cJll+/zAoC0NHr3NiurO755SG8h9l/P6bne/p2 fTZmaXWRJwdZvGsNO+NZKdhB/VhOMaoKtFNsj3Hh8raU7cxbZCQqp62pALiiIq6bGhAU QXVplrk6xWt77xgXD9isDqCFV4VSzzBY6GsSoloj7WWuOZ/vd+6lsI9NXSRlUkXJuxyv JrEw1SrJzfae2qimK51TMzGc6OPoLJcZ6Bo5+1sLIH5I7UCxTX3o3RKZmDEEX246h1h5 VKZ6EIjTf++VCbIGa8brSxtEunAAvKgux7/4geuQPCy87QMqYFmqiAk9GaCSjlXbqscE n1Hw== X-Gm-Message-State: AOAM530q4n8c1DejPuZbRqjRPkXqGMt4GGTcE7mv3Nyl9QwsWkMoPKHN p0xLTMbxbhwKyYzPPXPBfY7x1WXOiSFtsQ== X-Google-Smtp-Source: ABdhPJxTXz7NnOT+/DrLvYdBuhzJRy6UQC2AvZq1cYtSEHolSnqT9fLsx1Wac8ov93glp+LZPq7svA== X-Received: by 2002:a05:6512:74:: with SMTP id i20mr13362897lfo.219.1600621079693; Sun, 20 Sep 2020 09:57:59 -0700 (PDT) Received: from mail-lf1-f43.google.com (mail-lf1-f43.google.com. [209.85.167.43]) by smtp.gmail.com with ESMTPSA id s20sm1915111lfs.135.2020.09.20.09.57.57 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 09:57:58 -0700 (PDT) Received: by mail-lf1-f43.google.com with SMTP id y2so11425659lfy.10 for ; Sun, 20 Sep 2020 09:57:57 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends To: Thomas Gleixner X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20200920_125803_829667_E045F0BD X-CRM114-Status: GOOD ( 26.38 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , Peter Zijlstra , Benjamin Herrenschmidt , Sebastian Andrzej Siewior , Joonas Lahtinen , dri-devel , linux-mips@vger.kernel.org, Ben Segall , Max Filippov , Guo Ren , linux-sparc , Vincent Chen , Will Deacon , Ard Biesheuvel , linux-arch , Vincent Guittot , Herbert Xu , Michael Ellerman , the arch/x86 maintainers , Russell King , linux-csky@vger.kernel.org, David Airlie , Mel Gorman , "open list:SYNOPSYS ARC ARCHITECTURE" , linux-xtensa@linux-xtensa.org, Paul McKenney , intel-gfx , linuxppc-dev , Steven Rostedt , Jani Nikula , Rodrigo Vivi , Dietmar Eggemann , Linux ARM , Chris Zankel , Michal Simek , Thomas Bogendoerfer , Nick Hu , Linux-MM , Vineet Gupta , LKML , Arnd Bergmann , Daniel Vetter , Paul Mackerras , Andrew Morton , Daniel Bristot de Oliveira , "David S. Miller" , Greentime Hu Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.5 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8C8B2C43463 for ; Sun, 20 Sep 2020 17:03:37 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 2CCD720EDD for ; Sun, 20 Sep 2020 17:03:37 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="O7YDY5EZ" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2CCD720EDD Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1A3376E057; Sun, 20 Sep 2020 17:03:36 +0000 (UTC) Received: from mail-lj1-x241.google.com (mail-lj1-x241.google.com [IPv6:2a00:1450:4864:20::241]) by gabe.freedesktop.org (Postfix) with ESMTPS id D31586E057 for ; Sun, 20 Sep 2020 17:03:32 +0000 (UTC) Received: by mail-lj1-x241.google.com with SMTP id u21so9132467ljl.6 for ; Sun, 20 Sep 2020 10:03:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=fL6Vg3MWG4JUHXgFojq8eVhG/1CCFGtGKEk2SIMdxfIWlBBzyn5jQhTkZN8nEXCtZq +BRZ6H+2UdxOBxbXRDp6Fb6g3gCq/pBF0cCu1HT7DVbA4GVreWD6Ru+B6xu328txAH5L sGEUAoCvlT4BvM9v/LIQVZMEH4RuEtRn2t4Nfl4ajXYLAssMz7tBNjzOvy5gXdMu+acr Q+NW2gxWlkCCDY3Je6s7o85BhhXldbIjTImNedCM5JCaHvWoiLjyAs0vkChLR/+wmXv9 bms81ZW1YbVCxGKvk3giMHjHc0H7J78RuuRWG3fvM5y+/GTne3AevFv8jPb/EunulQMZ TH6A== X-Gm-Message-State: AOAM530C010x7KkyVrpYWlgiEYj9LgGfgN2U7Ac/EBeGKlXU0oEnq61O 0FJja3lO6OTQPYL9uMofTyqex8D5XwebRg== X-Google-Smtp-Source: ABdhPJwDD3FMU8Mwqp0lQBhiV4TLvr7uaurfIrw41t6AUoljH7EQfm28z5ZcMSlYsaYMO5VspYzO8w== X-Received: by 2002:a2e:9cc2:: with SMTP id g2mr8295985ljj.77.1600621410681; Sun, 20 Sep 2020 10:03:30 -0700 (PDT) Received: from mail-lf1-f42.google.com (mail-lf1-f42.google.com. [209.85.167.42]) by smtp.gmail.com with ESMTPSA id e14sm2049048ljp.15.2020.09.20.10.03.30 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 10:03:30 -0700 (PDT) Received: by mail-lf1-f42.google.com with SMTP id x69so11459354lff.3 for ; Sun, 20 Sep 2020 10:03:30 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends To: Thomas Gleixner X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , Peter Zijlstra , Sebastian Andrzej Siewior , dri-devel , linux-mips@vger.kernel.org, Ben Segall , Max Filippov , Guo Ren , linux-sparc , Vincent Chen , Will Deacon , Ard Biesheuvel , linux-arch , Vincent Guittot , Herbert Xu , Michael Ellerman , the arch/x86 maintainers , Russell King , linux-csky@vger.kernel.org, David Airlie , Mel Gorman , "open list:SYNOPSYS ARC ARCHITECTURE" , linux-xtensa@linux-xtensa.org, Paul McKenney , intel-gfx , linuxppc-dev , Steven Rostedt , Rodrigo Vivi , Dietmar Eggemann , Linux ARM , Chris Zankel , Michal Simek , Thomas Bogendoerfer , Nick Hu , Linux-MM , Vineet Gupta , LKML , Arnd Bergmann , Paul Mackerras , Andrew Morton , Daniel Bristot de Oliveira , "David S. Miller" , Greentime Hu Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.5 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A26F5C43463 for ; Sun, 20 Sep 2020 17:04:15 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 50DEE20EDD for ; Sun, 20 Sep 2020 17:04:15 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="O7YDY5EZ" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 50DEE20EDD Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=intel-gfx-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1D3C76E060; Sun, 20 Sep 2020 17:04:14 +0000 (UTC) Received: from mail-lj1-x244.google.com (mail-lj1-x244.google.com [IPv6:2a00:1450:4864:20::244]) by gabe.freedesktop.org (Postfix) with ESMTPS id 0D6386E060 for ; Sun, 20 Sep 2020 17:03:58 +0000 (UTC) Received: by mail-lj1-x244.google.com with SMTP id a22so9089832ljp.13 for ; Sun, 20 Sep 2020 10:03:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=JOErmRLwAnSeJMGLB/DFuXoC0C9aJZ7TCm4WDtkW6B+O9IKhZ3mqtzwg+lt7pX/8mD y+Eq3STqPaPm50DaOXvNwXbUUkVHP6fCz/AHPsnXoshYd+VLtwap/icGavglxtCQuRkr 25HdDTH3xvu+VtRn40yznl1PHcdAmmhQj60A5yg564O0czZ1NcG0czQyaeuZTwGxLGcu FZhZGoJjuEYjhGvQM4O1Nrpx/3axEMzus+FAf78Vzs6PwlssvqQPzVvd9bQiYlnGS+I1 6FKu0cz3Rp36jElQnfWpw/RKea81v70V4BGiOB+BhOzUD/o5W3LO7iE7395GnFWJcqHN RGIw== X-Gm-Message-State: AOAM532+pBxIwFBoIN98CJzb+aCUf+bl9TzJvTLFGSblfE0wRcqaKv4u QsBmBNayGn77n/2cKuZab1KaQwURFTf4gA== X-Google-Smtp-Source: ABdhPJzkz3ADDCBu5jPLxzLvzQszAouVufdTeSttOCl0IV/xIm2cvXVj7bSVB8wEwxPPd7qjaa7S0w== X-Received: by 2002:a2e:a590:: with SMTP id m16mr15451558ljp.166.1600621436066; Sun, 20 Sep 2020 10:03:56 -0700 (PDT) Received: from mail-lf1-f52.google.com (mail-lf1-f52.google.com. [209.85.167.52]) by smtp.gmail.com with ESMTPSA id l17sm1895349lfj.278.2020.09.20.10.03.55 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 10:03:55 -0700 (PDT) Received: by mail-lf1-f52.google.com with SMTP id y11so11465144lfl.5 for ; Sun, 20 Sep 2020 10:03:55 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: To: Thomas Gleixner Subject: Re: [Intel-gfx] [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , Peter Zijlstra , Benjamin Herrenschmidt , Sebastian Andrzej Siewior , dri-devel , linux-mips@vger.kernel.org, Ben Segall , Max Filippov , Guo Ren , linux-sparc , Vincent Chen , Will Deacon , Ard Biesheuvel , linux-arch , Herbert Xu , Michael Ellerman , the arch/x86 maintainers , Russell King , linux-csky@vger.kernel.org, David Airlie , Mel Gorman , "open list:SYNOPSYS ARC ARCHITECTURE" , linux-xtensa@linux-xtensa.org, Paul McKenney , intel-gfx , linuxppc-dev , Steven Rostedt , Dietmar Eggemann , Linux ARM , Chris Zankel , Michal Simek , Thomas Bogendoerfer , Nick Hu , Linux-MM , Vineet Gupta , LKML , Arnd Bergmann , Paul Mackerras , Andrew Morton , Daniel Bristot de Oliveira , "David S. Miller" , Greentime Hu Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx