From mboxrd@z Thu Jan 1 00:00:00 1970
From: ira.weiny@intel.com
To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra
Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection
Date: Fri, 9 Oct 2020 12:49:38 -0700
Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com>
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>
References: <20201009195033.3208459-1-ira.weiny@intel.com>
CC: Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 x86@kernel.org, Dave Hansen, Fenghua Yu, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-nvdimm@lists.01.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
 netdev@vger.kernel.org, bpf@vger.kernel.org, kexec@lists.infradead.org,
 linux-bcache@vger.kernel.org, linux-mtd@lists.infradead.org, devel@driverdev.osuosl.org,
 linux-efi@vger.kernel.org, linux-mmc@vger.kernel.org, linux-scsi@vger.kernel.org,
 target-devel@vger.kernel.org,
 linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, linux-ext4@vger.kernel.org,
 linux-aio@kvack.org, io-uring@vger.kernel.org, linux-erofs@lists.ozlabs.org,
 linux-um@lists.infradead.org, linux-ntfs-dev@lists.sourceforge.net,
 reiserfs-devel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net,
 linux-nilfs@vger.kernel.org, cluster-devel@redhat.com, ecryptfs@vger.kernel.org,
 linux-cifs@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-afs@lists.infradead.org,
 linux-rdma@vger.kernel.org, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
 intel-gfx@lists.freedesktop.org, drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
 xen-devel@lists.xenproject.org, linux-cachefs@redhat.com, samba-technical@lists.samba.org,
 intel-wired-lan@lists.osuosl.org
List-Id: "Linux-nvdimm developer list."
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

From: Ira Weiny

Device managed memory exposes itself to the kernel direct map, which
allows stray pointers to access this device memory.

Stray pointers to normal memory may result in a crash or other
undesirable behavior which, while unfortunate, is usually recoverable
with a reboot.  Stray accesses, specifically stray writes, to areas such
as non-volatile memory are permanent in nature and are thus more likely
to result in permanent user data loss than stray accesses to other
memory areas.

Furthermore, we protect against reads, which can help with speculative
reads to poison areas as well.  But this is a secondary reason.

Set up an infrastructure for extra device access protection.  Then
implement the new protection using the new Protection Keys Supervisor
(PKS) on architectures which support it.

To enable this extra protection, devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.
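The opt-in described above can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: the two flag values and PKEY_INVALID mirror the patch, but `struct model_pgmap`, `model_pgprot_get()`, `model_pgprot_pkey()` and the assumed key position (bits 59..62 of the protection value) are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values mirror the patch; everything named model_* is hypothetical. */
#define PGMAP_ALTMAP_VALID (1u << 0)
#define PGMAP_PROT_ENABLED (1u << 1)
#define PKEY_INVALID       INT_MIN

/* Assumption for this model only: the key occupies bits 59..62. */
#define MODEL_PKEY_SHIFT   59
#define MODEL_PAGE_PKEY(k) ((uint64_t)(k) << MODEL_PKEY_SHIFT)

/* Stand-in for the one field of struct dev_pagemap that matters here. */
struct model_pgmap {
	unsigned int flags;
};

/*
 * Tag the protection value with the device key only when the mapping
 * opted in AND a key was actually obtained at init time.
 */
static inline uint64_t model_pgprot_get(const struct model_pgmap *pgmap,
					uint64_t prot, int pkey)
{
	assert(pgmap != NULL);
	if ((pgmap->flags & PGMAP_PROT_ENABLED) && pkey != PKEY_INVALID)
		prot |= MODEL_PAGE_PKEY(pkey);
	return prot;
}

/* Read back the key a protection value carries; 0 is the default key. */
static inline int model_pgprot_pkey(uint64_t prot)
{
	return (int)((prot >> MODEL_PKEY_SHIFT) & 0xf);
}
```

In this model, a mapping that sets PGMAP_PROT_ENABLED comes back tagged with the shared key, while a mapping without the flag, or any mapping when no key could be allocated, is returned unchanged and behaves as it did before.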

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here.  The kmap infrastructure
is implemented in a follow-on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the
type of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per-task
reference count.  This reference count does two things:

1) Allows a thread to nest calls to disable protection such that the
   first call to re-enable protection does not 'break' the last access
   of the pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes, for
   example when looping over a sequence of pmem pages.

In addition, we must ensure the reference count is preserved through an
exception, so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use
a kmap call.

The following shows how this works through an exception:

    ...
                         // ref == 0
    dev_access_enable()  // ref += 1 ==> disable protection
        irq()
            // enable protection
            // ref = 0
            _handler()
                dev_access_enable()   // ref += 1 ==> disable protection
                dev_access_disable()  // ref -= 1 ==> enable protection
            // WARN_ON(ref != 0)
            // disable protection
    do_pmem_thing()      // all good here
    dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
    ...

Nested exceptions operate the same way, with each exception storing the
interrupted exception state all the way down.

The pkey value is never freed, as this optimizes the implementation to
be either on or off using a static branch conditional in the fast paths.
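
The nesting and exception flow traced above can be modelled in userspace. This is a minimal sketch, assuming a single task and no real hardware: every `model_*` name is hypothetical, `access_allowed` stands in for the key's state in the PKS register, and `msr_writes` counts only the writes triggered by the reference count (the register save/restore done at exception entry and exit is deliberately not counted).

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins; all model_* names are hypothetical, not kernel APIs. */
struct model_task {
	unsigned int dev_page_access_ref; /* per-task nesting count */
	bool access_allowed;              /* stands in for the key's PKRS state */
	unsigned int msr_writes;          /* simulated register updates */
};

struct model_irq_state {
	unsigned int saved_ref;
	bool saved_access;
};

static inline void model_access_enable(struct model_task *t)
{
	/* Only the 0 -> 1 transition touches the simulated register. */
	if (!t->dev_page_access_ref++) {
		t->access_allowed = true;
		t->msr_writes++;
	}
}

static inline void model_access_disable(struct model_task *t)
{
	assert(t->dev_page_access_ref > 0); /* unbalanced disable is a bug */
	/* Only the 1 -> 0 transition restores protection. */
	if (!--t->dev_page_access_ref) {
		t->access_allowed = false;
		t->msr_writes++;
	}
}

/* Exception entry: stash the interrupted count, start the handler at 0. */
static inline void model_irq_enter(struct model_task *t,
				   struct model_irq_state *s)
{
	s->saved_ref = t->dev_page_access_ref;
	s->saved_access = t->access_allowed;
	t->dev_page_access_ref = 0;
	t->access_allowed = false;
}

/* Exception exit: the handler must be balanced; restore what it interrupted. */
static inline void model_irq_exit(struct model_task *t,
				  const struct model_irq_state *s)
{
	assert(t->dev_page_access_ref == 0);
	t->dev_page_access_ref = s->saved_ref;
	t->access_allowed = s->saved_access;
}
```

Because each exception level keeps its own `struct model_irq_state` on its stack, nested exceptions fall out naturally: each level saves the count it interrupted and restores it on the way out. Two nested enables followed by two disables cost two simulated writes rather than four, which is the saving point 2) refers to.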

Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Signed-off-by: Ira Weiny
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access.  Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done.  Then restore it for the interrupted task.
  */
 noinstr void irq_save_pkrs(irqentry_state_t *state)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state->pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
 	/*
 	 * The thread_pkrs must be maintained separately to prevent global
 	 * overrides from 'sticking' on a thread.
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state)
 
 	write_pkrs(state->pkrs);
 	current->thread.saved_pkrs = state->thread_pkrs;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state->pkrs_ref;
+#endif
 }
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c3b361ffa059..06743cce2dbf 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 #ifndef irqentry_state
 typedef struct irqentry_state {
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int pkrs_ref;
+#endif
 	u32 pkrs;
 	u32 thread_pkrs;
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..b6713ee7b218 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROT_ENABLED	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..9e845515ff15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ *    1) If no mappings have been enabled with extra protection we skip this
+ *       entirely
+ *    2) Skip pages which are not ZONE_DEVICE
+ *    3) Only then check if this particular page was mapped with extra
+ *       protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int			dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas.  This protects against access to device memory which is not
+	  intended such as stray writes.  This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note: all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.  If not, protection
+ * will remain disabled.  The key acquisition is attempted at init time and
+ * never again.  So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void __dev_access_disable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (!--current->dev_page_access_ref)
+		pks_mknoaccess(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_disable);
+
+void __dev_access_enable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (!current->dev_page_access_ref++)
+		pks_mkrdwr(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state.  Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never freed.  This is run at init time and we either get
+ * the key or we do not.  We need to do this to maintain a constant key (or
+ * not) as device memory is added or removed.
+ */ +static int __init __dev_access_protection_init(void) +{ + int pkey = pks_key_alloc("Device Memory"); + + if (pkey < 0) + return 0; + + dev_page_pkey = pkey; + + return 0; +} +subsys_initcall(__dev_access_protection_init); +#else +static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot) +{ + return prot; +} +static void dev_pgprot_put(struct dev_pagemap *pgmap) +{ +} +#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */ + static void pgmap_array_delete(struct resource *res) { xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end), @@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap) pgmap_array_delete(res); WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(); + dev_pgprot_put(pgmap); } EXPORT_SYMBOL_GPL(memunmap_pages); @@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) int error, is_ram; bool need_devmap_managed = true; + params.pgprot = dev_pgprot_get(pgmap, params.pgprot); + switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) { -- 2.28.0.rc0.12.gb6a658bd00c9 _______________________________________________ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-leave@lists.01.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 19562C2D0AE for ; Fri, 9 Oct 2020 20:11:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 
D434B221FD for ; Fri, 9 Oct 2020 20:11:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2391881AbgJIUKi (ORCPT ); Fri, 9 Oct 2020 16:10:38 -0400 Received: from mga04.intel.com ([192.55.52.120]:43246 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2403817AbgJITu7 (ORCPT ); Fri, 9 Oct 2020 15:50:59 -0400 IronPort-SDR: I+kluzUq8pxlBDtt5OqAnT2WCF+BNOozZc8KJ9HRHJVfpcd0kw39MYtQEO55RaZxKxDv2NPRPB 5bQ8ZYnMvMug== X-IronPort-AV: E=McAfee;i="6000,8403,9769"; a="162893218" X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="162893218" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:56 -0700 IronPort-SDR: WlCZORaTV0ocl1yC7IDa4J4IbRZJZvP+KytzS1LtxRNRUTgHeg4W5ZWyzNndx3DpBXRsqL9R3c CQItA3N5s8ig== X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="462300571" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:55 -0700 From: ira.weiny@intel.com To: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , Peter Zijlstra Cc: Ira Weiny , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , x86@kernel.org, Dave Hansen , Dan Williams , Fenghua Yu , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvdimm@lists.01.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org, kexec@lists.infradead.org, linux-bcache@vger.kernel.org, linux-mtd@lists.infradead.org, devel@driverdev.osuosl.org, linux-efi@vger.kernel.org, linux-mmc@vger.kernel.org, linux-scsi@vger.kernel.org, target-devel@vger.kernel.org, 
linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-aio@kvack.org, io-uring@vger.kernel.org, linux-erofs@lists.ozlabs.org, linux-um@lists.infradead.org, linux-ntfs-dev@lists.sourceforge.net, reiserfs-devel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-nilfs@vger.kernel.org, cluster-devel@redhat.com, ecryptfs@vger.kernel.org, linux-cifs@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-afs@lists.infradead.org, linux-rdma@vger.kernel.org, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, drbd-dev@lists.linbit.com, linux-block@vger.kernel.org, xen-devel@lists.xenproject.org, linux-cachefs@redhat.com, samba-technical@lists.samba.org, intel-wired-lan@lists.osuosl.org Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection Date: Fri, 9 Oct 2020 12:49:38 -0700 Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com> References: <20201009195033.3208459-1-ira.weiny@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Ira Weiny Device managed memory exposes itself to the kernel direct map which allows stray pointers to access these device memories. Stray pointers to normal memory may result in a crash or other undesirable behavior which, while unfortunate, are usually recoverable with a reboot. Stray access, specifically stray writes, to areas such as non-volatile memory are permanent in nature and thus are more likely to result in permanent user data loss vs stray access to other memory areas. Furthermore, we protect against reads which can help with speculative reads to poison areas as well. But this is a secondary reason. Set up an infrastructure for extra device access protection. 
Then implement the new protection using the new Protection Keys Supervisor (PKS) on architectures which support it. To enable this extra protection devices specify a flag in the pgmap to indicate that these areas wish to use additional protection. Kernel code which intends to access this memory can do so automatically through the use of the kmap infrastructure calling into dev_access_[enable|disable]() described here. The kmap infrastructure is implemented in a follow on patch. In addition, users can directly enable/disable the access through dev_access_[enable|disable]() if they have a priori knowledge of the type of pages they are accessing. All calls to enable/disable protection flow through dev_access_[enable|disable]() and are nestable by the use of a per task reference count. This reference count does 2 things. 1) Allows a thread to nest calls to disable protection such that the first call to re-enable protection does not 'break' the last access of the pmem device memory. 2) Provides faster performance by avoiding lots of MSR writes. For example, looping over a sequence of pmem pages. In addition, we must ensure the reference count is preserved through an exception so we add the count to irqentry_state_t and save/restore the reference count while giving exceptions their own count should they use a kmap call. The following shows how this works through an exception: ... // ref == 0 dev_access_enable() // ref += 1 ==> disable protection irq() // enable protection // ref = 0 _handler() dev_access_enable() // ref += 1 ==> disable protection dev_access_disable() // ref -= 1 ==> enable protection // WARN_ON(ref != 0) // disable protection do_pmem_thing() // all good here dev_access_disable() // ref -= 1 ==> 0 ==> enable protection ... Nested exceptions operate the same way with each exception storing the interrupted exception state all the way down. 
The pkey value is never free'ed as this optimizes the implementation to be either on or off using a static branch conditional in the fast paths. Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Signed-off-by: Ira Weiny --- arch/x86/entry/common.c | 21 +++++++++ include/linux/entry-common.h | 3 ++ include/linux/memremap.h | 1 + include/linux/mm.h | 43 +++++++++++++++++ include/linux/sched.h | 3 ++ init/init_task.c | 3 ++ kernel/fork.c | 3 ++ mm/Kconfig | 13 ++++++ mm/memremap.c | 90 ++++++++++++++++++++++++++++++++++++ 9 files changed, 180 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 86ad32e0095e..3680724c1a4d 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state * * NOTE That the thread saved PKRS must be preserved separately to ensure * global overrides do not 'stick' on a thread. + * + * Furthermore, Zone Device Access Protection maintains access in a re-entrant + * manner through a reference count which also needs to be maintained should + * exception handlers use those interfaces for memory access. Here we start + * off the exception handler ref count to 0 and ensure it is 0 when the + * exception is done. Then restore it for the interrupted task. */ noinstr void irq_save_pkrs(irqentry_state_t *state) { if (!cpu_feature_enabled(X86_FEATURE_PKS)) return; +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + /* + * Save the ref count of the current running process and set it to 0 + * for any irq users to properly track re-entrance + */ + state->pkrs_ref = current->dev_page_access_ref; + current->dev_page_access_ref = 0; +#endif + /* * The thread_pkrs must be maintained separately to prevent global * overrides from 'sticking' on a thread. 
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state) write_pkrs(state->pkrs); current->thread.saved_pkrs = state->thread_pkrs; + +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + WARN_ON_ONCE(current->dev_page_access_ref != 0); + /* Restore the interrupted process reference */ + current->dev_page_access_ref = state->pkrs_ref; +#endif } #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */ diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index c3b361ffa059..06743cce2dbf 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs); #ifndef irqentry_state typedef struct irqentry_state { #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + unsigned int pkrs_ref; +#endif u32 pkrs; u32 thread_pkrs; #endif diff --git a/include/linux/memremap.h b/include/linux/memremap.h index e5862746751b..b6713ee7b218 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -89,6 +89,7 @@ struct dev_pagemap_ops { }; #define PGMAP_ALTMAP_VALID (1 << 0) +#define PGMAP_PROT_ENABLED (1 << 1) /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings diff --git a/include/linux/mm.h b/include/linux/mm.h index 16b799a0522c..9e845515ff15 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page) page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; } +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION +DECLARE_STATIC_KEY_FALSE(dev_protection_static_key); + +/* + * We make page_is_access_protected() as quick as possible. + * 1) If no mappings have been enabled with extra protection we skip this + * entirely + * 2) Skip pages which are not ZONE_DEVICE + * 3) Only then check if this particular page was mapped with extra + * protections. 
+ */ +static inline bool page_is_access_protected(struct page *page) +{ + if (!static_branch_unlikely(&dev_protection_static_key)) + return false; + if (!is_zone_device_page(page)) + return false; + if (page->pgmap->flags & PGMAP_PROT_ENABLED) + return true; + return false; +} + +void __dev_access_enable(bool global); +void __dev_access_disable(bool global); +static __always_inline void dev_access_enable(bool global) +{ + if (static_branch_unlikely(&dev_protection_static_key)) + __dev_access_enable(global); +} +static __always_inline void dev_access_disable(bool global) +{ + if (static_branch_unlikely(&dev_protection_static_key)) + __dev_access_disable(global); +} +#else +static inline bool page_is_access_protected(struct page *page) +{ + return false; +} +static inline void dev_access_enable(bool global) { } +static inline void dev_access_disable(bool global) { } +#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */ + /* 127: arbitrary random number, small enough to assemble well */ #define page_ref_zero_or_close_to_overflow(page) \ ((unsigned int) page_ref_count(page) + 127u <= 127u) diff --git a/include/linux/sched.h b/include/linux/sched.h index afe01e232935..25d97ab6c757 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1315,6 +1315,9 @@ struct task_struct { struct callback_head mce_kill_me; #endif +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + unsigned int dev_page_access_ref; +#endif /* * New fields for task_struct should be added above here, so that * they are included in the randomized portion of task_struct. 
diff --git a/init/init_task.c b/init/init_task.c index f6889fce64af..9b39f25de59b 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -209,6 +209,9 @@ struct task_struct init_task #ifdef CONFIG_SECCOMP .seccomp = { .filter_count = ATOMIC_INIT(0) }, #endif +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + .dev_page_access_ref = 0, +#endif }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index da8d360fb032..b6a3ee328a89 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) #ifdef CONFIG_MEMCG tsk->active_memcg = NULL; +#endif +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + tsk->dev_page_access_ref = 0; #endif return tsk; diff --git a/mm/Kconfig b/mm/Kconfig index 1b9bc004d9bc..01dd75720ae6 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -794,6 +794,19 @@ config ZONE_DEVICE If FS_DAX is enabled, then say Y. +config ZONE_DEVICE_ACCESS_PROTECTION + bool "Device memory access protection" + depends on ZONE_DEVICE + depends on ARCH_HAS_SUPERVISOR_PKEYS + + help + Enable the option of having access protections on device memory + areas. This protects against access to device memory which is not + intended such as stray writes. This feature is particularly useful + to protect against corruption of persistent memory. + + If in doubt, say 'Y'. + config DEV_PAGEMAP_OPS bool diff --git a/mm/memremap.c b/mm/memremap.c index fbfc79fd9c24..edad2aa0bd24 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -6,12 +6,16 @@ #include #include #include +#include #include #include #include #include #include #include +#include + +#define PKEY_INVALID (INT_MIN) static DEFINE_XARRAY(pgmap_array); @@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void) } #endif /* CONFIG_DEV_PAGEMAP_OPS */ +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION +/* + * Note; all devices which have asked for protections share the same key. The + * key may, or may not, have been provided by the core. 
If not, protection + * will remain disabled. The key acquisition is attempted at init time and + * never again. So we don't have to worry about dev_page_pkey changing. + */ +static int dev_page_pkey = PKEY_INVALID; +DEFINE_STATIC_KEY_FALSE(dev_protection_static_key); +EXPORT_SYMBOL(dev_protection_static_key); + +static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot) +{ + if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) { + pgprotval_t val = pgprot_val(prot); + + static_branch_inc(&dev_protection_static_key); + prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey)); + } + return prot; +} + +static void dev_pgprot_put(struct dev_pagemap *pgmap) +{ + if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) + static_branch_dec(&dev_protection_static_key); +} + +void __dev_access_disable(bool global) +{ + unsigned long flags; + + local_irq_save(flags); + if (!--current->dev_page_access_ref) + pks_mknoaccess(dev_page_pkey, global); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(__dev_access_disable); + +void __dev_access_enable(bool global) +{ + unsigned long flags; + + local_irq_save(flags); + /* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */ + if (!current->dev_page_access_ref++) + pks_mkrdwr(dev_page_pkey, global); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(__dev_access_enable); + +/** + * dev_access_protection_init: Configure a PKS key domain for device pages + * + * The domain defaults to the protected state. Device page mappings should set + * the PGMAP_PROT_ENABLED flag when mapping pages. + * + * Note the pkey is never free'ed. This is run at init time and we either get + * the key or we do not. We need to do this to maintian a constant key (or + * not) as device memory is added or removed. 
+ */
+static int __init __dev_access_protection_init(void)
+{
+	int pkey = pks_key_alloc("Device Memory");
+
+	if (pkey < 0)
+		return 0;
+
+	dev_page_pkey = pkey;
+
+	return 0;
+}
+subsys_initcall(__dev_access_protection_init);
+#else
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	return prot;
+}
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+}
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct resource *res)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
@@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
+	dev_pgprot_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	int error, is_ram;
 	bool need_devmap_managed = true;
 
+	params.pgprot = dev_pgprot_get(pgmap, params.pgprot);
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9

From mboxrd@z Thu Jan 1 00:00:00 1970
From: ira.weiny@intel.com
Date: Fri, 09 Oct 2020 19:49:38 +0000
Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection
Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com>
List-Id:
References: <20201009195033.3208459-1-ira.weiny@intel.com>
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , Peter Zijlstra
Cc: Juri Lelli , linux-aio@kvack.org, linux-efi@vger.kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-mmc@vger.kernel.org, Dave Hansen , dri-devel@lists.freedesktop.org, Ben Segall , linux-mm@kvack.org, target-devel@vger.kernel.org, linux-mtd@lists.infradead.org,
 linux-kselftest@vger.kernel.org, samba-technical@lists.samba.org, Ira Weiny , ceph-devel@vger.kernel.org, drbd-dev@lists.linbit.com, devel@driverdev.osuosl.org, linux-cifs@vger.kernel.org, linux-nilfs@vger.kernel.org, Vincent Guittot , linux-scsi@vger.kernel.org, linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org, x86@kernel.org, amd-gfx@lists.freedesktop.org, io-uring@vger.kernel.org, cluster-devel@redhat.com, linux-cachefs@redhat.com, intel-wired-lan@lists.osuosl.org, Mel Gorman , xen-devel@lists.xenproject.org, linux-ext4@vger.kernel.org, Fenghua Yu , linux-afs@lists.infradead.org, linux-um@lists.infradead.org, intel-gfx@lists.freedesktop.org, ecryptfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, reiserfs-devel@vger.kernel.org, Steven Rostedt , linux-block@vger.kernel.org, linux-bcache@vger.kernel.org, Dan Williams , Dietmar Eggemann , linux-nfs@vger.kernel.org, linux-ntfs-dev@lists.sourceforge.net, netdev@vger.kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-btrfs@vger.kernel.org

From: Ira Weiny

Device managed memory is exposed in the kernel direct map, which allows
stray pointers to access this device memory.

Stray pointers to normal memory may result in a crash or other
undesirable behavior which, while unfortunate, is usually recoverable
with a reboot. Stray accesses, specifically stray writes, to areas such
as non-volatile memory are permanent in nature and thus are more likely
to result in permanent user data loss than stray accesses to other
memory areas.

Furthermore, we protect against reads, which can also help with
speculative reads to poison areas. But this is a secondary reason.

Set up an infrastructure for extra device access protection. Then
implement the new protection using the new Protection Keys Supervisor
(PKS) on architectures which support it.

To enable this extra protection, devices specify a flag in the pgmap to
indicate that these areas should use additional protection.

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here. The kmap infrastructure
is implemented in a follow-on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the
type of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per task
reference count. This reference count does two things.

1) Allows a thread to nest calls to disable protection such that the
   first call to re-enable protection does not 'break' the last access
   of the pmem device memory.

2) Provides faster performance by avoiding many MSR writes, for
   example when looping over a sequence of pmem pages.

In addition, we must ensure the reference count is preserved through an
exception, so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use
a kmap call.

The following shows how this works through an exception:

    ...
        // ref == 0
        dev_access_enable()  // ref += 1 ==> disable protection
            irq()
                // enable protection
                // ref = 0
                _handler()
                    dev_access_enable()   // ref += 1 ==> disable protection
                    dev_access_disable()  // ref -= 1 ==> enable protection
                // WARN_ON(ref != 0)
                // disable protection
        do_pmem_thing()  // all good here
        dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
    ...

Nested exceptions operate the same way with each exception storing the
interrupted exception state all the way down.

The pkey value is never freed, as this allows the implementation to be
either on or off using a static branch conditional in the fast paths.

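As an illustration only, the nesting and exception behaviour described
above can be modelled in plain userspace C. This is a minimal sketch
under stated assumptions, not kernel code: task_model, irq_state_model,
model_set_access() and the other model_*() helpers are hypothetical
stand-ins for current->dev_page_access_ref, irqentry_state_t and the PKS
register update, and msr_writes counts only the updates issued by the
enable/disable calls themselves, not those done on exception entry and
exit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct task_model {
	unsigned int dev_page_access_ref;	/* per-task nesting count */
};

struct irq_state_model {
	unsigned int pkrs_ref;	/* saved nesting count of the interrupted context */
	bool access_allowed;	/* saved access state of the interrupted context */
};

static struct task_model current_task;	/* stands in for 'current' */
static bool access_allowed;		/* stands in for the pkey access state */
static unsigned int msr_writes;		/* updates issued by enable/disable */

static void model_set_access(bool allow)
{
	access_allowed = allow;
	msr_writes++;
}

/* Only the 0 -> 1 transition lifts the protection. */
static void model_dev_access_enable(void)
{
	if (!current_task.dev_page_access_ref++)
		model_set_access(true);
}

/* Only the 1 -> 0 transition restores the protection. */
static void model_dev_access_disable(void)
{
	if (!--current_task.dev_page_access_ref)
		model_set_access(false);
}

/* Exception entry: stash the interrupted state, start the handler at 0. */
static void model_irq_save(struct irq_state_model *state)
{
	state->pkrs_ref = current_task.dev_page_access_ref;
	state->access_allowed = access_allowed;
	current_task.dev_page_access_ref = 0;
	access_allowed = false;
}

/* Exception exit: the handler must be balanced, then restore. */
static void model_irq_restore(const struct irq_state_model *state)
{
	assert(current_task.dev_page_access_ref == 0);
	current_task.dev_page_access_ref = state->pkrs_ref;
	access_allowed = state->access_allowed;
}

static void model_reset(void)
{
	current_task.dev_page_access_ref = 0;
	access_allowed = false;
	msr_writes = 0;
}

/* Walk the exception sequence; returns the number of enable/disable updates. */
static unsigned int model_exception_flow(void)
{
	struct irq_state_model state;

	model_reset();
	model_dev_access_enable();	/* ref 0 -> 1, access granted */

	model_irq_save(&state);		/* handler starts protected, ref = 0 */
	assert(!access_allowed);
	model_dev_access_enable();	/* handler's own count: 0 -> 1 */
	model_dev_access_disable();	/* handler's own count: 1 -> 0 */
	model_irq_restore(&state);	/* ref back to 1, access back */

	assert(access_allowed);		/* do_pmem_thing() would be fine here */
	model_dev_access_disable();	/* ref 1 -> 0, protected again */
	assert(!access_allowed);
	return msr_writes;
}

/* Nested accesses under one outer enable; returns the update count. */
static unsigned int model_nested_loop(unsigned int npages)
{
	unsigned int i;

	model_reset();
	model_dev_access_enable();
	for (i = 0; i < npages; i++) {
		model_dev_access_enable();	/* nested: no update */
		model_dev_access_disable();	/* nested: no update */
	}
	model_dev_access_disable();
	return msr_writes;
}
```

In this model, model_exception_flow() issues four updates (two for the
interrupted context, two for the handler), while model_nested_loop()
issues two regardless of the number of pages, since only the outermost
enable/disable pair changes the protection state.
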
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Signed-off-by: Ira Weiny
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access. Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done. Then restore it for the interrupted task.
  */
 noinstr void irq_save_pkrs(irqentry_state_t *state)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state->pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
 	/*
 	 * The thread_pkrs must be maintained separately to prevent global
 	 * overrides from 'sticking' on a thread.
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state)
 
 	write_pkrs(state->pkrs);
 	current->thread.saved_pkrs = state->thread_pkrs;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state->pkrs_ref;
+#endif
 }
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c3b361ffa059..06743cce2dbf 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 #ifndef irqentry_state
 typedef struct irqentry_state {
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int pkrs_ref;
+#endif
 	u32 pkrs;
 	u32 thread_pkrs;
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..b6713ee7b218 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROT_ENABLED	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..9e845515ff15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ *    1) If no mappings have been enabled with extra protection we skip this
+ *       entirely
+ *    2) Skip pages which are not ZONE_DEVICE
+ *    3) Only then check if this particular page was mapped with extra
+ *       protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int			dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas. This protects against access to device memory which is not
+	  intended such as stray writes. This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note; all devices which have asked for protections share the same key. The
+ * key may, or may not, have been provided by the core. If not, protection
+ * will remain disabled. The key acquisition is attempted at init time and
+ * never again. So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void __dev_access_disable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (!--current->dev_page_access_ref)
+		pks_mknoaccess(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_disable);
+
+void __dev_access_enable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (!current->dev_page_access_ref++)
+		pks_mkrdwr(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state. Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never free'ed. This is run at init time and we either get
+ * the key or we do not. We need to do this to maintain a constant key (or
+ * not) as device memory is added or removed.
+ */ +static int __init __dev_access_protection_init(void) +{ + int pkey = pks_key_alloc("Device Memory"); + + if (pkey < 0) + return 0; + + dev_page_pkey = pkey; + + return 0; +} +subsys_initcall(__dev_access_protection_init); +#else +static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot) +{ + return prot; +} +static void dev_pgprot_put(struct dev_pagemap *pgmap) +{ +} +#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */ + static void pgmap_array_delete(struct resource *res) { xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end), @@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap) pgmap_array_delete(res); WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(); + dev_pgprot_put(pgmap); } EXPORT_SYMBOL_GPL(memunmap_pages); @@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) int error, is_ram; bool need_devmap_managed = true; + params.pgprot = dev_pgprot_get(pgmap, params.pgprot); + switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) { -- 2.28.0.rc0.12.gb6a658bd00c9 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.5 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0EDDAC388F8 for ; Fri, 9 Oct 2020 19:51:10 +0000 (UTC) Received: from lists.sourceforge.net (lists.sourceforge.net [216.105.38.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id B9E8D22282; Fri, 9 Oct 2020 
19:51:09 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=sourceforge.net header.i=@sourceforge.net header.b="SlnDdHU+"; dkim=fail reason="signature verification failed" (1024-bit key) header.d=sf.net header.i=@sf.net header.b="ZH2K5jAI" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org B9E8D22282 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=linux-f2fs-devel-bounces@lists.sourceforge.net Received: from [127.0.0.1] (helo=sfs-ml-2.v29.lw.sourceforge.com) by sfs-ml-2.v29.lw.sourceforge.com with esmtp (Exim 4.90_1) (envelope-from ) id 1kQyPz-00016p-07; Fri, 09 Oct 2020 19:51:07 +0000 Received: from [172.30.20.202] (helo=mx.sourceforge.net) by sfs-ml-2.v29.lw.sourceforge.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kQyPx-00016V-Ji; Fri, 09 Oct 2020 19:51:05 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=sourceforge.net; s=x; h=Content-Transfer-Encoding:MIME-Version:References: In-Reply-To:Message-Id:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type: Content-ID:Content-Description:Resent-Date:Resent-From:Resent-Sender: Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=aBUOsqEpGguM/KLUdGYOL835HjJgH9FQh0jwcJGVZeE=; b=SlnDdHU+dg1zlY3151Ea74JqD7 Jbz1/Dkw+CIaiaN4IkPZsM1L+c5q/WnJ+tStei2xGneCWg4VKVvmayPwIm7A1i/RORnm2/L2SWJ6F pROfqKTXE8LouHMcvOHtaARwHcV3UPJVYQcmMUl7Zen1hbxoK8vRyY8yKlUPVC40/TvI=; DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=sf.net; s=x ; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To:Message-Id: Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc 
:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=aBUOsqEpGguM/KLUdGYOL835HjJgH9FQh0jwcJGVZeE=; b=ZH2K5jAIEQIh0l2omwmGzSkDFG 4nOvuFCc07QyuQbxXOPR0NAov3c50PG2Le+dKCZr3XJeXpHLT3uyz1MTxXnz96izaduDcHcbA40Su 3DvOVWB+DPpTcpuIYDwd9JZhNuQ+wBDUffj9SnqGkFNcZsvIfi611Hd6wkSSyTLHfNUk=; Received: from mga04.intel.com ([192.55.52.120]) by sfi-mx-1.v28.lw.sourceforge.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.92.2) id 1kQyPu-008rfx-SK; Fri, 09 Oct 2020 19:51:05 +0000 IronPort-SDR: pyDq4KRa1wdrGZYvorn7pTKQwU/6ZAWxiud/WIfbaVXV4hsvaRD3vZUhsOmbrruuGkIAPQF1zM uxEiTZXN6dvg== X-IronPort-AV: E=McAfee;i="6000,8403,9769"; a="162893220" X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="162893220" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:56 -0700 IronPort-SDR: WlCZORaTV0ocl1yC7IDa4J4IbRZJZvP+KytzS1LtxRNRUTgHeg4W5ZWyzNndx3DpBXRsqL9R3c CQItA3N5s8ig== X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="462300571" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:55 -0700 From: ira.weiny@intel.com To: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , Peter Zijlstra Date: Fri, 9 Oct 2020 12:49:38 -0700 Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com> References: <20201009195033.3208459-1-ira.weiny@intel.com> MIME-Version: 1.0 X-Headers-End: 1kQyPu-008rfx-SK Subject: [f2fs-dev] [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection X-BeenThere: linux-f2fs-devel@lists.sourceforge.net X-Mailman-Version: 2.1.21 Precedence: list List-Id: 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , linux-aio@kvack.org, linux-efi@vger.kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-mmc@vger.kernel.org, Dave Hansen , dri-devel@lists.freedesktop.org, Ben Segall , linux-mm@kvack.org, target-devel@vger.kernel.org, linux-mtd@lists.infradead.org, linux-kselftest@vger.kernel.org, samba-technical@lists.samba.org, Ira Weiny , ceph-devel@vger.kernel.org, drbd-dev@lists.linbit.com, devel@driverdev.osuosl.org, linux-cifs@vger.kernel.org, linux-nilfs@vger.kernel.org, Vincent Guittot , linux-scsi@vger.kernel.org, linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org, x86@kernel.org, amd-gfx@lists.freedesktop.org, io-uring@vger.kernel.org, cluster-devel@redhat.com, linux-cachefs@redhat.com, intel-wired-lan@lists.osuosl.org, Mel Gorman , xen-devel@lists.xenproject.org, linux-ext4@vger.kernel.org, Fenghua Yu , linux-afs@lists.infradead.org, linux-um@lists.infradead.org, intel-gfx@lists.freedesktop.org, ecryptfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, reiserfs-devel@vger.kernel.org, Steven Rostedt , linux-block@vger.kernel.org, linux-bcache@vger.kernel.org, Dan Williams , Dietmar Eggemann , linux-nfs@vger.kernel.org, linux-ntfs-dev@lists.sourceforge.net, netdev@vger.kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-btrfs@vger.kernel.org Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-f2fs-devel-bounces@lists.sourceforge.net From: Ira Weiny Device managed memory exposes itself to the kernel direct map which allows stray pointers to access these device memories. Stray pointers to normal memory may result in a crash or other undesirable behavior which, while unfortunate, are usually recoverable with a reboot. 
Stray access, specifically stray writes, to areas such as non-volatile memory are permanent in nature and thus are more likely to result in permanent user data loss vs stray access to other memory areas. Furthermore, we protect against reads which can help with speculative reads to poison areas as well. But this is a secondary reason. Set up an infrastructure for extra device access protection. Then implement the new protection using the new Protection Keys Supervisor (PKS) on architectures which support it. To enable this extra protection devices specify a flag in the pgmap to indicate that these areas wish to use additional protection. Kernel code which intends to access this memory can do so automatically through the use of the kmap infrastructure calling into dev_access_[enable|disable]() described here. The kmap infrastructure is implemented in a follow on patch. In addition, users can directly enable/disable the access through dev_access_[enable|disable]() if they have a priori knowledge of the type of pages they are accessing. All calls to enable/disable protection flow through dev_access_[enable|disable]() and are nestable by the use of a per task reference count. This reference count does 2 things. 1) Allows a thread to nest calls to disable protection such that the first call to re-enable protection does not 'break' the last access of the pmem device memory. 2) Provides faster performance by avoiding lots of MSR writes. For example, looping over a sequence of pmem pages. In addition, we must ensure the reference count is preserved through an exception so we add the count to irqentry_state_t and save/restore the reference count while giving exceptions their own count should they use a kmap call. The following shows how this works through an exception: ... 
// ref == 0 dev_access_enable() // ref += 1 ==> disable protection irq() // enable protection // ref = 0 _handler() dev_access_enable() // ref += 1 ==> disable protection dev_access_disable() // ref -= 1 ==> enable protection // WARN_ON(ref != 0) // disable protection do_pmem_thing() // all good here dev_access_disable() // ref -= 1 ==> 0 ==> enable protection ... Nested exceptions operate the same way with each exception storing the interrupted exception state all the way down. The pkey value is never free'ed as this optimizes the implementation to be either on or off using a static branch conditional in the fast paths. Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Signed-off-by: Ira Weiny --- arch/x86/entry/common.c | 21 +++++++++ include/linux/entry-common.h | 3 ++ include/linux/memremap.h | 1 + include/linux/mm.h | 43 +++++++++++++++++ include/linux/sched.h | 3 ++ init/init_task.c | 3 ++ kernel/fork.c | 3 ++ mm/Kconfig | 13 ++++++ mm/memremap.c | 90 ++++++++++++++++++++++++++++++++++++ 9 files changed, 180 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 86ad32e0095e..3680724c1a4d 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state * * NOTE That the thread saved PKRS must be preserved separately to ensure * global overrides do not 'stick' on a thread. + * + * Furthermore, Zone Device Access Protection maintains access in a re-entrant + * manner through a reference count which also needs to be maintained should + * exception handlers use those interfaces for memory access. Here we start + * off the exception handler ref count to 0 and ensure it is 0 when the + * exception is done. Then restore it for the interrupted task. 
*/ noinstr void irq_save_pkrs(irqentry_state_t *state) { if (!cpu_feature_enabled(X86_FEATURE_PKS)) return; +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + /* + * Save the ref count of the current running process and set it to 0 + * for any irq users to properly track re-entrance + */ + state->pkrs_ref = current->dev_page_access_ref; + current->dev_page_access_ref = 0; +#endif + /* * The thread_pkrs must be maintained separately to prevent global * overrides from 'sticking' on a thread. @@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state) write_pkrs(state->pkrs); current->thread.saved_pkrs = state->thread_pkrs; + +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + WARN_ON_ONCE(current->dev_page_access_ref != 0); + /* Restore the interrupted process reference */ + current->dev_page_access_ref = state->pkrs_ref; +#endif } #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */ diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index c3b361ffa059..06743cce2dbf 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs); #ifndef irqentry_state typedef struct irqentry_state { #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + unsigned int pkrs_ref; +#endif u32 pkrs; u32 thread_pkrs; #endif diff --git a/include/linux/memremap.h b/include/linux/memremap.h index e5862746751b..b6713ee7b218 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -89,6 +89,7 @@ struct dev_pagemap_ops { }; #define PGMAP_ALTMAP_VALID (1 << 0) +#define PGMAP_PROT_ENABLED (1 << 1) /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings diff --git a/include/linux/mm.h b/include/linux/mm.h index 16b799a0522c..9e845515ff15 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page) page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; } +#ifdef 
CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ * 1) If no mappings have been enabled with extra protection we skip this
+ *    entirely
+ * 2) Skip pages which are not ZONE_DEVICE
+ * 3) Only then check if this particular page was mapped with extra
+ *    protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp = { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas.  This protects against access to device memory which is not
+	  intended such as stray writes.  This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note: all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.  If not, protection
+ * will remain disabled.  The key acquisition is attempted at init time and
+ * never again.  So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void __dev_access_disable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (!--current->dev_page_access_ref)
+		pks_mknoaccess(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_disable);
+
+void __dev_access_enable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (!current->dev_page_access_ref++)
+		pks_mkrdwr(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state.  Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never freed.  This is run at init time and we either get
+ * the key or we do not.  We need to do this to maintain a constant key (or
+ * not) as device memory is added or removed.
+ */
+static int __init __dev_access_protection_init(void)
+{
+	int pkey = pks_key_alloc("Device Memory");
+
+	if (pkey < 0)
+		return 0;
+
+	dev_page_pkey = pkey;
+
+	return 0;
+}
+subsys_initcall(__dev_access_protection_init);
+#else
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	return prot;
+}
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+}
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct resource *res)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
@@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
+	dev_pgprot_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	int error, is_ram;
 	bool need_devmap_managed = true;
 
+	params.pgprot = dev_pgprot_get(pgmap, params.pgprot);
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9
_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,
	SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id EC5DBC433E7
	for ; Sat, 10 Oct 2020 01:51:16 +0000 (UTC)
Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2])
	(using TLSv1.2 with cipher
	ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 93A4B2225B
	for ; Sat, 10 Oct 2020 01:51:15 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 93A4B2225B
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none)
	header.from=intel.com
Authentication-Results: mail.kernel.org; spf=pass
	smtp.mailfrom=linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org
Received: from bilbo.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3])
	by lists.ozlabs.org (Postfix) with ESMTP id 4C7Sb51ZlgzDqnZ
	for ; Sat, 10 Oct 2020 12:51:13 +1100 (AEDT)
Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized)
	smtp.mailfrom=intel.com (client-ip=134.134.136.20; helo=mga02.intel.com;
	envelope-from=ira.weiny@intel.com; receiver=)
Authentication-Results: lists.ozlabs.org; dmarc=pass (p=none dis=none)
	header.from=intel.com
Received: from mga02.intel.com (mga02.intel.com [134.134.136.20])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by lists.ozlabs.org (Postfix) with ESMTPS id 4C7JbQ40SXzDqXt;
	Sat, 10 Oct 2020 06:50:58 +1100 (AEDT)
IronPort-SDR: sqmSYntlS3ZaE46bkl56rpseRyooN3BElpuPrWLmOrKg9ypW/uzzfORivLDVboUWs4doavpo9o
	gu2Wa9P1utKA==
X-IronPort-AV: E=McAfee;i="6000,8403,9769"; a="152450747"
X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="152450747"
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from orsmga004.jf.intel.com ([10.7.209.38])
	by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
	09 Oct 2020 12:50:56 -0700
IronPort-SDR: WlCZORaTV0ocl1yC7IDa4J4IbRZJZvP+KytzS1LtxRNRUTgHeg4W5ZWyzNndx3DpBXRsqL9R3c
	CQItA3N5s8ig==
X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="462300571"
Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147])
	by orsmga004-auth.jf.intel.com with
	ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:55 -0700
From: ira.weiny@intel.com
To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection
Date: Fri, 9 Oct 2020 12:49:38 -0700
Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com>
X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>
References: <20201009195033.3208459-1-ira.weiny@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Mailman-Approved-At: Sat, 10 Oct 2020 12:49:23 +1100
X-BeenThere: linuxppc-dev@lists.ozlabs.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Linux on PowerPC Developers Mail List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Cc: Juri Lelli, linux-aio@kvack.org, linux-efi@vger.kernel.org,
	kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-mmc@vger.kernel.org,
	Dave Hansen, dri-devel@lists.freedesktop.org, Ben Segall,
	linux-mm@kvack.org, target-devel@vger.kernel.org,
	linux-mtd@lists.infradead.org, linux-kselftest@vger.kernel.org,
	samba-technical@lists.samba.org, Ira Weiny, ceph-devel@vger.kernel.org,
	drbd-dev@lists.linbit.com, devel@driverdev.osuosl.org,
	linux-cifs@vger.kernel.org, linux-nilfs@vger.kernel.org, Vincent Guittot,
	linux-scsi@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-rdma@vger.kernel.org, x86@kernel.org, amd-gfx@lists.freedesktop.org,
	io-uring@vger.kernel.org, cluster-devel@redhat.com,
	linux-cachefs@redhat.com, intel-wired-lan@lists.osuosl.org, Mel Gorman,
	xen-devel@lists.xenproject.org, linux-ext4@vger.kernel.org, Fenghua Yu,
	linux-afs@lists.infradead.org, linux-um@lists.infradead.org,
	intel-gfx@lists.freedesktop.org, ecryptfs@vger.kernel.org,
	linux-erofs@lists.ozlabs.org, reiserfs-devel@vger.kernel.org,
	Steven Rostedt, linux-block@vger.kernel.org, linux-bcache@vger.kernel.org,
	Dan Williams, Dietmar Eggemann,
	linux-nfs@vger.kernel.org, linux-ntfs-dev@lists.sourceforge.net,
	netdev@vger.kernel.org, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net,
	linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-btrfs@vger.kernel.org
Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org
Sender: "Linuxppc-dev"

From: Ira Weiny

Device managed memory exposes itself to the kernel direct map which
allows stray pointers to access these device memories.

Stray pointers to normal memory may result in a crash or other
undesirable behavior which, while unfortunate, is usually recoverable
with a reboot.  Stray access, specifically stray writes, to areas such
as non-volatile memory is permanent in nature and thus more likely to
result in permanent user data loss than stray access to other memory
areas.

Furthermore, we protect against reads, which can help with speculative
reads to poison areas as well.  But this is a secondary reason.

Set up an infrastructure for extra device access protection.  Then
implement the new protection using the new Protection Keys Supervisor
(PKS) on architectures which support it.

To enable this extra protection, devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here.  The kmap infrastructure
is implemented in a follow-on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the
type of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per-task
reference count.  This reference count does two things:

1) Allows a thread to nest calls to disable protection such that the
   first call to re-enable protection does not 'break' the last access
   of the pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes.  For
   example, looping over a sequence of pmem pages.

In addition, we must ensure the reference count is preserved through an
exception, so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use
a kmap call.

The following shows how this works through an exception:

	...
			// ref == 0
	dev_access_enable()  // ref += 1 ==> disable protection
		irq()
			// enable protection
			// ref = 0
			_handler()
				dev_access_enable()   // ref += 1 ==> disable protection
				dev_access_disable()  // ref -= 1 ==> enable protection
			// WARN_ON(ref != 0)
			// disable protection
	do_pmem_thing()  // all good here
	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
	...

Nested exceptions operate the same way with each exception storing the
interrupted exception state all the way down.

The pkey value is never freed as this optimizes the implementation to
be either on or off using a static branch conditional in the fast paths.

Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Signed-off-by: Ira Weiny
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access.  Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done.  Then restore it for the interrupted task.
  */
 noinstr void irq_save_pkrs(irqentry_state_t *state)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state->pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
 	/*
 	 * The thread_pkrs must be maintained separately to prevent global
 	 * overrides from 'sticking' on a thread.
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state)
 
 	write_pkrs(state->pkrs);
 	current->thread.saved_pkrs = state->thread_pkrs;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state->pkrs_ref;
+#endif
 }
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c3b361ffa059..06743cce2dbf 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 #ifndef irqentry_state
 typedef struct irqentry_state {
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int pkrs_ref;
+#endif
 	u32 pkrs;
 	u32 thread_pkrs;
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..b6713ee7b218 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROT_ENABLED	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..9e845515ff15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ * 1) If no mappings have been enabled with extra protection we skip this
+ *    entirely
+ * 2) Skip pages which are not ZONE_DEVICE
+ * 3) Only then check if this particular page was mapped with extra
+ *    protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp = { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas.  This protects against access to device memory which is not
+	  intended such as stray writes.  This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note: all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.
If not, protection + * will remain disabled. The key acquisition is attempted at init time and + * never again. So we don't have to worry about dev_page_pkey changing. + */ +static int dev_page_pkey = PKEY_INVALID; +DEFINE_STATIC_KEY_FALSE(dev_protection_static_key); +EXPORT_SYMBOL(dev_protection_static_key); + +static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot) +{ + if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) { + pgprotval_t val = pgprot_val(prot); + + static_branch_inc(&dev_protection_static_key); + prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey)); + } + return prot; +} + +static void dev_pgprot_put(struct dev_pagemap *pgmap) +{ + if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) + static_branch_dec(&dev_protection_static_key); +} + +void __dev_access_disable(bool global) +{ + unsigned long flags; + + local_irq_save(flags); + if (!--current->dev_page_access_ref) + pks_mknoaccess(dev_page_pkey, global); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(__dev_access_disable); + +void __dev_access_enable(bool global) +{ + unsigned long flags; + + local_irq_save(flags); + /* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */ + if (!current->dev_page_access_ref++) + pks_mkrdwr(dev_page_pkey, global); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(__dev_access_enable); + +/** + * dev_access_protection_init: Configure a PKS key domain for device pages + * + * The domain defaults to the protected state. Device page mappings should set + * the PGMAP_PROT_ENABLED flag when mapping pages. + * + * Note the pkey is never free'ed. This is run at init time and we either get + * the key or we do not. We need to do this to maintian a constant key (or + * not) as device memory is added or removed. 
+ */ +static int __init __dev_access_protection_init(void) +{ + int pkey = pks_key_alloc("Device Memory"); + + if (pkey < 0) + return 0; + + dev_page_pkey = pkey; + + return 0; +} +subsys_initcall(__dev_access_protection_init); +#else +static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot) +{ + return prot; +} +static void dev_pgprot_put(struct dev_pagemap *pgmap) +{ +} +#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */ + static void pgmap_array_delete(struct resource *res) { xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end), @@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap) pgmap_array_delete(res); WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(); + dev_pgprot_put(pgmap); } EXPORT_SYMBOL_GPL(memunmap_pages); @@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) int error, is_ram; bool need_devmap_managed = true; + params.pgprot = dev_pgprot_get(pgmap, params.pgprot); + switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) { -- 2.28.0.rc0.12.gb6a658bd00c9 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9451C433DF for ; Fri, 9 Oct 2020 20:16:50 +0000 (UTC) Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 51EC7215A4 for ; 
Fri, 9 Oct 2020 20:16:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=lists.infradead.org header.i=@lists.infradead.org header.b="U8HelWMv"; dkim=fail reason="signature verification failed" (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="TEPtk4tT" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 51EC7215A4 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-mtd-bounces+linux-mtd=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=merlin.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:Cc:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To:Message-Id:Date: Subject:To:From:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=Xv8WnvZ6bejxyjXAfFb3TQq3og241pfNzblaHZVNaMA=; b=U8HelWMvMhpOHk1vsT/bGPUf5 s1xKEWKUG7cp+swT5RpT+Y2kwgVI7cwA+QjzJmJANFBnJBnDTUDAoy28ZoyhrpiaFBwQNzU9SRiLc 0/XV8WgcCladPElnnsQeEUVJS+ahBc21nuVrOQxCNeHtxTH9UP9akzlwva8C0A6V2AlE5I2WZq/G+ BgPiLn4EqFINXSObOy/sbo0GpQ2aJd998KPA6+4Xwkz8kL9PDjNK/yQ7LJjPan+oSNzRd2zkE9Mn+ IfIrROF2lDuio39AD/cwTv/sbNR8zq4ZfgMU3EC7RyH2dlJGl0xHq17cU3ArOPPn1Kg7jh5kBiLl1 O5LHJf0sw==; Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.92.3 #3 (Red Hat Linux)) id 1kQynf-0006cl-QH; Fri, 09 Oct 2020 20:15:36 +0000 Received: from casper.infradead.org ([2001:8b0:10b:1236::1]) by merlin.infradead.org with esmtps (Exim 4.92.3 #3 (Red Hat Linux)) id 1kQyjS-00014k-JR; Fri, 09 Oct 2020 20:11:14 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:MIME-Version: 
References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From:Sender:Reply-To: Content-Type:Content-ID:Content-Description; bh=aBUOsqEpGguM/KLUdGYOL835HjJgH9FQh0jwcJGVZeE=; b=TEPtk4tTr6e6CEuYJXosqA30ae vB8Xx/sgtDFfSMrZAxHn7MBkEfs1RWvEwQwsbnBpUfs2O8SRSNQ/6R/1QTnTt3LFu2dYahhxLX3xh nRVqnkUAze+tEldVR9sm1rmPCqPMuvOrEtrnH/UfV634Z12dNbwr5RF5SAOWWlHH0CqgLu+noZTgF X8Dt78RxTd+H1WhoAAa2fudM+Cwb5Sxl/jNxcuUeJSWbcoxQxIcl78QhWmJeUU901hDKDfvP+SDAJ VfQCzfa7BohIvQPK0zWMG3pv/Vvpq4lzIPc0gej9TDsN5Bp53S4SgwKvmACEM+7hhM+FjYLZecAEz 0ROu2/WQ==; Received: from mga05.intel.com ([192.55.52.43]) by casper.infradead.org with esmtps (Exim 4.92.3 #3 (Red Hat Linux)) id 1kQyPu-0001DH-S1; Fri, 09 Oct 2020 19:51:10 +0000 IronPort-SDR: cNoBaSbZbi6NkgO8j+7wvI6FohNG+jLCG9J/tILbXoyjB++3A2NABh6BR9AkX9e7xLKP2kX25e lvwKOYbdfFLg== X-IronPort-AV: E=McAfee;i="6000,8403,9769"; a="250225892" X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="250225892" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:56 -0700 IronPort-SDR: WlCZORaTV0ocl1yC7IDa4J4IbRZJZvP+KytzS1LtxRNRUTgHeg4W5ZWyzNndx3DpBXRsqL9R3c CQItA3N5s8ig== X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="462300571" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:55 -0700 From: ira.weiny@intel.com To: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , Peter Zijlstra Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection Date: Fri, 9 Oct 2020 12:49:38 -0700 Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com> References: <20201009195033.3208459-1-ira.weiny@intel.com> 
MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20201009_205103_682412_A8148B76 X-CRM114-Status: GOOD ( 33.36 ) X-BeenThere: linux-mtd@lists.infradead.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux MTD discussion mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , linux-aio@kvack.org, linux-efi@vger.kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-mmc@vger.kernel.org, Dave Hansen , dri-devel@lists.freedesktop.org, Ben Segall , linux-mm@kvack.org, target-devel@vger.kernel.org, linux-mtd@lists.infradead.org, linux-kselftest@vger.kernel.org, samba-technical@lists.samba.org, Ira Weiny , ceph-devel@vger.kernel.org, drbd-dev@lists.linbit.com, devel@driverdev.osuosl.org, linux-cifs@vger.kernel.org, linux-nilfs@vger.kernel.org, Vincent Guittot , linux-scsi@vger.kernel.org, linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org, x86@kernel.org, amd-gfx@lists.freedesktop.org, io-uring@vger.kernel.org, cluster-devel@redhat.com, linux-cachefs@redhat.com, intel-wired-lan@lists.osuosl.org, Mel Gorman , xen-devel@lists.xenproject.org, linux-ext4@vger.kernel.org, Fenghua Yu , linux-afs@lists.infradead.org, linux-um@lists.infradead.org, intel-gfx@lists.freedesktop.org, ecryptfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, reiserfs-devel@vger.kernel.org, Steven Rostedt , linux-block@vger.kernel.org, linux-bcache@vger.kernel.org, Dan Williams , Dietmar Eggemann , linux-nfs@vger.kernel.org, linux-ntfs-dev@lists.sourceforge.net, netdev@vger.kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-btrfs@vger.kernel.org Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-mtd" Errors-To: linux-mtd-bounces+linux-mtd=archiver.kernel.org@lists.infradead.org From: 
Ira Weiny

Device managed memory exposes itself to the kernel direct map which
allows stray pointers to access these device memories.

Stray pointers to normal memory may result in a crash or other
undesirable behavior which, while unfortunate, are usually recoverable
with a reboot. Stray access, specifically stray writes, to areas such
as non-volatile memory are permanent in nature and thus are more likely
to result in permanent user data loss vs stray access to other memory
areas.

Furthermore, we protect against reads which can help with speculative
reads to poison areas as well. But this is a secondary reason.

Set up an infrastructure for extra device access protection. Then
implement the new protection using the new Protection Keys Supervisor
(PKS) on architectures which support it.

To enable this extra protection devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here. The kmap infrastructure
is implemented in a follow on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the
type of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per task
reference count. This reference count does 2 things.

1) Allows a thread to nest calls to disable protection such that the
   first call to re-enable protection does not 'break' the last access of
   the pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes. For
   example, looping over a sequence of pmem pages.
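
The two effects of the reference count listed above can be sketched in plain user-space C. This is only a model under stated assumptions: `access_ref`, `access_allowed`, `msr_writes` and the `sketch_*` functions are hypothetical stand-ins for `current->dev_page_access_ref`, the PKS key state and the PKRS MSR write, and the interrupt masking done by the real `__dev_access_enable()`/`__dev_access_disable()` is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's state; none of these are kernel APIs. */
static unsigned int access_ref;   /* models current->dev_page_access_ref */
static bool access_allowed;       /* models the key state (false = protected) */
static unsigned int msr_writes;   /* counts simulated PKRS updates */

/* Models __dev_access_enable(): only the 0 -> 1 transition lifts protection. */
static void sketch_access_enable(void)
{
	if (!access_ref++) {
		access_allowed = true;
		msr_writes++;
	}
}

/* Models __dev_access_disable(): only the 1 -> 0 transition restores protection. */
static void sketch_access_disable(void)
{
	assert(access_ref > 0);	/* an unbalanced disable would underflow the count */
	if (!--access_ref) {
		access_allowed = false;
		msr_writes++;
	}
}

/* Visit n pages inside one outer enable/disable pair; returns the writes used. */
static unsigned int sketch_copy_pages(unsigned int n)
{
	unsigned int before = msr_writes;

	sketch_access_enable();
	for (unsigned int i = 0; i < n; i++) {
		sketch_access_enable();		/* nested: no state change */
		assert(access_allowed);
		sketch_access_disable();	/* nested: access stays open */
		assert(access_allowed);
	}
	sketch_access_disable();
	return msr_writes - before;
}
```

In this model a loop over any number of pages costs two simulated register writes when it is wrapped in an outer enable/disable pair, and an inner disable never closes access that the outer caller still needs.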

In addition, we must ensure the reference count is preserved through an
exception so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use
a kmap call.

The following shows how this works through an exception:

	...
		// ref == 0
		dev_access_enable()  // ref += 1 ==> disable protection
			irq()
				// enable protection
				// ref = 0
				_handler()
					dev_access_enable()   // ref += 1 ==> disable protection
					dev_access_disable()  // ref -= 1 ==> enable protection
				// WARN_ON(ref != 0)
				// disable protection
		do_pmem_thing()  // all good here
		dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
	...

Nested exceptions operate the same way with each exception storing the
interrupted exception state all the way down.

The pkey value is never free'ed as this optimizes the implementation to
be either on or off using a static branch conditional in the fast paths.

Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Signed-off-by: Ira Weiny
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access. Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done. Then restore it for the interrupted task.
  */
 noinstr void irq_save_pkrs(irqentry_state_t *state)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state->pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
 	/*
 	 * The thread_pkrs must be maintained separately to prevent global
 	 * overrides from 'sticking' on a thread.
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state)
 
 	write_pkrs(state->pkrs);
 	current->thread.saved_pkrs = state->thread_pkrs;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state->pkrs_ref;
+#endif
 }
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c3b361ffa059..06743cce2dbf 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 #ifndef irqentry_state
 typedef struct irqentry_state {
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int pkrs_ref;
+#endif
 	u32 pkrs;
 	u32 thread_pkrs;
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..b6713ee7b218 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROT_ENABLED	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..9e845515ff15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ *    1) If no mappings have been enabled with extra protection we skip this
+ *       entirely
+ *    2) Skip pages which are not ZONE_DEVICE
+ *    3) Only then check if this particular page was mapped with extra
+ *       protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp = { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas. This protects against access to device memory which is not
+	  intended such as stray writes. This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note; all devices which have asked for protections share the same key. The
+ * key may, or may not, have been provided by the core. If not, protection
+ * will remain disabled. The key acquisition is attempted at init time and
+ * never again. So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void __dev_access_disable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (!--current->dev_page_access_ref)
+		pks_mknoaccess(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_disable);
+
+void __dev_access_enable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (!current->dev_page_access_ref++)
+		pks_mkrdwr(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state. Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never free'ed. This is run at init time and we either get
+ * the key or we do not. We need to do this to maintain a constant key (or
+ * not) as device memory is added or removed.
+ */
+static int __init __dev_access_protection_init(void)
+{
+	int pkey = pks_key_alloc("Device Memory");
+
+	if (pkey < 0)
+		return 0;
+
+	dev_page_pkey = pkey;
+
+	return 0;
+}
+subsys_initcall(__dev_access_protection_init);
+#else
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	return prot;
+}
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+}
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct resource *res)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
@@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
+	dev_pgprot_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	int error, is_ram;
 	bool need_devmap_managed = true;
 
+	params.pgprot = dev_pgprot_get(pgmap, params.pgprot);
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,
HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CBAB7C388CE for ; Fri, 9 Oct 2020 19:51:00 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 9AAC522314 for ; Fri, 9 Oct 2020 19:51:00 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9AAC522314 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 4F78A6ED9C; Fri, 9 Oct 2020 19:50:59 +0000 (UTC) Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4582A6ED99; Fri, 9 Oct 2020 19:50:57 +0000 (UTC) IronPort-SDR: AeuGVtwLm2XosW+9G/k/OKFjRF2ulYyO9wB2O8hv9WNtCJE8HQ489ccVl3VPErImzorKlobKT0 hDrn7J3wKj1Q== X-IronPort-AV: E=McAfee;i="6000,8403,9769"; a="182976012" X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="182976012" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:56 -0700 IronPort-SDR: WlCZORaTV0ocl1yC7IDa4J4IbRZJZvP+KytzS1LtxRNRUTgHeg4W5ZWyzNndx3DpBXRsqL9R3c CQItA3N5s8ig== X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="462300571" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by orsmga004-auth.jf.intel.com with 
ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:55 -0700 From: ira.weiny@intel.com To: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , Peter Zijlstra Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection Date: Fri, 9 Oct 2020 12:49:38 -0700 Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com> References: <20201009195033.3208459-1-ira.weiny@intel.com> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , linux-aio@kvack.org, linux-efi@vger.kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-mmc@vger.kernel.org, Dave Hansen , dri-devel@lists.freedesktop.org, Ben Segall , linux-mm@kvack.org, target-devel@vger.kernel.org, linux-mtd@lists.infradead.org, linux-kselftest@vger.kernel.org, samba-technical@lists.samba.org, Ira Weiny , ceph-devel@vger.kernel.org, drbd-dev@lists.linbit.com, devel@driverdev.osuosl.org, linux-cifs@vger.kernel.org, linux-nilfs@vger.kernel.org, Vincent Guittot , linux-scsi@vger.kernel.org, linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org, x86@kernel.org, amd-gfx@lists.freedesktop.org, io-uring@vger.kernel.org, cluster-devel@redhat.com, linux-cachefs@redhat.com, intel-wired-lan@lists.osuosl.org, Mel Gorman , xen-devel@lists.xenproject.org, linux-ext4@vger.kernel.org, Fenghua Yu , linux-afs@lists.infradead.org, linux-um@lists.infradead.org, intel-gfx@lists.freedesktop.org, ecryptfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, reiserfs-devel@vger.kernel.org, Steven Rostedt , linux-block@vger.kernel.org, linux-bcache@vger.kernel.org, Dan Williams , Dietmar Eggemann , linux-nfs@vger.kernel.org, linux-ntfs-dev@lists.sourceforge.net, 
 netdev@vger.kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-btrfs@vger.kernel.org
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"

From: Ira Weiny

Device managed memory exposes itself to the kernel direct map, which
allows stray pointers to access these device memories.

Stray pointers to normal memory may result in a crash or other
undesirable behavior which, while unfortunate, is usually recoverable
with a reboot.  Stray access, specifically stray writes, to areas such
as non-volatile memory is permanent in nature and thus is more likely
to result in permanent user data loss than stray access to other memory
areas.

Furthermore, we protect against reads, which can help with speculative
reads to poison areas as well.  But this is a secondary reason.

Set up an infrastructure for extra device access protection.  Then
implement the new protection using the new Protection Keys Supervisor
(PKS) on architectures which support it.

To enable this extra protection, devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here.  The kmap infrastructure
is implemented in a follow-on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the
type of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per-task
reference count.  This reference count does 2 things.

1) Allows a thread to nest calls to disable protection such that the
   first call to re-enable protection does not 'break' the last access
   of the pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes, for
   example when looping over a sequence of pmem pages.

In addition, we must ensure the reference count is preserved through an
exception, so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use
a kmap call.

The following shows how this works through an exception:

	...
				// ref == 0
	dev_access_enable()	// ref += 1 ==> disable protection
		irq()
			// enable protection
			// ref = 0
			_handler()
				dev_access_enable()	// ref += 1 ==> disable protection
				dev_access_disable()	// ref -= 1 ==> enable protection
			// WARN_ON(ref != 0)
			// disable protection
	do_pmem_thing()		// all good here
	dev_access_disable()	// ref -= 1 ==> 0 ==> enable protection
	...

Nested exceptions operate the same way, with each exception storing the
interrupted exception state all the way down.

The pkey value is never freed, as this optimizes the implementation to
be either on or off using a static branch conditional in the fast
paths.
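The nesting and exception behaviour described above can be modelled in
user space.  The following is only an illustrative sketch and is not code
from this patch: every name in it (task_model, irq_state_model,
model_access_enable() and so on) is hypothetical, a bool stands in for
the pkey's access bit, and a counter stands in for PKRS MSR writes.  It
assumes, as the diagram does, that exception entry re-enables protection
and exception exit restores the interrupted state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct task_model {
	unsigned int access_ref;  /* models current->dev_page_access_ref */
	bool access_open;         /* true while the pkey would allow access */
	unsigned int reg_writes;  /* counts simulated PKRS register writes */
};

struct irq_state_model {
	unsigned int saved_ref;   /* models the pkrs_ref field of irqentry_state_t */
	bool saved_open;          /* models the saved PKRS value */
};

static void model_access_enable(struct task_model *t)
{
	/* Only the 0 -> 1 transition touches the simulated register. */
	if (t->access_ref++ == 0) {
		t->access_open = true;
		t->reg_writes++;
	}
}

static void model_access_disable(struct task_model *t)
{
	assert(t->access_ref > 0);
	/* Only the 1 -> 0 transition touches the simulated register. */
	if (--t->access_ref == 0) {
		t->access_open = false;
		t->reg_writes++;
	}
}

static void model_irq_enter(struct task_model *t, struct irq_state_model *s)
{
	/* Save the interrupted count and give the handler its own count. */
	s->saved_ref = t->access_ref;
	s->saved_open = t->access_open;
	t->access_ref = 0;
	t->access_open = false;   /* the handler starts out protected */
}

static void model_irq_exit(struct task_model *t, const struct irq_state_model *s)
{
	assert(t->access_ref == 0);   /* mirrors the WARN_ON_ONCE() check */
	t->access_ref = s->saved_ref;
	t->access_open = s->saved_open;
}

/* Replays the flow from the diagram; returns the final reference count. */
static unsigned int model_replay_flow(struct task_model *t)
{
	struct irq_state_model s;

	model_access_enable(t);    /* ref 0 -> 1, access opened */
	model_irq_enter(t, &s);    /* handler sees ref == 0, protected */
	model_access_enable(t);    /* handler's own 0 -> 1 */
	model_access_disable(t);   /* handler's own 1 -> 0 */
	model_irq_exit(t, &s);     /* ref back to 1, access reopened */
	assert(t->access_open);    /* do_pmem_thing() would be safe here */
	model_access_disable(t);   /* ref 1 -> 0, protected again */
	return t->access_ref;
}
```

Calling model_replay_flow() on a zero-initialised task_model walks the
same steps as the diagram.  In this model the simulated register is
written only on the 0 -> 1 and 1 -> 0 transitions, which is the saving
that point 2) refers to.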
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Signed-off-by: Ira Weiny
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access. Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done. Then restore it for the interrupted task.
  */
 noinstr void irq_save_pkrs(irqentry_state_t *state)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state->pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
 	/*
 	 * The thread_pkrs must be maintained separately to prevent global
 	 * overrides from 'sticking' on a thread.
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state)
 
 	write_pkrs(state->pkrs);
 	current->thread.saved_pkrs = state->thread_pkrs;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state->pkrs_ref;
+#endif
 }
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c3b361ffa059..06743cce2dbf 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 #ifndef irqentry_state
 typedef struct irqentry_state {
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int pkrs_ref;
+#endif
 	u32 pkrs;
 	u32 thread_pkrs;
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..b6713ee7b218 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID (1 << 0)
+#define PGMAP_PROT_ENABLED (1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..9e845515ff15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ * 1) If no mappings have been enabled with extra protection we skip this
+ * entirely
+ * 2) Skip pages which are not ZONE_DEVICE
+ * 3) Only then check if this particular page was mapped with extra
+ * protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp = { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas. This protects against access to device memory which is not
+	  intended such as stray writes. This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note; all devices which have asked for protections share the same key. The
+ * key may, or may not, have been provided by the core. If not, protection
+ * will remain disabled. The key acquisition is attempted at init time and
+ * never again. So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void __dev_access_disable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (!--current->dev_page_access_ref)
+		pks_mknoaccess(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_disable);
+
+void __dev_access_enable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (!current->dev_page_access_ref++)
+		pks_mkrdwr(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state. Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never free'ed. This is run at init time and we either get
+ * the key or we do not. We need to do this to maintain a constant key (or
+ * not) as device memory is added or removed.
+ */
+static int __init __dev_access_protection_init(void)
+{
+	int pkey = pks_key_alloc("Device Memory");
+
+	if (pkey < 0)
+		return 0;
+
+	dev_page_pkey = pkey;
+
+	return 0;
+}
+subsys_initcall(__dev_access_protection_init);
+#else
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	return prot;
+}
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+}
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct resource *res)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
@@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
+	dev_pgprot_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	int error, is_ram;
 	bool need_devmap_managed = true;
 
+	params.pgprot = dev_pgprot_get(pgmap, params.pgprot);
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 714E9C41604 for ; Fri, 9 Oct 2020 21:11:13 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher
ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 39C34222C3 for ; Fri, 9 Oct 2020 21:11:13 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 39C34222C3 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=amd-gfx-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 37D406EE53; Fri, 9 Oct 2020 21:11:03 +0000 (UTC) Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4582A6ED99; Fri, 9 Oct 2020 19:50:57 +0000 (UTC) IronPort-SDR: AeuGVtwLm2XosW+9G/k/OKFjRF2ulYyO9wB2O8hv9WNtCJE8HQ489ccVl3VPErImzorKlobKT0 hDrn7J3wKj1Q== X-IronPort-AV: E=McAfee;i="6000,8403,9769"; a="182976012" X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="182976012" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:56 -0700 IronPort-SDR: WlCZORaTV0ocl1yC7IDa4J4IbRZJZvP+KytzS1LtxRNRUTgHeg4W5ZWyzNndx3DpBXRsqL9R3c CQItA3N5s8ig== X-IronPort-AV: E=Sophos;i="5.77,355,1596524400"; d="scan'208";a="462300571" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Oct 2020 12:50:55 -0700 From: ira.weiny@intel.com To: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , Peter Zijlstra Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection Date: Fri, 9 Oct 2020 12:49:38 -0700 Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com> 
References: <20201009195033.3208459-1-ira.weiny@intel.com> MIME-Version: 1.0 X-Mailman-Approved-At: Fri, 09 Oct 2020 21:10:40 +0000 X-BeenThere: amd-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion list for AMD gfx List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Juri Lelli , linux-aio@kvack.org, linux-efi@vger.kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-mmc@vger.kernel.org, Dave Hansen , dri-devel@lists.freedesktop.org, Ben Segall , linux-mm@kvack.org, target-devel@vger.kernel.org, linux-mtd@lists.infradead.org, linux-kselftest@vger.kernel.org, samba-technical@lists.samba.org, Ira Weiny , ceph-devel@vger.kernel.org, drbd-dev@lists.linbit.com, devel@driverdev.osuosl.org, linux-cifs@vger.kernel.org, linux-nilfs@vger.kernel.org, Vincent Guittot , linux-scsi@vger.kernel.org, linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org, x86@kernel.org, amd-gfx@lists.freedesktop.org, io-uring@vger.kernel.org, cluster-devel@redhat.com, linux-cachefs@redhat.com, intel-wired-lan@lists.osuosl.org, Mel Gorman , xen-devel@lists.xenproject.org, linux-ext4@vger.kernel.org, Fenghua Yu , linux-afs@lists.infradead.org, linux-um@lists.infradead.org, intel-gfx@lists.freedesktop.org, ecryptfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, reiserfs-devel@vger.kernel.org, Steven Rostedt , linux-block@vger.kernel.org, linux-bcache@vger.kernel.org, Dan Williams , Dietmar Eggemann , linux-nfs@vger.kernel.org, linux-ntfs-dev@lists.sourceforge.net, netdev@vger.kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-btrfs@vger.kernel.org Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: amd-gfx-bounces@lists.freedesktop.org Sender: "amd-gfx" From: Ira Weiny Device managed memory exposes itself to the kernel direct map which allows 
stray pointers to access these device memories.

Stray pointers to normal memory may result in a crash or other undesirable
behavior which, while unfortunate, is usually recoverable with a reboot.
Stray accesses, specifically stray writes, to areas such as non-volatile
memory are permanent in nature and thus are more likely to result in
permanent user data loss than stray accesses to other memory areas.

Furthermore, we protect against reads, which can also help with speculative
reads of poisoned areas. But this is a secondary reason.

Set up an infrastructure for extra device access protection. Then implement
the new protection using the new Protection Keys Supervisor (PKS) on
architectures which support it.

To enable this extra protection, devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here. The kmap infrastructure is
implemented in a follow-on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the type
of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per-task
reference count. This reference count does 2 things.

1) Allows a thread to nest calls to disable protection such that the first
   call to re-enable protection does not 'break' the last access of the
   pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes, for example
   when looping over a sequence of pmem pages.

In addition, we must ensure the reference count is preserved through an
exception, so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use a
kmap call.

The following shows how this works through an exception:

	...
	// ref == 0
	dev_access_enable()  // ref += 1 ==> disable protection
		irq()
			// enable protection
			// ref = 0
			_handler()
				dev_access_enable()   // ref += 1 ==> disable protection
				dev_access_disable()  // ref -= 1 ==> enable protection
			// WARN_ON(ref != 0)
			// disable protection
	do_pmem_thing()  // all good here
	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
	...

Nested exceptions operate the same way with each exception storing the
interrupted exception state all the way down.

The pkey value is never freed, as this optimizes the implementation to be
either on or off using a static branch conditional in the fast paths.

Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Signed-off-by: Ira Weiny
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access. Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done. Then restore it for the interrupted task.
*/ noinstr void irq_save_pkrs(irqentry_state_t *state) { if (!cpu_feature_enabled(X86_FEATURE_PKS)) return; +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + /* + * Save the ref count of the current running process and set it to 0 + * for any irq users to properly track re-entrance + */ + state->pkrs_ref = current->dev_page_access_ref; + current->dev_page_access_ref = 0; +#endif + /* * The thread_pkrs must be maintained separately to prevent global * overrides from 'sticking' on a thread. @@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state) write_pkrs(state->pkrs); current->thread.saved_pkrs = state->thread_pkrs; + +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + WARN_ON_ONCE(current->dev_page_access_ref != 0); + /* Restore the interrupted process reference */ + current->dev_page_access_ref = state->pkrs_ref; +#endif } #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */ diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index c3b361ffa059..06743cce2dbf 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs); #ifndef irqentry_state typedef struct irqentry_state { #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + unsigned int pkrs_ref; +#endif u32 pkrs; u32 thread_pkrs; #endif diff --git a/include/linux/memremap.h b/include/linux/memremap.h index e5862746751b..b6713ee7b218 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -89,6 +89,7 @@ struct dev_pagemap_ops { }; #define PGMAP_ALTMAP_VALID (1 << 0) +#define PGMAP_PROT_ENABLED (1 << 1) /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings diff --git a/include/linux/mm.h b/include/linux/mm.h index 16b799a0522c..9e845515ff15 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page) page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; } +#ifdef 
CONFIG_ZONE_DEVICE_ACCESS_PROTECTION +DECLARE_STATIC_KEY_FALSE(dev_protection_static_key); + +/* + * We make page_is_access_protected() as quick as possible. + * 1) If no mappings have been enabled with extra protection we skip this + * entirely + * 2) Skip pages which are not ZONE_DEVICE + * 3) Only then check if this particular page was mapped with extra + * protections. + */ +static inline bool page_is_access_protected(struct page *page) +{ + if (!static_branch_unlikely(&dev_protection_static_key)) + return false; + if (!is_zone_device_page(page)) + return false; + if (page->pgmap->flags & PGMAP_PROT_ENABLED) + return true; + return false; +} + +void __dev_access_enable(bool global); +void __dev_access_disable(bool global); +static __always_inline void dev_access_enable(bool global) +{ + if (static_branch_unlikely(&dev_protection_static_key)) + __dev_access_enable(global); +} +static __always_inline void dev_access_disable(bool global) +{ + if (static_branch_unlikely(&dev_protection_static_key)) + __dev_access_disable(global); +} +#else +static inline bool page_is_access_protected(struct page *page) +{ + return false; +} +static inline void dev_access_enable(bool global) { } +static inline void dev_access_disable(bool global) { } +#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */ + /* 127: arbitrary random number, small enough to assemble well */ #define page_ref_zero_or_close_to_overflow(page) \ ((unsigned int) page_ref_count(page) + 127u <= 127u) diff --git a/include/linux/sched.h b/include/linux/sched.h index afe01e232935..25d97ab6c757 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1315,6 +1315,9 @@ struct task_struct { struct callback_head mce_kill_me; #endif +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + unsigned int dev_page_access_ref; +#endif /* * New fields for task_struct should be added above here, so that * they are included in the randomized portion of task_struct. 
diff --git a/init/init_task.c b/init/init_task.c index f6889fce64af..9b39f25de59b 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -209,6 +209,9 @@ struct task_struct init_task #ifdef CONFIG_SECCOMP .seccomp = { .filter_count = ATOMIC_INIT(0) }, #endif +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + .dev_page_access_ref = 0, +#endif }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index da8d360fb032..b6a3ee328a89 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) #ifdef CONFIG_MEMCG tsk->active_memcg = NULL; +#endif +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION + tsk->dev_page_access_ref = 0; #endif return tsk; diff --git a/mm/Kconfig b/mm/Kconfig index 1b9bc004d9bc..01dd75720ae6 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -794,6 +794,19 @@ config ZONE_DEVICE If FS_DAX is enabled, then say Y. +config ZONE_DEVICE_ACCESS_PROTECTION + bool "Device memory access protection" + depends on ZONE_DEVICE + depends on ARCH_HAS_SUPERVISOR_PKEYS + + help + Enable the option of having access protections on device memory + areas. This protects against access to device memory which is not + intended such as stray writes. This feature is particularly useful + to protect against corruption of persistent memory. + + If in doubt, say 'Y'. + config DEV_PAGEMAP_OPS bool diff --git a/mm/memremap.c b/mm/memremap.c index fbfc79fd9c24..edad2aa0bd24 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -6,12 +6,16 @@ #include #include #include +#include #include #include #include #include #include #include +#include + +#define PKEY_INVALID (INT_MIN) static DEFINE_XARRAY(pgmap_array); @@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void) } #endif /* CONFIG_DEV_PAGEMAP_OPS */ +#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION +/* + * Note; all devices which have asked for protections share the same key. The + * key may, or may not, have been provided by the core. 
If not, protection + * will remain disabled. The key acquisition is attempted at init time and + * never again. So we don't have to worry about dev_page_pkey changing. + */ +static int dev_page_pkey = PKEY_INVALID; +DEFINE_STATIC_KEY_FALSE(dev_protection_static_key); +EXPORT_SYMBOL(dev_protection_static_key); + +static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot) +{ + if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) { + pgprotval_t val = pgprot_val(prot); + + static_branch_inc(&dev_protection_static_key); + prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey)); + } + return prot; +} + +static void dev_pgprot_put(struct dev_pagemap *pgmap) +{ + if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) + static_branch_dec(&dev_protection_static_key); +} + +void __dev_access_disable(bool global) +{ + unsigned long flags; + + local_irq_save(flags); + if (!--current->dev_page_access_ref) + pks_mknoaccess(dev_page_pkey, global); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(__dev_access_disable); + +void __dev_access_enable(bool global) +{ + unsigned long flags; + + local_irq_save(flags); + /* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */ + if (!current->dev_page_access_ref++) + pks_mkrdwr(dev_page_pkey, global); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(__dev_access_enable); + +/** + * dev_access_protection_init: Configure a PKS key domain for device pages + * + * The domain defaults to the protected state. Device page mappings should set + * the PGMAP_PROT_ENABLED flag when mapping pages. + * + * Note the pkey is never freed. This is run at init time and we either get + * the key or we do not. We need to do this to maintain a constant key (or + * not) as device memory is added or removed.
+ */ +static int __init __dev_access_protection_init(void) +{ + int pkey = pks_key_alloc("Device Memory"); + + if (pkey < 0) + return 0; + + dev_page_pkey = pkey; + + return 0; +} +subsys_initcall(__dev_access_protection_init); +#else +static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot) +{ + return prot; +} +static void dev_pgprot_put(struct dev_pagemap *pgmap) +{ +} +#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */ + static void pgmap_array_delete(struct resource *res) { xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end), @@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap) pgmap_array_delete(res); WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(); + dev_pgprot_put(pgmap); } EXPORT_SYMBOL_GPL(memunmap_pages); @@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) int error, is_ram; bool need_devmap_managed = true; + params.pgprot = dev_pgprot_get(pgmap, params.pgprot); + switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) { -- 2.28.0.rc0.12.gb6a658bd00c9 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-path: From: ira.weiny@intel.com Subject: [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection Date: Fri, 9 Oct 2020 12:49:38 -0700 Message-Id: <20201009195033.3208459-4-ira.weiny@intel.com> In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com> References: <20201009195033.3208459-1-ira.weiny@intel.com> MIME-Version: 1.0 List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "kexec" Errors-To: kexec-bounces+dwmw2=infradead.org@lists.infradead.org To: Andrew Morton , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , Peter Zijlstra Cc: Juri Lelli , linux-aio@kvack.org, linux-efi@vger.kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org, 
linux-mmc@vger.kernel.org, Dave Hansen , dri-devel@lists.freedesktop.org, Ben Segall , linux-mm@kvack.org, target-devel@vger.kernel.org, linux-mtd@lists.infradead.org, linux-kselftest@vger.kernel.org, samba-technical@lists.samba.org, Ira Weiny , ceph-devel@vger.kernel.org, drbd-dev@lists.linbit.com, devel@driverdev.osuosl.org, linux-cifs@vger.kernel.org, linux-nilfs@vger.kernel.org, Vincent Guittot , linux-scsi@vger.kernel.org, linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org, x86@kernel.org, amd-gfx@lists.freedesktop.org, io-uring@vger.kernel.org, cluster-devel@redhat.com, linux-cachefs@redhat.com, intel-wired-lan@lists.osuosl.org, Mel Gorman , xen-devel@lists.xenproject.org, linux-ext4@vger.kernel.org, Fenghua Yu , linux-afs@lists.infradead.org, linux-um@lists.infradead.org, intel-gfx@lists.freedesktop.org, ecryptfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, reiserfs-devel@vger.kernel.org, Steven Rostedt , linux-block@vger.kernel.org, linux-bcache@vger.kernel.org, Dan Williams , Dietmar Eggemann , linux-nfs@vger.kernel.org, linux-ntfs-dev@lists.sourceforge.net, netdev@vger.kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-btrfs@vger.kernel.org

From: Ira Weiny

Device managed memory exposes itself to the kernel direct map which allows
stray pointers to access these device memories.

Stray pointers to normal memory may result in a crash or other undesirable
behavior which, while unfortunate, is usually recoverable with a reboot.
Stray accesses, specifically stray writes, to areas such as non-volatile
memory are permanent in nature and thus are more likely to result in
permanent user data loss than stray accesses to other memory areas.

Furthermore, we protect against reads, which can also help with speculative
reads of poisoned areas. But this is a secondary reason.

Set up an infrastructure for extra device access protection. Then implement
the new protection using the new Protection Keys Supervisor (PKS) on
architectures which support it.

To enable this extra protection, devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here. The kmap infrastructure is
implemented in a follow-on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the type
of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per-task
reference count. This reference count does 2 things.

1) Allows a thread to nest calls to disable protection such that the first
   call to re-enable protection does not 'break' the last access of the
   pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes, for example
   when looping over a sequence of pmem pages.

In addition, we must ensure the reference count is preserved through an
exception, so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use a
kmap call.

The following shows how this works through an exception:

	...
	// ref == 0
	dev_access_enable()  // ref += 1 ==> disable protection
		irq()
			// enable protection
			// ref = 0
			_handler()
				dev_access_enable()   // ref += 1 ==> disable protection
				dev_access_disable()  // ref -= 1 ==> enable protection
			// WARN_ON(ref != 0)
			// disable protection
	do_pmem_thing()  // all good here
	dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
	...

Nested exceptions operate the same way with each exception storing the
interrupted exception state all the way down.
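
The counting scheme described above can be sketched in user space. This is
only an illustrative model under stated assumptions, not the kernel code:
every sim_* name is hypothetical, a boolean stands in for the pkey access
bit in the PKS register, and exception entry/exit are modelled as always
writing that register.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not a kernel structure. */
struct sim_task {
	unsigned int access_ref;  /* models current->dev_page_access_ref */
	bool access_allowed;      /* models the pkey access bit */
	unsigned int reg_writes;  /* counts modelled register (MSR) writes */
};

/* Hypothetical stand-in for the state saved on exception entry. */
struct sim_irq_state {
	unsigned int saved_ref;
	bool saved_access;
};

static void sim_write_reg(struct sim_task *t, bool allow)
{
	t->access_allowed = allow;
	t->reg_writes++;
}

/* Only the 0 -> 1 transition touches the modelled register. */
static void sim_access_enable(struct sim_task *t)
{
	if (!t->access_ref++)
		sim_write_reg(t, true);
}

/* Only the 1 -> 0 transition touches the modelled register. */
static void sim_access_disable(struct sim_task *t)
{
	assert(t->access_ref > 0);
	if (!--t->access_ref)
		sim_write_reg(t, false);
}

/* Exception entry: stash the interrupted count, start the handler at 0. */
static void sim_irq_enter(struct sim_task *t, struct sim_irq_state *s)
{
	s->saved_ref = t->access_ref;
	s->saved_access = t->access_allowed;
	t->access_ref = 0;
	sim_write_reg(t, false);
}

/* Exception exit: the handler must be balanced, then restore the state. */
static void sim_irq_exit(struct sim_task *t, const struct sim_irq_state *s)
{
	assert(t->access_ref == 0);  /* stands in for the WARN_ON check */
	t->access_ref = s->saved_ref;
	sim_write_reg(t, s->saved_access);
}

/* Replays the sequence shown above; returns true if each step matched. */
static bool sim_exception_scenario(void)
{
	struct sim_task t = { 0 };
	struct sim_irq_state s;
	bool ok = true;

	sim_access_enable(&t);   /* ref 0 -> 1, access opened */
	ok &= t.access_ref == 1 && t.access_allowed;

	sim_irq_enter(&t, &s);   /* handler starts protected with ref 0 */
	ok &= t.access_ref == 0 && !t.access_allowed;
	sim_access_enable(&t);
	ok &= t.access_allowed;
	sim_access_disable(&t);
	ok &= !t.access_allowed;
	sim_irq_exit(&t, &s);    /* interrupted context gets ref 1 back */

	ok &= t.access_ref == 1 && t.access_allowed;  /* pmem access is safe */
	sim_access_disable(&t);  /* ref 1 -> 0, protected again */
	ok &= t.access_ref == 0 && !t.access_allowed;
	return ok;
}

/* Modelled register writes for `pages` accesses under one outer enable. */
static unsigned int sim_nested_loop_writes(unsigned int pages)
{
	struct sim_task t = { 0 };

	sim_access_enable(&t);
	for (unsigned int i = 0; i < pages; i++) {
		sim_access_enable(&t);   /* nested: count only, no write */
		sim_access_disable(&t);
	}
	sim_access_disable(&t);
	return t.reg_writes;
}
```

In this model a loop over 100 pages inside one outer enable/disable pair
costs two modelled register writes, where unnested enable/disable pairs
would cost two writes per page.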

The pkey value is never freed, as this optimizes the implementation to be
either on or off using a static branch conditional in the fast paths.

Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Signed-off-by: Ira Weiny
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access. Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done. Then restore it for the interrupted task.
  */
 noinstr void irq_save_pkrs(irqentry_state_t *state)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state->pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
 	/*
 	 * The thread_pkrs must be maintained separately to prevent global
 	 * overrides from 'sticking' on a thread.
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state)
 
 	write_pkrs(state->pkrs);
 	current->thread.saved_pkrs = state->thread_pkrs;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state->pkrs_ref;
+#endif
 }
 
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c3b361ffa059..06743cce2dbf 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 #ifndef irqentry_state
 typedef struct irqentry_state {
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int pkrs_ref;
+#endif
 	u32 pkrs;
 	u32 thread_pkrs;
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..b6713ee7b218 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROT_ENABLED	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..9e845515ff15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ *    1) If no mappings have been enabled with extra protection we skip this
+ *       entirely
+ *    2) Skip pages which are not ZONE_DEVICE
+ *    3) Only then check if this particular page was mapped with extra
+ *       protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int			dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas. This protects against access to device memory which is not
+	  intended such as stray writes. This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note; all devices which have asked for protections share the same key. The
+ * key may, or may not, have been provided by the core. If not, protection
+ * will remain disabled. The key acquisition is attempted at init time and
+ * never again. So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void __dev_access_disable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (!--current->dev_page_access_ref)
+		pks_mknoaccess(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_disable);
+
+void __dev_access_enable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (!current->dev_page_access_ref++)
+		pks_mkrdwr(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state. Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never freed. This is run at init time and we either get
+ * the key or we do not. We need to do this to maintain a constant key (or
+ * not) as device memory is added or removed.
+ */
+static int __init __dev_access_protection_init(void)
+{
+	int pkey = pks_key_alloc("Device Memory");
+
+	if (pkey < 0)
+		return 0;
+
+	dev_page_pkey = pkey;
+
+	return 0;
+}
+subsys_initcall(__dev_access_protection_init);
+#else
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	return prot;
+}
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+}
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct resource *res)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
@@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
+	dev_pgprot_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	int error, is_ram;
 	bool need_devmap_managed = true;
 
+	params.pgprot = dev_pgprot_get(pgmap, params.pgprot);
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9