From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 93A71C54EE9 for ; Mon, 19 Sep 2022 04:55:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:Cc:To:Subject:Message-ID:Date:From: In-Reply-To:References:MIME-Version:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=o5lvrD9AmWRpWQZU89x/XIcljPhbepImctnrB+o/K4o=; b=OMn+okgdjiOhQk 1CTIy4pYUZUB5Ri2eqJ5c/ZET6llcAFXecErxC26rT6HLbAxSQxYlBhELnknWxcnfrmjS9j5v8Sqi cIQ3mAkXqr+RmwjhwQXpEYu+PdSePrGikHrnqf5kGu4dP9JWjE+G6582D83HaJK92z+nBnJlCuC/v hwekG1gxKJkgRxhx5e0Wv7tvKEqXmJesXrw+groANFv3n4rmq2HUqyANBRLF36OviMoBdohxj4BQV dxBCkih8zEEeP+3GXANLXZOkwH69yP3k0ja7EYanSHSlmjA+5UaeCgItF56LJCbasGyLisInFtRcI nbC7RdcIHVTj4WIf5cNw==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1oa8nA-007mlE-LX; Mon, 19 Sep 2022 04:54:00 +0000 Received: from mail-ej1-x62f.google.com ([2a00:1450:4864:20::62f]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1oa8n5-007mdY-3c; Mon, 19 Sep 2022 04:53:57 +0000 Received: by mail-ej1-x62f.google.com with SMTP id r18so61785534eja.11; Sun, 18 Sep 2022 21:53:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date; bh=RlNy82oQRqLWrukEgPPnsPx9ducFrofzFQxvq1WAWcE=; b=SWzD8Jaq4buNaL5BrTco3g79OygvK1ANfle2ttw6VSUUeLNJRuWEUj9aXuIa6WFA6K oJ2zVNi6VPRUkc1GxJH7N9Ss21z6NFqSbPeeoFDQ60kND6rNSOFOR1dgKENozfyLp+mQ s8CWhNENLdyW3ejGTwZKnuOy9TaWaqEucd1lpB3OyayouB4REUo2O7/xpjrcAw3ZMW7N MtAGOcc1BE/dRwmGuZSuF1r8rmrQ+uQZR80XeGi4DbTjXM3rvT/SDlpcke5IYuTSWq1R iAxdcue8bBet5hdIA2QBmip9NBRCbm6iDeBEv5y7jvpcjFHFRGnsMGyKckpSJ5zrm9KI DOUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date; bh=RlNy82oQRqLWrukEgPPnsPx9ducFrofzFQxvq1WAWcE=; b=S5UOLEA+jyaIdfchmcBm+W5TYt7R86P59X7Pzi+n+GZaYN2EaIILQk5X9HwO+a27ov p8j5Aq29jhJYISJjzVbIvDOtfBLqfgCBY3VL3gBxU0qVcRXJMt/Usv8VgZfnrezp+Io4 aT/PEkE6E+cOqKfubGgKeHg37FPEmITixm47E602CsJ+JVORNlsgFqQ/VZkkbD3+DT+R SUxTmSfpbnPmveD7yE0nO1OsSX1fCPuVA0SNL9A4K65+d0Rsl1jr5V9g8buZHH2uwauY xgCrajQhBsDDxGws/Md13ehUurm66kV/USxVP3PgE3K04NsPhCuyuohc3y2O/+unrp84 LX2w== X-Gm-Message-State: ACrzQf3Y1ZGlef8VLTDH6hLj3K+xkNQcHrcu6YsJFDpLzPmwqi32yqF/ /u7A+tA/07dJqMT+3LIm1EGlLxaYfPXdjfRpv6Y= X-Google-Smtp-Source: AMsMyM4AYmx+Vf7zVPAGNAasxPMQ83IjOnq+AdlCIQY9kw2WYK+mzB/b5yECnUiPW9LwKnU+Gnlr2169w4VKov+0wUo= X-Received: by 2002:a17:907:7f8e:b0:780:ec2b:bcc5 with SMTP id qk14-20020a1709077f8e00b00780ec2bbcc5mr6197116ejc.192.1663563228607; Sun, 18 Sep 2022 21:53:48 -0700 (PDT) MIME-Version: 1.0 References: <20220822082120.8347-1-yangyicong@huawei.com> <20220822082120.8347-5-yangyicong@huawei.com> <1e8642d5-0e2d-5747-d0d2-5aa0817ea4af@arm.com> <1125554b-c183-23c4-5516-95b918a761cc@arm.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Mon, 19 Sep 2022 16:53:36 +1200 Message-ID: Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation To: Anshuman Khandual Cc: Yicong Yang , akpm@linux-foundation.org, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, x86@kernel.org, catalin.marinas@arm.com, will@kernel.org, linux-doc@vger.kernel.org, corbet@lwn.net, peterz@infradead.org, arnd@arndb.de, linux-kernel@vger.kernel.org, darren@os.amperecomputing.com, yangyicong@hisilicon.com, huzhanyuan@oppo.com, lipeifeng@oppo.com, zhangshiming@oppo.com, guojian@oppo.com, realmz6@gmail.com, linux-mips@vger.kernel.org, openrisc@lists.librecores.org, linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org, wangkefeng.wang@huawei.com, xhao@linux.alibaba.com, prime.zeng@hisilicon.com, Barry Song , Nadav Amit , Mel Gorman X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220918_215355_494464_E8B1D66E X-CRM114-Status: GOOD ( 47.27 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual wrote: > > > > On 9/15/22 12:12, Barry Song wrote: > > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual > > wrote: > >> > >> > >> > >> On 9/9/22 11:05, Barry Song wrote: > >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > >>> wrote: > >>>> > >>>> > >>>> > >>>> On 8/22/22 13:51, Yicong Yang wrote: > >>>>> From: Barry Song > >>>>> > >>>>> on x86, batched and deferred tlb shootdown has lead to 90% > >>>>> performance increase on tlb shootdown. on arm64, HW can do > >>>>> tlb shootdown without software IPI. But sync tlbi is still > >>>>> quite expensive. > >>>>> > >>>>> Even running a simplest program which requires swapout can > >>>>> prove this is true, > >>>>> #include > >>>>> #include > >>>>> #include > >>>>> #include > >>>>> > >>>>> int main() > >>>>> { > >>>>> #define SIZE (1 * 1024 * 1024) > >>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, > >>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0); > >>>>> > >>>>> memset(p, 0x88, SIZE); > >>>>> > >>>>> for (int k = 0; k < 10000; k++) { > >>>>> /* swap in */ > >>>>> for (int i = 0; i < SIZE; i += 4096) { > >>>>> (void)p[i]; > >>>>> } > >>>>> > >>>>> /* swap out */ > >>>>> madvise(p, SIZE, MADV_PAGEOUT); > >>>>> } > >>>>> } > >>>>> > >>>>> Perf result on snapdragon 888 with 8 cores by using zRAM > >>>>> as the swap block device. > >>>>> > >>>>> ~ # perf record taskset -c 4 ./a.out > >>>>> [ perf record: Woken up 10 times to write data ] > >>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] > >>>>> ~ # perf report > >>>>> # To display the perf.data header info, please use --header/--header-only options. > >>>>> # To display the perf.data header info, please use --header/--header-only options. > >>>>> # > >>>>> # > >>>>> # Total Lost Samples: 0 > >>>>> # > >>>>> # Samples: 60K of event 'cycles' > >>>>> # Event count (approx.): 35706225414 > >>>>> # > >>>>> # Overhead Command Shared Object Symbol > >>>>> # ........ ....... ................. ............................................................................. > >>>>> # > >>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq > >>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > >>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages > >>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write > >>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush > >>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock > >>>>> 3.49% a.out [kernel.kallsyms] [k] memset64 > >>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page > >>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock > >>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 > >>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load > >>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock > >>>>> > >>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > >>>>> swapping in/out a page mapped by only one process. If the > >>>>> page is mapped by multiple processes, typically, like more > >>>>> than 100 on a phone, the overhead would be much higher as > >>>>> we have to run tlb flush 100 times for one single page. > >>>>> Plus, tlb flush overhead will increase with the number > >>>>> of CPU cores due to the bad scalability of tlb shootdown > >>>>> in HW, so those ARM64 servers should expect much higher > >>>>> overhead. > >>>>> > >>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush > >>>>> is actually used by the final dsb() to wait for the completion > >>>>> of tlb flush. This provides us a very good chance to leverage > >>>>> the existing batched tlb in kernel. The minimum modification > >>>>> is that we only send async tlbi in the first stage and we send > >>>>> dsb while we have to sync in the second stage. > >>>>> > >>>>> With the above simplest micro benchmark, collapsed time to > >>>>> finish the program decreases around 5%. > >>>>> > >>>>> Typical collapsed time w/o patch: > >>>>> ~ # time taskset -c 4 ./a.out > >>>>> 0.21user 14.34system 0:14.69elapsed > >>>>> w/ patch: > >>>>> ~ # time taskset -c 4 ./a.out > >>>>> 0.22user 13.45system 0:13.80elapsed > >>>>> > >>>>> Also, Yicong Yang added the following observation. > >>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server, > >>>>> observed an improvement around 12.5% with command > >>>>> `time ./swap_bench`. > >>>>> w/o w/ > >>>>> real 0m13.460s 0m11.771s > >>>>> user 0m0.248s 0m0.279s > >>>>> sys 0m12.039s 0m11.458s > >>>>> > >>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush() > >>>>> which has been eliminated by this patch: > >>>>> > >>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report > >>>>> [...] > >>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush > >>>>> > >>>>> Cc: Jonathan Corbet > >>>>> Cc: Nadav Amit > >>>>> Cc: Mel Gorman > >>>>> Tested-by: Yicong Yang > >>>>> Tested-by: Xin Hao > >>>>> Signed-off-by: Barry Song > >>>>> Signed-off-by: Yicong Yang > >>>>> --- > >>>>> .../features/vm/TLB/arch-support.txt | 2 +- > >>>>> arch/arm64/Kconfig | 1 + > >>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ > >>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- > >>>>> 4 files changed, 40 insertions(+), 3 deletions(-) > >>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h > >>>>> > >>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt > >>>>> index 1c009312b9c1..2caf815d7c6c 100644 > >>>>> --- a/Documentation/features/vm/TLB/arch-support.txt > >>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt > >>>>> @@ -9,7 +9,7 @@ > >>>>> | alpha: | TODO | > >>>>> | arc: | TODO | > >>>>> | arm: | TODO | > >>>>> - | arm64: | TODO | > >>>>> + | arm64: | ok | > >>>>> | csky: | TODO | > >>>>> | hexagon: | TODO | > >>>>> | ia64: | TODO | > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > >>>>> index 571cc234d0b3..09d45cd6d665 100644 > >>>>> --- a/arch/arm64/Kconfig > >>>>> +++ b/arch/arm64/Kconfig > >>>>> @@ -93,6 +93,7 @@ config ARM64 > >>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 > >>>>> select ARCH_SUPPORTS_NUMA_BALANCING > >>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK > >>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > >>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT > >>>>> select ARCH_WANT_DEFAULT_BPF_JIT > >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > >>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h > >>>>> new file mode 100644 > >>>>> index 000000000000..fedb0b87b8db > >>>>> --- /dev/null > >>>>> +++ b/arch/arm64/include/asm/tlbbatch.h > >>>>> @@ -0,0 +1,12 @@ > >>>>> +/* SPDX-License-Identifier: GPL-2.0 */ > >>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H > >>>>> +#define _ARCH_ARM64_TLBBATCH_H > >>>>> + > >>>>> +struct arch_tlbflush_unmap_batch { > >>>>> + /* > >>>>> + * For arm64, HW can do tlb shootdown, so we don't > >>>>> + * need to record cpumask for sending IPI > >>>>> + */ > >>>>> +}; > >>>>> + > >>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */ > >>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > >>>>> index 412a3b9a3c25..23cbc987321a 100644 > >>>>> --- a/arch/arm64/include/asm/tlbflush.h > >>>>> +++ b/arch/arm64/include/asm/tlbflush.h > >>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > >>>>> dsb(ish); > >>>>> } > >>>>> > >>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > >>>>> + > >>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > >>>>> unsigned long uaddr) > >>>>> { > >>>>> unsigned long addr; > >>>>> > >>>>> dsb(ishst); > >>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); > >>>>> + addr = __TLBI_VADDR(uaddr, ASID(mm)); > >>>>> __tlbi(vale1is, addr); > >>>>> __tlbi_user(vale1is, addr); > >>>>> } > >>>>> > >>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > >>>>> + unsigned long uaddr) > >>>>> +{ > >>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); > >>>>> +} > >>>>> + > >>>>> static inline void flush_tlb_page(struct vm_area_struct *vma, > >>>>> unsigned long uaddr) > >>>>> { > >>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, > >>>>> dsb(ish); > >>>>> } > >>>>> > >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) > >>>>> +{ > >>>>> + return true; > >>>>> +} > >>>> > >>>> Always defer and batch up TLB flush, unconditionally ? > >>> > >>> My understanding is we actually don't need tlbbatch for a machine with one > >>> or two cores as the tlb flush is not expensive. even for a system with four > >>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%. > >>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with > >>> 100 crores, the cost is incredibly huge. > >> > >> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still > >> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that > >> expensive as well, while queuing up individual TLB flushes ? > > > > This one is much much cheaper as it is not waiting for the > > completion of tlbi. waiting for the completion of tlbi is a big > > deal in arm64, thus, similar optimization can be seen here > > > > 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation > > when page aging"). > > > > > >> > >> The very idea behind TLB deferral is the opportunity it (might) provide > >> to accumulate address ranges and cpu masks so that individual TLB flush > >> can be replaced with a more cost effective range based TLB flush. Hence > >> I guess unless address range or cpumask based cost effective TLB flush > >> is available, deferral does not improve the unmap performance as much. > > > > > > After sending tlbi, if we wait for the completion of tlbi, we have to get Ack > > from all cpus in the system, tlbi is not scalable. The point here is that we > > avoid waiting for each individual TLBi. Alternatively, they are batched. If > > you read the benchmark in the commit log, you can find the great decline > > in the cost to swap out a page. > > Alright, although collecting and deferring 'dsb(ish)' to the very end, does > not feel like a direct fit case for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH but I > guess it can be used to improve unmap performance on arm64. > > But is this 'dsb(ish)' deferral architecturally valid ? yes as dsb(ish) ensures the completion of tlbi. https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers We are even depending on the dsb(ish) during context switch in commit: 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation when page aging"). Before the context switch, lots of tlbi could have been sent. > > Let's examine single page unmap path via try_to_unmap_one(). > > should_defer_flush() { > ptep_get_and_clear() > set_tlb_ubc_flush_pending() > arch_tlbbatch_add_mm() > __flush_tlb_page_nosync() > } else { > ptep_clear_flush() > ptep_get_and_clear() > flush_tlb_page() > flush_tlb_page_nosync() > __flush_tlb_page_nosync() > dsb(ish) > } > > __flush_tlb_page_nosync() > { > dsb(ishst); > addr = __TLBI_VADDR(uaddr, ASID(mm)); > __tlbi(vale1is, addr); > __tlbi_user(vale1is, addr); > } > > Currently without TLB deferral, 'dsb(ish)' gets executed just after __tlbi() > and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function. > > #define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \ > "tlbi " #op "\n" \ > ALTERNATIVE("nop\n nop", \ > "dsb ish\n tlbi " #op, \ > ARM64_WORKAROUND_REPEAT_TLBI, \ > CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ > : : ) > > #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \ > "tlbi " #op ", %0\n" \ > ALTERNATIVE("nop\n nop", \ > "dsb ish\n tlbi " #op ", %0", \ > ARM64_WORKAROUND_REPEAT_TLBI, \ > CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ > : : "r" (arg)) > > #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg) > > #define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0) > > #define __tlbi_user(op, arg) do { \ > if (arm64_kernel_unmapped_at_el0()) \ > __tlbi(op, (arg) | USER_ASID_FLAG); \ > } while (0) > > There is already a 'dsb(ish)' in between two subsequent TLB operations in > case ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess > deferral should not enabled on such systems ? > > But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush() > via try_to_unmap_flush[_dirty](). There might be random number of instructions > in between __tlbi()/__tlbi_user() i.e 'tlbi' instructions and final 'dsb(ish)'. > Just wondering, if such 'detached in time with other instructions in between' > 'tlbi' and 'dsb(ish)', is architecturally valid ? yes. I think so, arm64 even depends on the implicit dsb in context switch. > > There is a comment in 'struct tlbflush_unmap_batch'. > > /* > * The arch code makes the following promise: generic code can modify a > * PTE, then call arch_tlbbatch_add_mm() (which internally provides all > * needed barriers), then call arch_tlbbatch_flush(), and the entries > * will be flushed on all CPUs by the time that arch_tlbbatch_flush() > * returns. > */ > > > It expects arch_tlbbatch_add_mm() to provide all barriers, hence wondering if > that would include just the first 'dsb(isht)' not the subsequent 'dsb(ish)' ? yes. include/asm/tlbflush.h, we see the below comments: * TLB Invalidation * ================ * * This header file implements the low-level TLB invalidation routines * (sometimes referred to as "flushing" in the kernel) for arm64. * * Every invalidation operation uses the following template: * * DSB ISHST // Ensure prior page-table updates have completed * TLBI ... // Invalidate the TLB * DSB ISH // Ensure the TLB invalidation has completed * if (invalidated kernel mappings) * ISB // Discard any instructions fetched from the old mapping * Clearly dsb(ishst) has ensured page-table updates are visible to all CPUs. Thanks Barry _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel