IOMMU Archive on lore.kernel.org
* [RFC PATCH V1] riscv-privileged: Add broadcast mode to sfence.vma
@ 2019-09-19 12:35 guoren
  2019-09-19 16:04 ` [tech-privileged] " Andrew Waterman
  0 siblings, 1 reply; 4+ messages in thread
From: guoren @ 2019-09-19 12:35 UTC (permalink / raw)
  To: benh
  Cc: julien.thierry, catalin.marinas, palmer, will.deacon,
	Atish.Patra, julien.grall, gary, linux-riscv, kvmarm,
	jean-philippe, linux-csky, rppt, Guo Ren, tech-privileged,
	marc.zyngier, linux-arm-kernel, feiteng_li, Anup.Patel,
	linux-kernel, iommu, dwmw2

From: Guo Ren <ren_guo@c-sky.com>

The patch is for https://github.com/riscv/riscv-isa-manual

The proposal was discussed at the LPC 2019 RISC-V MC, ref [1]. Here is the
formal patch.

Introduction
============

Using a hardware TLB broadcast invalidation instruction to maintain the
system TLBs is a good choice, and it simplifies the system software design.
This proposal adds a broadcast mode to sfence.vma in the riscv-privileged
specification. To support the sfence.vma broadcast mode, two modifications
are introduced below:

 1) Add PGD.PPN (the root page table's PPN) as a unique identifier of the
    address space in addition to asid/vmid. Unlike the dynamically changing
    asid/vmid, PGD.PPN is fixed throughout the life cycle of the address
    space. This enables uniform address space identification across
    different TLB systems. (It is difficult to unify the asid/vmid between
    the CPU system and the IOMMU system, because their mechanisms are
    different.)

 2) Change the definition of the sfence.vma instruction from synchronous
    mode to asynchronous mode, which means that completion of the TLB
    operation is not guaranteed when the sfence.vma instruction retires.
    Software confirms completion by checking a flag bit on the hart.
    Hardware can also notify software that the sfence.vma requests have
    finished by generating an interrupt. This alleviates the long latency
    of TLB invalidation in a PCI ATS system.

Add S1/S2.PGD.PPN for ASID/VMID
===============================

PGD is the page global directory (a Linux term) and PPN is the physical page
number (defined in the RISC-V spec). PGD.PPN corresponds to the root page
table pointer of the address space, i.e. mm->pgd in Linux.

In CPU/IOMMU TLBs, we use asid/vmid to distinguish the address spaces of
processes or virtual machines. Because the id encoding is limited, it can
only cover a part (window) of the address spaces. S1/S2.PGD.PPN are the
root page tables' PPNs of the address spaces, and they uniquely identify
the address spaces.

For a CPU SMP system, software can use the context switch path to ensure
that the asid/vmid is consistent on all harts (see the arm64 asid
mechanism). In this way, a TLB broadcast invalidation instruction can
identify the target address space on all harts by asid/vmid.

Unlike the CPU SMP system, the DMA-IOMMU system has no context switch, so
consistency with the CPU asid/vmid cannot be guaranteed. We therefore need
a unique identifier for the address space to bridge the TLBs of different
systems.

That identifier is PGD.PPN (for virtualization scenarios: S1/S2.PGD.PPN).

current:
 sfence.vma  rs1 = vaddr, rs2 = asid
 hfence.vvma rs1 = vaddr, rs2 = asid
 hfence.gvma rs1 = gaddr, rs2 = vmid

proposed:
 sfence.vma  rs1 = vaddr, rs2 = mode:ppn:asid
 hfence.vvma rs1 = vaddr, rs2 = mode:ppn:asid
 hfence.gvma rs1 = gaddr, rs2 = mode:ppn:vmid

 mode      - broadcast | local
 ppn       - the PPN of the root page table of the address space
 vmid/asid - the window identifier of the address space
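
As an illustration of the proposed rs2 layout, the sketch below packs
mode:ppn:asid using the RV64 field positions from the figure in the diff
further down (MODE in bits 63:60 with only bit 63 meaningful, PPN in bits
59:16, ASID in bits 15:0). The helper and macro names (sfence_rs2_pack,
SFENCE_MODE_BROADCAST, ...) are hypothetical and exist in neither the spec
nor the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions follow the proposed RV64 rs2 figure; names are illustrative. */
#define SFENCE_ASID_BITS      16
#define SFENCE_PPN_BITS       44
#define SFENCE_PPN_SHIFT      SFENCE_ASID_BITS
#define SFENCE_MODE_BROADCAST (UINT64_C(1) << 63)
#define SFENCE_MODE_LOCAL     UINT64_C(0)

/* Build the rs2 operand from its three fields. */
static inline uint64_t sfence_rs2_pack(uint64_t mode, uint64_t ppn, uint64_t asid)
{
    assert(ppn < (UINT64_C(1) << SFENCE_PPN_BITS));
    assert(asid < (UINT64_C(1) << SFENCE_ASID_BITS));
    return mode | (ppn << SFENCE_PPN_SHIFT) | asid;
}

/* Extract the root page table PPN, as a remote TLB (e.g. an IOMMU) would. */
static inline uint64_t sfence_rs2_ppn(uint64_t rs2)
{
    return (rs2 >> SFENCE_PPN_SHIFT) & ((UINT64_C(1) << SFENCE_PPN_BITS) - 1);
}

/* Extract the ASID, as a local hart would. */
static inline uint64_t sfence_rs2_asid(uint64_t rs2)
{
    return rs2 & ((UINT64_C(1) << SFENCE_ASID_BITS) - 1);
}

/* Only bit 63 of MODE is defined: 1 = broadcast, 0 = local. */
static inline int sfence_rs2_is_broadcast(uint64_t rs2)
{
    return (int)((rs2 >> 63) & 1);
}
```

Keeping the ASID in the low bits means a local-only hart can mask off the
upper fields and treat the operand as it does today.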

At the Linux Plumbers Conference 2019 RISC-V MC, ref [1], we showed two
IOMMU examples to explain how this works with hardware.

1) In a lightweight IOMMU system (up to 64 address spaces), the hardware
   could directly convert PGD.PPN into DID (IOMMU ASID)

2) For the PCI ATS scenario, its IO ASID/VMID encoding space can support
   a very large number of address spaces. We use two reverse mapping
   tables to let the hardware translate S1/S2.PGD.PPN into IO ASID/VMID.
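
To make the first example concrete, here is a minimal software model of a
PGD.PPN to DID table for the lightweight IOMMU case. It is only a sketch:
the 64-entry limit comes from the text above, while the struct layout, the
function names and the linear search are assumptions (real hardware would
more likely use an associative lookup).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOMMU_MAX_DID 64   /* "up to 64 address spaces" */

/* Hypothetical model: one slot per DID, holding the bound root PPN. */
struct did_table {
    uint64_t ppn[IOMMU_MAX_DID];
    uint8_t  valid[IOMMU_MAX_DID];
};

/* Resolve a broadcast PPN to a DID; -1 means nothing is cached for it. */
static inline int did_lookup(const struct did_table *t, uint64_t ppn)
{
    assert(t != NULL);
    for (int did = 0; did < IOMMU_MAX_DID; did++)
        if (t->valid[did] && t->ppn[did] == ppn)
            return did;
    return -1;
}

/* Bind a root PPN to a DID, reusing an existing binding; -1 if the table is full. */
static inline int did_bind(struct did_table *t, uint64_t ppn)
{
    int did = did_lookup(t, ppn);
    if (did >= 0)
        return did;
    for (did = 0; did < IOMMU_MAX_DID; did++) {
        if (!t->valid[did]) {
            t->ppn[did] = ppn;
            t->valid[did] = 1;
            return did;
        }
    }
    return -1;
}
```

A broadcast invalidation that carries a PPN with no binding can simply be
acknowledged, since the IOMMU holds no translations for that address space.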

ASYNC BROADCAST SFENCE.VMA
===========================

To support the high-latency broadcast sfence.vma operation in the PCI ATS
usage scenario, we change sfence.vma from synchronous mode to asynchronous
mode. (A simpler implementation may implement only the synchronous mode in
hardware; software written for the asynchronous mode still works.)

To implement the asynchronous mode, three features are added:
 1) sstatus:TLBI
    A status bit, TLBI, is added to the sstatus register. It indicates
    whether there are still outstanding sfence.vma requests on the current
    hart.
    Value:
      1: sfence.vma requests are not completed.
      0: all sfence.vma requests completed, request queue is empty.

 2) sstatus:TLBIC
    A control bit, TLBIC, is added to the sstatus register. It is written
    by software.
    "Write 1" makes the current hart check whether there are still
    outstanding sfence.vma requests. If there are unfinished requests, an
    interrupt is generated when they complete, notifying software that all
    of the current sfence.vma requests have been completed.
    "Write 0" has no effect.

 3) supervisor interrupt registers (sip & sie): TLBI finish interrupt
    A per-hart interrupt is added to the supervisor interrupt registers.
    When all sfence.vma requests are completed and sstatus:TLBIC has been
    triggered, the hart receives a TLBI finish interrupt. It is defined in
    sip & sie just like the timer, software and external interrupts.
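
The interaction of the three features can be modelled in software as a
small state machine. This is a sketch of the behaviour described in the
list above, not real hardware or kernel code; the struct and function names
are made up, and it assumes the interrupt is raised only when TLBIC was
written while requests were outstanding.

```c
#include <assert.h>
#include <stdbool.h>

/* Software model of the proposed per-hart state (illustrative names). */
struct hart_tlbi {
    unsigned outstanding;  /* sfence.vma requests not yet completed */
    bool     armed;        /* set by writing 1 to sstatus:TLBIC */
    bool     stlbip;       /* TLBI finish interrupt pending in sip */
};

/* sstatus:TLBI reads 1 while any request is outstanding. */
static inline bool tlbi_bit(const struct hart_tlbi *h)
{
    return h->outstanding > 0;
}

/* An asynchronous sfence.vma retires at once and queues one request. */
static inline void issue_sfence(struct hart_tlbi *h)
{
    h->outstanding++;
}

/* Writing 1 to TLBIC arms the interrupt only if requests are outstanding. */
static inline void write_tlbic(struct hart_tlbi *h)
{
    if (h->outstanding > 0)
        h->armed = true;
}

/* One completion arrives from the interconnect (e.g. from an IOMMU). */
static inline void complete_one(struct hart_tlbi *h)
{
    assert(h->outstanding > 0);
    if (--h->outstanding == 0 && h->armed) {
        h->armed = false;
        h->stlbip = true;  /* raise the TLBI finish interrupt */
    }
}
```

In this model a hart that only polls sstatus:TLBI never takes the
interrupt, which matches the two-level waiting scheme used below.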

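Pseudocode:

flush_tlb_page(vma, addr) {
    asid = cpu_asid(vma->vm_mm);
    // mm->pgd is a kernel virtual address; convert it to the root PPN
    ppn = virt_to_pfn(vma->vm_mm->pgd);

    // 1. start request; MODE(1) selects broadcast
    sfence.vma (addr, MODE(1)|PPN_OFFSET(ppn)|asid);

    // 2. loop check: busy-poll for at most 1ms
    while (sstatus:TLBI) if (time_out() > 1ms) break;

    // 3. still pending: arm the interrupt and sleep (io_schedule)
    while (sstatus:TLBI) {
        ...
        set sstatus:TLBIC;
        wait_TLBI_finish_interrupt();
    }
}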

Here we use a two-level check:
 1) Poll sstatus:TLBI in a loop; the CPU can still respond to interrupts.
 2) Set sstatus:TLBIC and wait for the irq; the CPU schedules other tasks.
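
The two-level check can be exercised against a simulated hart. The sketch
below is a stand-in only: it counts polls instead of measuring 1ms, and
read_tlbi, arm_and_sleep and wait_tlbi_done are hypothetical names, not
existing kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated hart: sstatus:TLBI reads 1 until polls_left reaches zero. */
struct sim_hart {
    unsigned polls_left;  /* polls for which TLBI still reads 1 */
    unsigned irq_waits;   /* times the caller fell back to the irq path */
};

/* Model of reading sstatus:TLBI; each read consumes one unit of latency. */
static inline bool read_tlbi(struct sim_hart *h)
{
    if (h->polls_left > 0) {
        h->polls_left--;
        return true;
    }
    return false;
}

/* Stand-in for "set sstatus:TLBIC, then sleep until the finish interrupt". */
static inline void arm_and_sleep(struct sim_hart *h)
{
    h->irq_waits++;
    h->polls_left = 0;  /* the interrupt fires once all requests completed */
}

/* Level 1: bounded spin. Level 2: arm TLBIC and sleep until done. */
static inline void wait_tlbi_done(struct sim_hart *h, unsigned spin_budget)
{
    assert(spin_budget > 0);
    for (unsigned i = 0; i < spin_budget; i++)
        if (!read_tlbi(h))
            return;            /* completed during the spin phase */
    while (read_tlbi(h))
        arm_and_sleep(h);      /* long latency: give up the CPU */
}
```

Short invalidations finish inside the spin loop and never pay for an
interrupt; only long ones (such as PCI ATS) reach the sleeping path.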

ACE-DVM Example
===============

In fact, "broadcasting addr, asid, vmid, S1/S2.PGD.PPN to interconnects"
and "ASYNC SFENCE.VMA" could be implemented with the ACE-DVM protocol,
ref [2].

There are 3 types of transactions in DVM:

 - DVM operation
   Send all information to the interconnect, including addr, asid,
   S1.PGD.PPN, vmid, S2.PGD.PPN.

 - DVM synchronization
   Check that all DVM operations have been completed. If not, a state
   machine waits for the DVM complete transactions.

 - DVM complete
   Return transaction from components, e.g. an IOMMU. If the hart has
   received all DVM completes triggered by its sfence.vma instructions and
   "sstatus:TLBIC" has been set, a TLBI finish interrupt is triggered.

(Actually, we do not need to implement the above functions strictly
 according to the ACE specification :P )

 1: https://www.linuxplumbersconf.org/event/4/contributions/307/
 2: AMBA AXI and ACE Protocol Specification - "Distributed Virtual Memory
    Transactions"

Signed-off-by: Guo Ren <ren_guo@c-sky.com>
Reviewed-by: Li Feiteng <feiteng_li@c-sky.com>
---
 src/hypervisor.tex |  43 ++++++++-------
 src/supervisor.tex | 155 +++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 143 insertions(+), 55 deletions(-)

diff --git a/src/hypervisor.tex b/src/hypervisor.tex
index 47b90b2..3718819 100644
--- a/src/hypervisor.tex
+++ b/src/hypervisor.tex
@@ -1094,15 +1094,15 @@ The hypervisor extension adds two new privileged fence instructions.
 \multicolumn{1}{c|}{opcode} \\
 \hline
 7 & 5 & 5 & 3 & 5 & 7 \\
-HFENCE.GVMA & vmid & gaddr & PRIV & 0 & SYSTEM \\
-HFENCE.VVMA & asid & vaddr & PRIV & 0 & SYSTEM \\
+HFENCE.GVMA & mode:ppn:vmid & gaddr & PRIV & 0 & SYSTEM \\
+HFENCE.VVMA & mode:ppn:asid & vaddr & PRIV & 0 & SYSTEM \\
 \end{tabular}
 \end{center}
 
 The hypervisor memory-management fence instructions, HFENCE.GVMA and
 HFENCE.VVMA, are valid only in HS-mode when {\tt mstatus}.TVM=0, or in M-mode
 (irrespective of {\tt mstatus}.TVM).
-These instructions perform a function similar to SFENCE.VMA
+These instructions perform a function similar to SFENCE.VMA (broadcast/local)
 (Section~\ref{sec:sfence.vma}), except applying to the guest-physical
 memory-management data structures controlled by CSR {\tt hgatp} (HFENCE.GVMA)
 or the VS-level memory-management data structures controlled by CSR {\tt vsatp}
@@ -1136,11 +1136,10 @@ An HFENCE.VVMA instruction applies only to a single virtual machine, identified
 by the setting of {\tt hgatp}.VMID when HFENCE.VVMA executes.
 \end{commentary}
 
-When {\em rs2}$\neq${\tt x0}, bits XLEN-1:ASIDMAX of the value held in {\em
-rs2} are reserved for future use and should be zeroed by software and ignored
-by current implementations.
-Furthermore, if ASIDLEN~$<$~ASIDMAX, the implementation shall ignore bits
-ASIDMAX-1:ASIDLEN of the value held in {\em rs2}.
+When {\em rs2}$\neq${\tt x0}, its bits encode three fields: mode, ppn, and asid.
+1) mode controls whether HFENCE.VVMA is broadcast or not.
+2) ppn is the root page table's PPN of the asid address space.
+3) asid is the identifier of the process in the virtual machine.
 
 \begin{commentary}
 Simpler implementations of HFENCE.VVMA can ignore the guest virtual address in
@@ -1168,11 +1167,10 @@ physical addresses in PMP address registers (Section~\ref{sec:pmp}) and in page
 table entries (Sections \ref{sec:sv32}, \ref{sec:sv39}, and~\ref{sec:sv48}).
 \end{commentary}
 
-When {\em rs2}$\neq${\tt x0}, bits XLEN-1:VMIDMAX of the value held in {\em
-rs2} are reserved for future use and should be zeroed by software and ignored
-by current implementations.
-Furthermore, if VMIDLEN~$<$~VMIDMAX, the implementation shall ignore bits
-VMIDMAX-1:VMIDLEN of the value held in {\em rs2}.
+When {\em rs2}$\neq${\tt x0}, its bits encode three fields: mode, ppn, and vmid.
+1) mode controls whether HFENCE.GVMA is broadcast or not.
+2) ppn is the root page table's PPN of the vmid address space.
+3) vmid is the identifier of the virtual machine.
 
 \begin{commentary}
 Simpler implementations of HFENCE.GVMA can ignore the guest physical address in
@@ -1567,21 +1565,22 @@ register.
 \subsection{Memory-Management Fences}
 
 The behavior of the SFENCE.VMA instruction is affected by the current
-virtualization mode V.  When V=0, the virtual-address argument is an HS-level
-virtual address, and the ASID argument is an HS-level ASID.
+virtualization mode V.  When V=0, the rs1 argument is an HS-level
+virtual address, and the rs2 argument is an HS-level ASID and root page table's PPN.
 The instruction orders stores only to HS-level address-translation structures
 with subsequent HS-level address translations.
 
-When V=1, the virtual-address argument to SFENCE.VMA is a guest virtual
-address within the current virtual machine, and the ASID argument is a VS-level
-ASID within the current virtual machine.
+When V=1, the rs1 argument to SFENCE.VMA is a guest virtual
+address within the current virtual machine, and the rs2 argument is a VS-level
+ASID and root page table's PPN within the current virtual machine.
 The current virtual machine is identified by the VMID field of CSR {\tt hgatp},
-and the effective ASID can be considered to be the combination of this VMID
-with the VS-level ASID.
+and the effective ASID and root page table's PPN can be considered to be the
+combination of this VMID and root page table's PPN with the VS-level ASID and
+root page table's PPN.
 The SFENCE.VMA instruction orders stores only to the VS-level
 address-translation structures with subsequent VS-level address translations
-for the same virtual machine, i.e., only when {\tt hgatp}.VMID is the same as
-when the SFENCE.VMA executed.
+for the same virtual machine, i.e., only when {\tt hgatp}.VMID and {\tt hgatp}.PPN are
+the same as when the SFENCE.VMA executed.
 
 Hypervisor instructions HFENCE.GVMA and HFENCE.VVMA provide additional
 memory-management fences to complement SFENCE.VMA.
diff --git a/src/supervisor.tex b/src/supervisor.tex
index ba3ced5..2877b7a 100644
--- a/src/supervisor.tex
+++ b/src/supervisor.tex
@@ -47,10 +47,12 @@ register keeps track of the processor's current operating state.
 \begin{center}
 \setlength{\tabcolsep}{4pt}
 \scalebox{0.95}{
-\begin{tabular}{cWcccccWccccWcc}
+\begin{tabular}{cccWcccccWccccWcc}
 \\
 \instbit{31} &
-\instbitrange{30}{20} &
+\instbit{30} &
+\instbit{29} &
+\instbitrange{28}{20} &
 \instbit{19} &
 \instbit{18} &
 \instbit{17} &
@@ -66,6 +68,8 @@ register keeps track of the processor's current operating state.
 \instbit{0} \\
 \hline
 \multicolumn{1}{|c|}{SD} &
+\multicolumn{1}{|c|}{TLBI} &
+\multicolumn{1}{|c|}{TLBIC} &
 \multicolumn{1}{c|}{\wpri} &
 \multicolumn{1}{c|}{MXR} &
 \multicolumn{1}{c|}{SUM} &
@@ -82,7 +86,7 @@ register keeps track of the processor's current operating state.
 \multicolumn{1}{c|}{\wpri}
 \\
 \hline
-1 & 11 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1 \\
+1 & 1 & 1 & 10 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1 \\
 \end{tabular}}
 \end{center}
 }
@@ -95,10 +99,12 @@ register keeps track of the processor's current operating state.
 {\footnotesize
 \begin{center}
 \setlength{\tabcolsep}{4pt}
-\begin{tabular}{cMFScccc}
+\begin{tabular}{cccMFScccc}
 \\
 \instbit{SXLEN-1} &
-\instbitrange{SXLEN-2}{34} &
+\instbit{SXLEN-2} &
+\instbit{SXLEN-3} &
+\instbitrange{SXLEN-4}{34} &
 \instbitrange{33}{32} &
 \instbitrange{31}{20} &
 \instbit{19} &
@@ -107,6 +113,8 @@ register keeps track of the processor's current operating state.
  \\
 \hline
 \multicolumn{1}{|c|}{SD} &
+\multicolumn{1}{|c|}{TLBI} &
+\multicolumn{1}{|c|}{TLBIC} &
 \multicolumn{1}{c|}{\wpri} &
 \multicolumn{1}{c|}{UXL[1:0]} &
 \multicolumn{1}{c|}{\wpri} &
@@ -115,7 +123,7 @@ register keeps track of the processor's current operating state.
 \multicolumn{1}{c|}{\wpri} &
  \\
 \hline
-1 & SXLEN-35 & 2 & 12 & 1 & 1 & 1 & \\
+1 & 1 & 1 & SXLEN-37 & 2 & 12 & 1 & 1 & 1 & \\
 \end{tabular}
 \begin{tabular}{cWWFccccWcc}
 \\
@@ -152,6 +160,17 @@ register keeps track of the processor's current operating state.
 \label{sstatusreg}
 \end{figure*}
 
+The TLBI (read-only) bit indicates whether any asynchronous sfence.vma operations
+are still pending on the hart. A value of 0 means that no sfence.vma operations
+are pending, and a value of 1 means that sfence.vma operations are still
+pending on the hart.
+
+When the sstatus:TLBIC bit is written with 1, it triggers the hardware to check
+whether any TLB invalidate operations are pending. When all operations have
+finished, a TLB Invalidate finish interrupt is triggered
+(see Section~\ref{sipreg}). Writing 0 to the sstatus:TLBIC bit has no
+effect. Reading the sstatus:TLBIC bit always returns 0.
+
 The SPP bit indicates the privilege level at which a hart was executing before
 entering supervisor mode.  When a trap is taken, SPP is set to 0 if the trap
 originated from user mode, or 1 otherwise.  When an SRET instruction
@@ -329,8 +348,10 @@ SXLEN-bit read/write register containing interrupt enable bits.
 {\footnotesize
 \begin{center}
 \setlength{\tabcolsep}{4pt}
-\begin{tabular}{KcFcFcc}
-\instbitrange{SXLEN-1}{10} &
+\begin{tabular}{KcFcFcFcc}
+\instbitrange{SXLEN-1}{14} &
+\instbit{13} &
+\instbitrange{12}{10} &
 \instbit{9} &
 \instbitrange{8}{6} &
 \instbit{5} &
@@ -339,6 +360,8 @@ SXLEN-bit read/write register containing interrupt enable bits.
 \instbit{0} \\
 \hline
 \multicolumn{1}{|c|}{\wpri} &
+\multicolumn{1}{c|}{STLBIP} &
+\multicolumn{1}{|c|}{\wpri} &
 \multicolumn{1}{c|}{SEIP} &
 \multicolumn{1}{c|}{\wpri} &
 \multicolumn{1}{c|}{STIP} &
@@ -346,7 +369,7 @@ SXLEN-bit read/write register containing interrupt enable bits.
 \multicolumn{1}{c|}{SSIP} &
 \multicolumn{1}{c|}{\wpri} \\
 \hline
-SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
+SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
 \end{tabular}
 \end{center}
 }
@@ -359,8 +382,10 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
 {\footnotesize
 \begin{center}
 \setlength{\tabcolsep}{4pt}
-\begin{tabular}{KcFcFcc}
-\instbitrange{SXLEN-1}{10} &
+\begin{tabular}{KcFcFcFcc}
+\instbitrange{SXLEN-1}{14} &
+\instbit{13} &
+\instbitrange{12}{10} &
 \instbit{9} &
 \instbitrange{8}{6} &
 \instbit{5} &
@@ -369,6 +394,8 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
 \instbit{0} \\
 \hline
 \multicolumn{1}{|c|}{\wpri} &
+\multicolumn{1}{c|}{STLBIE} &
+\multicolumn{1}{|c|}{\wpri} &
 \multicolumn{1}{c|}{SEIE} &
 \multicolumn{1}{c|}{\wpri} &
 \multicolumn{1}{c|}{STIE} &
@@ -376,7 +403,7 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
 \multicolumn{1}{c|}{SSIE} &
 \multicolumn{1}{c|}{\wpri} \\
 \hline
-SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
+SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
 \end{tabular}
 \end{center}
 }
@@ -410,6 +437,12 @@ when the SEIE bit in the {\tt sie} register is clear.  The implementation
 should provide facilities to mask, unmask, and query the cause of external
 interrupts.
 
+A supervisor-level TLB Invalidate finish interrupt is pending if the STLBIP bit
+in the {\tt sip} register is set.  Supervisor-level TLB Invalidate finish
+interrupts are disabled when the STLBIE bit in the {\tt sie} register is clear.
+When the hart's TLB invalidate operations have finished, hardware changes the sstatus:TLBI
+bit from 1 to 0 and triggers the TLB Invalidate finish interrupt.
+
 \begin{commentary}
 The {\tt sip} and {\tt sie} registers are subsets of the {\tt mip} and {\tt
 mie} registers.  Reading any field, or writing any writable field, of {\tt
@@ -598,7 +631,9 @@ so is only guaranteed to hold supported exception codes.
   1         & 5               & Supervisor timer interrupt \\
   1         & 6--8            & {\em Reserved} \\
   1         & 9               & Supervisor external interrupt \\
-  1         & 10--15          & {\em Reserved} \\
+  1         & 10--11          & {\em Reserved} \\
+  1         & 12              & Supervisor TLBI finish interrupt \\
+  1         & 13--15          & {\em Reserved} \\
   1         & $\ge$16         & {\em Available for platform use} \\ \hline
   0         & 0               & Instruction address misaligned \\
   0         & 1               & Instruction access fault \\
@@ -884,7 +919,7 @@ provided.
 \multicolumn{1}{c|}{opcode} \\
 \hline
 7 & 5 & 5 & 3 & 5 & 7 \\
-SFENCE.VMA & asid & vaddr & PRIV & 0 & SYSTEM \\
+SFENCE.VMA & mode:ppn:asid & vaddr & LOCAL & 0 & SYSTEM \\
 \end{tabular}
 \end{center}
 
@@ -899,21 +934,70 @@ from that hart to the memory-management data structures.
 Further details on the behavior of this instruction are
 described in Section~\ref{virt-control} and Section~\ref{pmp-vmem}.
 
+SFENCE.VMA is defined as an asynchronously completing instruction, which means
+that the TLB operation is not guaranteed to be complete when the instruction retires.
+Software must check sstatus:TLBI to determine that all TLB operations are complete.
+The sstatus:TLBI bit is described in Section~\ref{sstatus}. When hardware changes
+the sstatus:TLBI bit from 1 to 0, the TLB Invalidate finish interrupt is
+triggered.
+
 \begin{commentary}
-The SFENCE.VMA is used to flush any local hardware caches related to
+The SFENCE.VMA is used to flush any local/remote hardware caches related to
 address translation.  It is specified as a fence rather than a TLB
 flush to provide cleaner semantics with respect to which instructions
 are affected by the flush operation and to support a wider variety of
 dynamic caching structures and memory-management schemes.  SFENCE.VMA
 is also used by higher privilege levels to synchronize page table
-writes and the address translation hardware.
+writes and the address translation hardware. A mode bit determines whether
+sfence.vma is broadcast on the interconnect or not.
 \end{commentary}
 
-SFENCE.VMA orders only the local hart's implicit references to the
-memory-management data structures.
+\begin{figure}[h!]
+{\footnotesize
+\begin{center}
+\begin{tabular}{c@{}E@{}K}
+\instbit{31} &
+\instbitrange{30}{9} &
+\instbitrange{8}{0} \\
+\hline
+\multicolumn{1}{|c|}{{\tt MODE}} &
+\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
+\multicolumn{1}{|c|}{{\tt ASID}} \\
+\hline
+1 & 22 & 9 \\
+\end{tabular}
+\end{center}
+}
+\vspace{-0.1in}
+\caption{RV32 sfence.vma rs2 format.}
+\label{rv32satp}
+\end{figure}
+
+\begin{figure}[h!]
+{\footnotesize
+\begin{center}
+\begin{tabular}{@{}S@{}T@{}U}
+\instbitrange{63}{60} &
+\instbitrange{59}{16} &
+\instbitrange{15}{0} \\
+\hline
+\multicolumn{1}{|c|}{{\tt MODE}} &
+\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
+\multicolumn{1}{|c|}{{\tt ASID}} \\
+\hline
+4 & 44 & 16 \\
+\end{tabular}
+\end{center}
+}
+\vspace{-0.1in}
+\caption{RV64 sfence.vma rs2 format. For MODE, only the highest bit (bit 63) is
+valid; the other bits are reserved.}
+\label{rv64satp}
+\end{figure}
 
 \begin{commentary}
-Consequently, other harts must be notified separately when the
+The highest bit of mode controls sfence.vma behavior: 1 for broadcast, 0 for local.
+If only local mode is available, other harts must be notified separately when the
 memory-management data structures have been modified.
 One approach is to use 1)
 a local data fence to ensure local writes are visible globally, then
@@ -928,8 +1012,17 @@ modified for a single address mapping (i.e., one page or superpage), {\em rs1}
 can specify a virtual address within that mapping to effect a translation
 fence for that mapping only.  Furthermore, for the common case that the
 translation data structures have only been modified for a single address-space
-identifier, {\em rs2} can specify the address space.  The behavior of
-SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
+identifier, {\em rs2} can specify the address space in {\tt satp} format,
+which includes the ASID and the root page table's PPN.
+
+\begin{commentary}
+We use the ASID and the root page table's PPN to identify the address space, and the
+format stored in rs2 is similar to {\tt satp}, described in Section~\ref{sec:satp}.
+The ASID is used by local harts, and the root page table's PPN is used by
+other TLB systems, e.g., an IOMMU.
+\end{commentary}
+
+The behavior of SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
 
 \begin{itemize}
 \item If {\em rs1}={\tt x0} and {\em rs2}={\tt x0}, the fence orders all
@@ -939,23 +1032,18 @@ SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
       all reads and writes made to any level of the page tables, but only
       for the address space identified by integer register {\em rs2}.
       Accesses to {\em global} mappings (see Section~\ref{sec:translation})
-      are not ordered.
+      are not ordered. The mode field in rs2 selects broadcast or local.
 \item If {\em rs1}$\neq${\tt x0} and {\em rs2}={\tt x0}, the fence orders
       only reads and writes made to the leaf page table entry corresponding
       to the virtual address in {\em rs1}, for all address spaces.
 \item If {\em rs1}$\neq${\tt x0} and {\em rs2}$\neq${\tt x0}, the fence
       orders only reads and writes made to the leaf page table entry
       corresponding to the virtual address in {\em rs1}, for the address
-      space identified by integer register {\em rs2}.
+      space identified by integer register {\em rs2}. The mode field in rs2
+      selects broadcast or local.
       Accesses to global mappings are not ordered.
 \end{itemize}
 
-When {\em rs2}$\neq${\tt x0}, bits SXLEN-1:ASIDMAX of the value held in {\em
-rs2} are reserved for future use and should be zeroed by software and ignored
-by current implementations.  Furthermore, if ASIDLEN~$<$~ASIDMAX, the
-implementation shall ignore bits ASIDMAX-1:ASIDLEN of the value held in {\em
-rs2}.
-
 \begin{commentary}
 Simpler implementations can ignore the virtual address in {\em rs1} and
 the ASID value in {\em rs2} and always perform a global fence.
@@ -994,7 +1082,7 @@ can execute the same SFENCE.VMA instruction while a different ASID is loaded
 into {\tt satp}, provided the next time {\tt satp} is loaded with the recycled
 ASID, it is simultaneously loaded with the new page table.
 
-\item If the implementation does not provide ASIDs, or software chooses to
+\item If the implementation does not provide ASIDs and PPNs, or software chooses to
 always use ASID 0, then after every {\tt satp} write, software should execute
 SFENCE.VMA with {\em rs1}={\tt x0}.  In the common case that no global
 translations have been modified, {\em rs2} should be set to a register other than
@@ -1003,13 +1091,14 @@ not flushed.
 
 \item If software modifies a non-leaf PTE, it should execute SFENCE.VMA with
 {\em rs1}={\tt x0}.  If any PTE along the traversal path had its G bit set,
-{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID for
-which the translation is being modified.
+{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID and
+root page table's PPN for which the translation is being modified.
 
 \item If software modifies a leaf PTE, it should execute SFENCE.VMA with {\em
 rs1} set to a virtual address within the page.  If any PTE along the traversal
 path had its G bit set, {\em rs2} must be {\tt x0}; otherwise, {\em rs2}
-should be set to the ASID for which the translation is being modified.
+should be set to the ASID and root page table's PPN for which the translation
+is being modified.
 
 \item For the special cases of increasing the permissions on a leaf PTE and
 changing an invalid PTE to a valid leaf, software may choose to execute
-- 
2.7.4


* Re: [tech-privileged] [RFC PATCH V1] riscv-privileged: Add broadcast mode to sfence.vma
  2019-09-19 12:35 [RFC PATCH V1] riscv-privileged: Add broadcast mode to sfence.vma guoren
@ 2019-09-19 16:04 ` " Andrew Waterman
  2019-09-20  0:13   ` Guo Ren
  2019-09-20  2:27   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 4+ messages in thread
From: Andrew Waterman @ 2019-09-19 16:04 UTC (permalink / raw)
  To: Guo Ren
  Cc: julien.thierry, catalin.marinas, palmer, will.deacon,
	Atish.Patra, julien.grall, gary, linux-riscv, kvmarm,
	jean-philippe, linux-csky, rppt, Guo Ren, benh, tech-privileged,
	marc.zyngier, linux-arm-kernel, feiteng_li, Anup.Patel,
	linux-kernel, iommu, dwmw2

This needs to be discussed and debated at length; proposing edits to the
spec at this stage is putting the cart before the horse!

We shouldn’t change the definition of the existing SFENCE.VMA instruction
to accomplish this. It’s also not abundantly clear to me that this should
be an instruction: TLB shootdown looks more like MMIO.

On Thu, Sep 19, 2019 at 5:36 AM Guo Ren <guoren@kernel.org> wrote:

> From: Guo Ren <ren_guo@c-sky.com>
>
> The patch is for https://github.com/riscv/riscv-isa-manual
>
> The proposal has been talked in LPC-2019 RISC-V MC ref [1]. Here is the
> formal patch.
>
> Introduction
> ============
>
> Using the Hardware TLB broadcast invalidation instruction to maintain the
> system TLB is a good choice and it'll simplify the system software design.
> The proposal hopes to add a broadcast mode to the sfence.vma in the
> riscv-privilege specification. To support the sfence.vma broadcast mode,
> there are two modification introduced below:
>
>  1) Add PGD.PPN (root page table's PPN) as the unique identifier of the
>     address space in addition to asid/vmid. Compared to the dynamically
>     changed asid/vmid, PGD.PPN is fixed throughout the address space life
>     cycle. This feature enables uniform address space identification
>     between different TLB systems (actually, it's difficult to unify the
>     asid/vmid between the CPU system and the IOMMU system, because their
>     mechanisms are different)
>
>  2) Modify the definition of the sfence.vma instruction from synchronous
>     mode to asynchronous mode, which means that the completion of the TLB
>     operation is not guaranteed when the sfence.vma instruction retires.
>     It needs to be completed by checking the flag bit on the hart. The
>     sfence.vma request finish can notify the software by generating an
>     interrupt. This function alleviates the large delay of TLB invalidation
>     in the PCI ATS system.
>
> Add S1/S2.PGD.PPN for ASID/VMID
> ===============================
>
> PGD is global directory (defined in linux) and PPN is page physical number
> (defined in riscv-spec). PGD.PNN corresponds to the root page table pointer
> of the address space, i.e. mm->pgd (linux concept).
>
> In CPU/IOMMU TLB, we use asid/vmid to distinguish the address space of
> process or virtual machine. Due to the limitation of id encoding, it can
> only represent a part(window) of the address space. S1/S2.PGD.PPN are the
> root page table's PPNs of the address spaces and S1/S2.PGD.PPN are the
> unique identifier of the address spaces.
>
> For the CPU SMP system, you can use context switch to perform the necessary
> software mechanism to ensure that the asid/vmid on all harts is consistent
> (please refer to the arm64 asid mechanism). In this way, the TLB broadcast
> invalidation instruction can determine the address space processed on all
> harts by asid/vmid.
>
> Different from the CPU SMP system, there is no context switch for the
> DMA-IOMMU system, so the unification with the CPU asid/vmid cannot be
> guaranteed. So we need a unique identifier for the address space to
> establish a communication bridge between the TLBs of different systems.
>
> That is PGD.PPN (for virtualization scenarios: S1/S2.PGD.PPN)
>
> current:
>  sfence.vma  rs1 = vaddr, rs2 = asid
>  hfence.vvma rs1 = vaddr, rs2 = asid
>  hfence.gvma rs1 = gaddr, rs2 = vmid
>
> proposed:
>  sfence.vma  rs1 = vaddr, rs2 = mode:ppn:asid
>  hfence.vvma rs1 = vaddr, rs2 = mode:ppn:asid
>  hfence.gvma rs1 = gaddr, rs2 = mode:ppn:vmid
>
>  mode      - broadcast | local
>  ppn       - the PPN of the address space of the root page table
>  vmid/asid - the window identifier of the address space
>
> At the Linux Plumber Conference 2019 RISCV-MC, ref:[1], we've showed two
> IOMMU examples to explain how it work with hardware.
>
> 1) In a lightweight IOMMU system (up to 64 address spaces), the hardware
>    could directly convert PGD.PPN into DID (IOMMU ASID)
>
> 2) For the PCI ATS scenario, its IO ASID/VMID encoding space can support
>    a very large number of address spaces. We use two reverse mapping
>    tables to let the hardware translate S1/S2.PGD.PPN into IO ASID/VMID.
>
> ASYNC BROADCAST SFENCE.VMA
> ===========================
>
> To support the high latency broadcast sfence.vma operation in the PCI ATS
> usage scenario, we modify the sfence.vma from synchronous mode to
> asynchronous mode. (For simpler implementation, if hardware only implement
> synchronous mode and software still work in asynchronous mode)
>
> To implement the asynchronous mode, 3 features are added:
>  1) sstatus:TLBI
>     A "status bit - TLBI" is added to the sstatus register. The TLBI status
>     bit indicates if there are still outstanding sfence.vma requests on the
>     current hart.
>     Value:
>       1: sfence.vma requests are not completed.
>       0: all sfece.vma requests completed, request queue is empty.
>
>  2) sstatus:TLBIC
>     A "control bits - TLBIC" is added to sstatus register. The TLBIC
> control
>     bits are controlled by software.
>     "Write 1" will trigger the current hart check to see if there are still
>     outstanding sfence.vma requests. If there are unfinished requests, an
>     interrupt will be generated when the request is completed, notifying
> the
>     software that all of the current sfence.vma requests have been
> completed.
>     "Write 0" will cause nothing.
>
>  3) supervisor interrupt registers (sip & sie): TLBI finish interrupt
>     A per-hart interrupt is added to the supervisor interrupt registers.
>     When all sfence.vma requests are completed and sstatus:TLBIC has been
>     triggered, the hart receives a TLBI finish interrupt. It is defined in
>     sip & sie in the same way as the timer, software and external interrupts.
>
> Pseudocode:
>
> flush_tlb_page(vma, addr) {
>     asid = cpu_asid(vma->vm_mm);
>     ppn = PFN_DOWN(vma->vm_mm->pgd);
>
>     sfence.vma (addr, 1|PPN_OFFSET(ppn)|asid); //1. start request
>
>     while(sstatus:TLBI) if (time_out() > 1ms) break; //2. loop check
>
>     while (sstatus:TLBI) {
>         ...
>         set sstatus:TLBIC;
>         wait_TLBI_finish_interrupt(); //3. wait irq, io_schedule
>     }
> }
>
> Here we check at 2 levels:
>  1) Poll sstatus:TLBI in a loop; the CPU can still respond to interrupts.
>  2) Set sstatus:TLBIC and wait for the irq; the CPU schedules out to run
>     other tasks.
>
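
The two-level wait described above can be sketched as a small software model. This is only an illustration under stated assumptions: `struct hart_model` and every function below are hypothetical stand-ins for the proposed sstatus:TLBI, sstatus:TLBIC and STLBIP behavior, and completions are simulated by calling `complete_one()` instead of arriving from real hardware.

```c
#include <stdbool.h>

/* Hypothetical model of the proposed per-hart state. */
struct hart_model {
    int  outstanding;  /* in-flight sfence.vma requests; TLBI reads 1 while > 0 */
    bool tlbic_armed;  /* set by writing 1 to sstatus:TLBIC */
    bool stlbip;       /* TLBI finish interrupt pending */
};

enum wait_result { DONE_BY_POLLING, DONE_BY_INTERRUPT };

static bool read_tlbi(const struct hart_model *h)
{
    return h->outstanding > 0;
}

/* Writing 1 only arms the interrupt if requests are still outstanding. */
static void write_tlbic(struct hart_model *h)
{
    if (h->outstanding > 0)
        h->tlbic_armed = true;
}

/* Simulate one completion; raise the interrupt when the last one
 * arrives and TLBIC was armed. */
static void complete_one(struct hart_model *h)
{
    if (h->outstanding > 0 && --h->outstanding == 0 && h->tlbic_armed) {
        h->stlbip = true;
        h->tlbic_armed = false;
    }
}

static enum wait_result wait_for_tlbi(struct hart_model *h, int spin_budget)
{
    /* Level 1: bounded polling of TLBI (one simulated completion per poll). */
    for (int i = 0; i < spin_budget && read_tlbi(h); i++)
        complete_one(h);
    if (!read_tlbi(h))
        return DONE_BY_POLLING;

    /* Level 2: arm TLBIC and wait for the finish interrupt. */
    while (read_tlbi(h)) {
        write_tlbic(h);
        while (!h->stlbip)
            complete_one(h);   /* stands in for sleeping until the irq */
        h->stlbip = false;     /* acknowledge the interrupt */
    }
    return DONE_BY_INTERRUPT;
}
```

In a kernel the inner loop of level 2 would be a sleep that lets other tasks run; the model only shows the ordering of the TLBI check, the TLBIC write and the interrupt.
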
> ACE-DVM Example
> ===============
>
> Honestly, "broadcasting addr, asid, vmid, S1/S2.PGD.PPN to interconnects"
> and "ASYNC SFENCE.VMA" could be implemented with the ACE-DVM protocol
> (ref [2]).
>
> There are 3 types of transactions in DVM:
>
>  - DVM operation
>    Send all information to the interconnect, including addr, asid,
>    S1.PGD.PPN, vmid, S2.PGD.PPN.
>
>  - DVM synchronization
>    Check that all DVM operations have been completed. If not, a state
>    machine waits for the DVM complete transactions.
>
>  - DVM complete
>    Return transaction from components, e.g. an IOMMU. Once the hart has
>    received all DVM completes triggered by its sfence.vma instructions and
>    "sstatus:TLBIC" has been set, a TLBI finish interrupt is triggered.
>
> (Actually, we do not need to implement the above functions strictly
>  according to the ACE specification :P )
>
>  1: https://www.linuxplumbersconf.org/event/4/contributions/307/
>  2: AMBA AXI and ACE Protocol Specification - Distributed Virtual Memory
>     Transactions
>
> Signed-off-by: Guo Ren <ren_guo@c-sky.com>
> Reviewed-by: Li Feiteng <feiteng_li@c-sky.com>
> ---
>  src/hypervisor.tex |  43 ++++++++-------
>  src/supervisor.tex | 155 +++++++++++++++++++++++++++++++++++++++++------------
>  2 files changed, 143 insertions(+), 55 deletions(-)
>
> diff --git a/src/hypervisor.tex b/src/hypervisor.tex
> index 47b90b2..3718819 100644
> --- a/src/hypervisor.tex
> +++ b/src/hypervisor.tex
> @@ -1094,15 +1094,15 @@ The hypervisor extension adds two new privileged fence instructions.
>  \multicolumn{1}{c|}{opcode} \\
>  \hline
>  7 & 5 & 5 & 3 & 5 & 7 \\
> -HFENCE.GVMA & vmid & gaddr & PRIV & 0 & SYSTEM \\
> -HFENCE.VVMA & asid & vaddr & PRIV & 0 & SYSTEM \\
> +HFENCE.GVMA & mode:ppn:vmid & gaddr & PRIV & 0 & SYSTEM \\
> +HFENCE.VVMA & mode:ppn:asid & vaddr & PRIV & 0 & SYSTEM \\
>  \end{tabular}
>  \end{center}
>
>  The hypervisor memory-management fence instructions, HFENCE.GVMA and
>  HFENCE.VVMA, are valid only in HS-mode when {\tt mstatus}.TVM=0, or in M-mode
>  (irrespective of {\tt mstatus}.TVM).
> -These instructions perform a function similar to SFENCE.VMA
> +These instructions perform a function similar to SFENCE.VMA (broadcast/local)
>  (Section~\ref{sec:sfence.vma}), except applying to the guest-physical
>  memory-management data structures controlled by CSR {\tt hgatp} (HFENCE.GVMA)
>  or the VS-level memory-management data structures controlled by CSR {\tt vsatp}
> @@ -1136,11 +1136,10 @@ An HFENCE.VVMA instruction applies only to a single virtual machine, identified
>  by the setting of {\tt hgatp}.VMID when HFENCE.VVMA executes.
>  \end{commentary}
>
> -When {\em rs2}$\neq${\tt x0}, bits XLEN-1:ASIDMAX of the value held in {\em
> -rs2} are reserved for future use and should be zeroed by software and ignored
> -by current implementations.
> -Furthermore, if ASIDLEN~$<$~ASIDMAX, the implementation shall ignore bits
> -ASIDMAX-1:ASIDLEN of the value held in {\em rs2}.
> +When {\em rs2}$\neq${\tt x0}, its bits carry three fields: mode, ppn, asid.
> +1) mode controls whether HFENCE.VVMA is broadcast or not.
> +2) ppn is the root page table's PPN of the asid address space.
> +3) asid is the identifier of the process in the virtual machine.
>
>  \begin{commentary}
>  Simpler implementations of HFENCE.VVMA can ignore the guest virtual address in
> @@ -1168,11 +1167,10 @@ physical addresses in PMP address registers (Section~\ref{sec:pmp}) and in page
>  table entries (Sections \ref{sec:sv32}, \ref{sec:sv39}, and~\ref{sec:sv48}).
>  \end{commentary}
>
> -When {\em rs2}$\neq${\tt x0}, bits XLEN-1:VMIDMAX of the value held in {\em
> -rs2} are reserved for future use and should be zeroed by software and ignored
> -by current implementations.
> -Furthermore, if VMIDLEN~$<$~VMIDMAX, the implementation shall ignore bits
> -VMIDMAX-1:VMIDLEN of the value held in {\em rs2}.
> +When {\em rs2}$\neq${\tt x0}, its bits carry three fields: mode, ppn, vmid.
> +1) mode controls whether HFENCE.GVMA is broadcast or not.
> +2) ppn is the root page table's PPN of the vmid address space.
> +3) vmid is the identifier of the virtual machine.
>
>  \begin{commentary}
>  Simpler implementations of HFENCE.GVMA can ignore the guest physical address in
> @@ -1567,21 +1565,22 @@ register.
>  \subsection{Memory-Management Fences}
>
>  The behavior of the SFENCE.VMA instruction is affected by the current
> -virtualization mode V.  When V=0, the virtual-address argument is an HS-level
> -virtual address, and the ASID argument is an HS-level ASID.
> +virtualization mode V.  When V=0, the rs1 argument is an HS-level
> +virtual address, and the rs2 argument is an HS-level ASID and root page table's PPN.
>  The instruction orders stores only to HS-level address-translation structures
>  with subsequent HS-level address translations.
>
> -When V=1, the virtual-address argument to SFENCE.VMA is a guest virtual
> -address within the current virtual machine, and the ASID argument is a VS-level
> -ASID within the current virtual machine.
> +When V=1, the rs1 argument to SFENCE.VMA is a guest virtual
> +address within the current virtual machine, and the rs2 argument is a VS-level
> +ASID and root page table's PPN within the current virtual machine.
>  The current virtual machine is identified by the VMID field of CSR {\tt hgatp},
> -and the effective ASID can be considered to be the combination of this VMID
> -with the VS-level ASID.
> +and the effective ASID and root page table's PPN can be considered to be the
> +combination of this VMID and root page table's PPN with the VS-level ASID and
> +root page table's PPN.
>  The SFENCE.VMA instruction orders stores only to the VS-level
>  address-translation structures with subsequent VS-level address translations
> -for the same virtual machine, i.e., only when {\tt hgatp}.VMID is the same as
> -when the SFENCE.VMA executed.
> +for the same virtual machine, i.e., only when {\tt hgatp}.VMID and {\tt hgatp}.PPN are
> +the same as when the SFENCE.VMA executed.
>
>  Hypervisor instructions HFENCE.GVMA and HFENCE.VVMA provide additional
>  memory-management fences to complement SFENCE.VMA.
> diff --git a/src/supervisor.tex b/src/supervisor.tex
> index ba3ced5..2877b7a 100644
> --- a/src/supervisor.tex
> +++ b/src/supervisor.tex
> @@ -47,10 +47,12 @@ register keeps track of the processor's current operating state.
>  \begin{center}
>  \setlength{\tabcolsep}{4pt}
>  \scalebox{0.95}{
> -\begin{tabular}{cWcccccWccccWcc}
> +\begin{tabular}{cccWcccccWccccWcc}
>  \\
>  \instbit{31} &
> -\instbitrange{30}{20} &
> +\instbit{30} &
> +\instbit{29} &
> +\instbitrange{28}{20} &
>  \instbit{19} &
>  \instbit{18} &
>  \instbit{17} &
> @@ -66,6 +68,8 @@ register keeps track of the processor's current operating state.
>  \instbit{0} \\
>  \hline
>  \multicolumn{1}{|c|}{SD} &
> +\multicolumn{1}{|c|}{TLBI} &
> +\multicolumn{1}{|c|}{TLBIC} &
>  \multicolumn{1}{c|}{\wpri} &
>  \multicolumn{1}{c|}{MXR} &
>  \multicolumn{1}{c|}{SUM} &
> @@ -82,7 +86,7 @@ register keeps track of the processor's current operating state.
>  \multicolumn{1}{c|}{\wpri}
>  \\
>  \hline
> -1 & 11 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1 \\
> +1 & 1 & 1 & 10 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1 \\
>  \end{tabular}}
>  \end{center}
>  }
> @@ -95,10 +99,12 @@ register keeps track of the processor's current operating state.
>  {\footnotesize
>  \begin{center}
>  \setlength{\tabcolsep}{4pt}
> -\begin{tabular}{cMFScccc}
> +\begin{tabular}{cccMFScccc}
>  \\
>  \instbit{SXLEN-1} &
> -\instbitrange{SXLEN-2}{34} &
> +\instbit{SXLEN-2} &
> +\instbit{SXLEN-3} &
> +\instbitrange{SXLEN-4}{34} &
>  \instbitrange{33}{32} &
>  \instbitrange{31}{20} &
>  \instbit{19} &
> @@ -107,6 +113,8 @@ register keeps track of the processor's current operating state.
>   \\
>  \hline
>  \multicolumn{1}{|c|}{SD} &
> +\multicolumn{1}{|c|}{TLBI} &
> +\multicolumn{1}{|c|}{TLBIC} &
>  \multicolumn{1}{c|}{\wpri} &
>  \multicolumn{1}{c|}{UXL[1:0]} &
>  \multicolumn{1}{c|}{\wpri} &
> @@ -115,7 +123,7 @@ register keeps track of the processor's current operating state.
>  \multicolumn{1}{c|}{\wpri} &
>   \\
>  \hline
> -1 & SXLEN-35 & 2 & 12 & 1 & 1 & 1 & \\
> +1 & 1 & 1 & SXLEN-37 & 2 & 12 & 1 & 1 & 1 & \\
>  \end{tabular}
>  \begin{tabular}{cWWFccccWcc}
>  \\
> @@ -152,6 +160,17 @@ register keeps track of the processor's current operating state.
>  \label{sstatusreg}
>  \end{figure*}
>
> +The TLBI (read-only) bit indicates whether any async sfence.vma operations are
> +still pending on the hart. The value 0 means that there are no sfence.vma
> +operations pending and the value 1 means that there are still sfence.vma operations
> +pending on the hart.
> +
> +When the sstatus:TLBIC bit is written 1, it triggers the hardware to check if
> +there are any TLB invalidate operations pending. When all operations are
> +finished, a TLB Invalidate finish interrupt will be triggered
> +(see Section~\ref{sipreg}). When the sstatus:TLBIC bit is written 0, it has
> +no effect. Reading the sstatus:TLBIC bit will always return 0.
> +
>  The SPP bit indicates the privilege level at which a hart was executing before
>  entering supervisor mode.  When a trap is taken, SPP is set to 0 if the trap
>  originated from user mode, or 1 otherwise.  When an SRET instruction
> @@ -329,8 +348,10 @@ SXLEN-bit read/write register containing interrupt enable bits.
>  {\footnotesize
>  \begin{center}
>  \setlength{\tabcolsep}{4pt}
> -\begin{tabular}{KcFcFcc}
> -\instbitrange{SXLEN-1}{10} &
> +\begin{tabular}{KcFcFcFcc}
> +\instbitrange{SXLEN-1}{14} &
> +\instbit{13} &
> +\instbitrange{12}{10} &
>  \instbit{9} &
>  \instbitrange{8}{6} &
>  \instbit{5} &
> @@ -339,6 +360,8 @@ SXLEN-bit read/write register containing interrupt enable bits.
>  \instbit{0} \\
>  \hline
>  \multicolumn{1}{|c|}{\wpri} &
> +\multicolumn{1}{c|}{STLBIP} &
> +\multicolumn{1}{|c|}{\wpri} &
>  \multicolumn{1}{c|}{SEIP} &
>  \multicolumn{1}{c|}{\wpri} &
>  \multicolumn{1}{c|}{STIP} &
> @@ -346,7 +369,7 @@ SXLEN-bit read/write register containing interrupt enable bits.
>  \multicolumn{1}{c|}{SSIP} &
>  \multicolumn{1}{c|}{\wpri} \\
>  \hline
> -SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
> +SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
>  \end{tabular}
>  \end{center}
>  }
> @@ -359,8 +382,10 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>  {\footnotesize
>  \begin{center}
>  \setlength{\tabcolsep}{4pt}
> -\begin{tabular}{KcFcFcc}
> -\instbitrange{SXLEN-1}{10} &
> +\begin{tabular}{KcFcFcFcc}
> +\instbitrange{SXLEN-1}{14} &
> +\instbit{13} &
> +\instbitrange{12}{10} &
>  \instbit{9} &
>  \instbitrange{8}{6} &
>  \instbit{5} &
> @@ -369,6 +394,8 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>  \instbit{0} \\
>  \hline
>  \multicolumn{1}{|c|}{\wpri} &
> +\multicolumn{1}{c|}{STLBIE} &
> +\multicolumn{1}{|c|}{\wpri} &
>  \multicolumn{1}{c|}{SEIE} &
>  \multicolumn{1}{c|}{\wpri} &
>  \multicolumn{1}{c|}{STIE} &
> @@ -376,7 +403,7 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>  \multicolumn{1}{c|}{SSIE} &
>  \multicolumn{1}{c|}{\wpri} \\
>  \hline
> -SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
> +SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
>  \end{tabular}
>  \end{center}
>  }
> @@ -410,6 +437,12 @@ when the SEIE bit in the {\tt sie} register is clear.  The implementation
>  should provide facilities to mask, unmask, and query the cause of external
>  interrupts.
>
> +A supervisor-level TLB Invalidate finish interrupt is pending if the STLBIP bit
> +in the {\tt sip} register is set.  Supervisor-level TLB Invalidate finish
> +interrupts are disabled when the STLBIE bit in the {\tt sie} register is clear.
> +When the hart's TLB invalidate operations are finished, hardware changes the sstatus:TLBI
> +bit from 1 to 0 and triggers the TLB Invalidate finish interrupt.
> +
>  \begin{commentary}
>  The {\tt sip} and {\tt sie} registers are subsets of the {\tt mip} and {\tt
>  mie} registers.  Reading any field, or writing any writable field, of {\tt
> @@ -598,7 +631,9 @@ so is only guaranteed to hold supported exception codes.
>    1         & 5               & Supervisor timer interrupt \\
>    1         & 6--8            & {\em Reserved} \\
>    1         & 9               & Supervisor external interrupt \\
> -  1         & 10--15          & {\em Reserved} \\
> +  1         & 10--11          & {\em Reserved} \\
> +  1         & 12              & Supervisor TLBI finish interrupt \\
> +  1         & 13--15          & {\em Reserved} \\
>    1         & $\ge$16         & {\em Available for platform use} \\ \hline
>    0         & 0               & Instruction address misaligned \\
>    0         & 1               & Instruction access fault \\
> @@ -884,7 +919,7 @@ provided.
>  \multicolumn{1}{c|}{opcode} \\
>  \hline
>  7 & 5 & 5 & 3 & 5 & 7 \\
> -SFENCE.VMA & asid & vaddr & PRIV & 0 & SYSTEM \\
> +SFENCE.VMA & mode:ppn:asid & vaddr & LOCAL & 0 & SYSTEM \\
>  \end{tabular}
>  \end{center}
>
> @@ -899,21 +934,70 @@ from that hart to the memory-management data structures.
>  Further details on the behavior of this instruction are
>  described in Section~\ref{virt-control} and Section~\ref{pmp-vmem}.
>
> +SFENCE.VMA is defined as an asynchronous completion instruction, which means
> +that the TLB operation is not guaranteed to complete when the instruction retires.
> +Software needs to check sstatus:TLBI to determine that all TLB operations are complete.
> +The sstatus:TLBI bit is described in Section~\ref{sstatus}. When hardware changes the
> +sstatus:TLBI bit from 1 to 0, the TLB Invalidate finish interrupt will be
> +triggered.
> +
>  \begin{commentary}
> -The SFENCE.VMA is used to flush any local hardware caches related to
> +The SFENCE.VMA is used to flush any local/remote hardware caches related to
>  address translation.  It is specified as a fence rather than a TLB
>  flush to provide cleaner semantics with respect to which instructions
>  are affected by the flush operation and to support a wider variety of
>  dynamic caching structures and memory-management schemes.  SFENCE.VMA
>  is also used by higher privilege levels to synchronize page table
> -writes and the address translation hardware.
> +writes and the address translation hardware. A mode bit determines whether
> +sfence.vma is broadcast on the interconnect or not.
>  \end{commentary}
>
> -SFENCE.VMA orders only the local hart's implicit references to the
> -memory-management data structures.
> +\begin{figure}[h!]
> +{\footnotesize
> +\begin{center}
> +\begin{tabular}{c@{}E@{}K}
> +\instbit{31} &
> +\instbitrange{30}{9} &
> +\instbitrange{8}{0} \\
> +\hline
> +\multicolumn{1}{|c|}{{\tt MODE}} &
> +\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
> +\multicolumn{1}{|c|}{{\tt ASID}} \\
> +\hline
> +1 & 22 & 9 \\
> +\end{tabular}
> +\end{center}
> +}
> +\vspace{-0.1in}
> +\caption{RV32 sfence.vma rs2 format.}
> +\label{rv32satp}
> +\end{figure}
> +
> +\begin{figure}[h!]
> +{\footnotesize
> +\begin{center}
> +\begin{tabular}{@{}S@{}T@{}U}
> +\instbitrange{63}{60} &
> +\instbitrange{59}{16} &
> +\instbitrange{15}{0} \\
> +\hline
> +\multicolumn{1}{|c|}{{\tt MODE}} &
> +\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
> +\multicolumn{1}{|c|}{{\tt ASID}} \\
> +\hline
> +4 & 44 & 16 \\
> +\end{tabular}
> +\end{center}
> +}
> +\vspace{-0.1in}
> +\caption{RV64 sfence.vma rs2 format. Of the MODE field, only the highest bit (bit 63) is
> +valid; the others are reserved.}
> +\label{rv64satp}
> +\end{figure}
>
>  \begin{commentary}
> -Consequently, other harts must be notified separately when the
> +The highest bit of mode controls sfence.vma behavior: 1 is broadcast, 0 is local.
> +If only local mode is available, other harts must be notified separately when the
>  memory-management data structures have been modified.
>  One approach is to use 1)
>  a local data fence to ensure local writes are visible globally, then
> @@ -928,8 +1012,17 @@ modified for a single address mapping (i.e., one page or superpage), {\em rs1}
>  can specify a virtual address within that mapping to effect a translation
>  fence for that mapping only.  Furthermore, for the common case that the
>  translation data structures have only been modified for a single address-space
> -identifier, {\em rs2} can specify the address space.  The behavior of
> -SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
> +identifier, {\em rs2} can specify the address space in {\tt satp} format,
> +which includes the ASID and the root page table's PPN.
> +
> +\begin{commentary}
> +We use the ASID and the root page table's PPN to identify the address space, and the format
> +stored in rs2 is similar to {\tt satp}, described in Section~\ref{sec:satp}.
> +The ASID is used by local harts and the root page table's PPN of the address space is used by
> +other TLB systems, e.g., an IOMMU.
> +\end{commentary}
> +
> +The behavior of SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
>
>  \begin{itemize}
>  \item If {\em rs1}={\tt x0} and {\em rs2}={\tt x0}, the fence orders all
> @@ -939,23 +1032,18 @@ SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
>        all reads and writes made to any level of the page tables, but only
>        for the address space identified by integer register {\em rs2}.
>        Accesses to {\em global} mappings (see Section~\ref{sec:translation})
> -      are not ordered.
> +      are not ordered. The mode field in rs2 selects broadcast or local.
>  \item If {\em rs1}$\neq${\tt x0} and {\em rs2}={\tt x0}, the fence orders
>        only reads and writes made to the leaf page table entry corresponding
>        to the virtual address in {\em rs1}, for all address spaces.
>  \item If {\em rs1}$\neq${\tt x0} and {\em rs2}$\neq${\tt x0}, the fence
>        orders only reads and writes made to the leaf page table entry
>        corresponding to the virtual address in {\em rs1}, for the address
> -      space identified by integer register {\em rs2}.
> +      space identified by integer register {\em rs2}. The mode field in rs2
> +      selects broadcast or local.
>        Accesses to global mappings are not ordered.
>  \end{itemize}
>
> -When {\em rs2}$\neq${\tt x0}, bits SXLEN-1:ASIDMAX of the value held in {\em
> -rs2} are reserved for future use and should be zeroed by software and ignored
> -by current implementations.  Furthermore, if ASIDLEN~$<$~ASIDMAX, the
> -implementation shall ignore bits ASIDMAX-1:ASIDLEN of the value held in {\em
> -rs2}.
> -
>  \begin{commentary}
>  Simpler implementations can ignore the virtual address in {\em rs1} and
>  the ASID value in {\em rs2} and always perform a global fence.
> @@ -994,7 +1082,7 @@ can execute the same SFENCE.VMA instruction while a different ASID is loaded
>  into {\tt satp}, provided the next time {\tt satp} is loaded with the recycled
>  ASID, it is simultaneously loaded with the new page table.
>
> -\item If the implementation does not provide ASIDs, or software chooses to
> +\item If the implementation does not provide ASIDs and PPNs, or software chooses to
>  always use ASID 0, then after every {\tt satp} write, software should execute
>  SFENCE.VMA with {\em rs1}={\tt x0}.  In the common case that no global
>  translations have been modified, {\em rs2} should be set to a register other than
> @@ -1003,13 +1091,14 @@ not flushed.
>
>  \item If software modifies a non-leaf PTE, it should execute SFENCE.VMA with
>  {\em rs1}={\tt x0}.  If any PTE along the traversal path had its G bit set,
> -{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID for
> -which the translation is being modified.
> +{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID and
> +root page table's PPN for which the translation is being modified.
>
>  \item If software modifies a leaf PTE, it should execute SFENCE.VMA with {\em
>  rs1} set to a virtual address within the page.  If any PTE along the traversal
>  path had its G bit set, {\em rs2} must be {\tt x0}; otherwise, {\em rs2}
> -should be set to the ASID for which the translation is being modified.
> +should be set to the ASID and root page table's PPN for which the translation
> +is being modified.
>
>  \item For the special cases of increasing the permissions on a leaf PTE and
>  changing an invalid PTE to a valid leaf, software may choose to execute
> --
> 2.7.4
>
>
>
>

[-- Attachment #1.2: Type: text/html, Size: 31562 bytes --]

<div><div dir="auto">This needs to be discussed and debated at length; proposing edits to the spec at this stage is putting the cart before the horse!</div></div><div dir="auto"><br></div><div dir="auto">We shouldn’t change the definition of the existing SFENCE.VMA instruction to accomplish this. It’s also not abundantly clear to me that this should be an instruction: TLB shootdown looks more like MMIO.</div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 19, 2019 at 5:36 AM Guo Ren &lt;<a href="mailto:guoren@kernel.org">guoren@kernel.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">From: Guo Ren &lt;<a href="mailto:ren_guo@c-sky.com" target="_blank">ren_guo@c-sky.com</a>&gt;<br>
<br>
The patch is for <a href="https://github.com/riscv/riscv-isa-manual" rel="noreferrer" target="_blank">https://github.com/riscv/riscv-isa-manual</a><br>
<br>
The proposal has been talked in LPC-2019 RISC-V MC ref [1]. Here is the<br>
formal patch.<br>
<br>
Introduction<br>
============<br>
<br>
Using the Hardware TLB broadcast invalidation instruction to maintain the<br>
system TLB is a good choice and it&#39;ll simplify the system software design.<br>
The proposal hopes to add a broadcast mode to the sfence.vma in the<br>
riscv-privilege specification. To support the sfence.vma broadcast mode,<br>
there are two modification introduced below:<br>
<br>
 1) Add PGD.PPN (root page table&#39;s PPN) as the unique identifier of the<br>
    address space in addition to asid/vmid. Compared to the dynamically<br>
    changed asid/vmid, PGD.PPN is fixed throughout the address space life<br>
    cycle. This feature enables uniform address space identification<br>
    between different TLB systems (actually, it&#39;s difficult to unify the<br>
    asid/vmid between the CPU system and the IOMMU system, because their<br>
    mechanisms are different)<br>
<br>
 2) Modify the definition of the sfence.vma instruction from synchronous<br>
    mode to asynchronous mode, which means that the completion of the TLB<br>
    operation is not guaranteed when the sfence.vma instruction retires.<br>
    It needs to be completed by checking the flag bit on the hart. The<br>
    sfence.vma request finish can notify the software by generating an<br>
    interrupt. This function alleviates the large delay of TLB invalidation<br>
    in the PCI ATS system.<br>
<br>
Add S1/S2.PGD.PPN for ASID/VMID<br>
===============================<br>
<br>
PGD is global directory (defined in linux) and PPN is page physical number<br>
(defined in riscv-spec). PGD.PNN corresponds to the root page table pointer<br>
of the address space, i.e. mm-&gt;pgd (linux concept).<br>
<br>
In CPU/IOMMU TLB, we use asid/vmid to distinguish the address space of<br>
process or virtual machine. Due to the limitation of id encoding, it can<br>
only represent a part(window) of the address space. S1/S2.PGD.PPN are the<br>
root page table&#39;s PPNs of the address spaces and S1/S2.PGD.PPN are the<br>
unique identifier of the address spaces.<br>
<br>
For the CPU SMP system, you can use context switch to perform the necessary<br>
software mechanism to ensure that the asid/vmid on all harts is consistent<br>
(please refer to the arm64 asid mechanism). In this way, the TLB broadcast<br>
invalidation instruction can determine the address space processed on all<br>
harts by asid/vmid.<br>
<br>
Different from the CPU SMP system, there is no context switch for the<br>
DMA-IOMMU system, so the unification with the CPU asid/vmid cannot be<br>
guaranteed. So we need a unique identifier for the address space to<br>
establish a communication bridge between the TLBs of different systems.<br>
<br>
That is PGD.PPN (for virtualization scenarios: S1/S2.PGD.PPN)<br>
<br>
current:<br>
 sfence.vma  rs1 = vaddr, rs2 = asid<br>
 hfence.vvma rs1 = vaddr, rs2 = asid<br>
 hfence.gvma rs1 = gaddr, rs2 = vmid<br>
<br>
proposed:<br>
 sfence.vma  rs1 = vaddr, rs2 = mode:ppn:asid<br>
 hfence.vvma rs1 = vaddr, rs2 = mode:ppn:asid<br>
 hfence.gvma rs1 = gaddr, rs2 = mode:ppn:vmid<br>
<br>
 mode      - broadcast | local<br>
 ppn       - the PPN of the address space of the root page table<br>
 vmid/asid - the window identifier of the address space<br>
<br>
At the Linux Plumber Conference 2019 RISCV-MC, ref:[1], we&#39;ve showed two<br>
IOMMU examples to explain how it work with hardware.<br>
<br>
1) In a lightweight IOMMU system (up to 64 address spaces), the hardware<br>
   could directly convert PGD.PPN into DID (IOMMU ASID)<br>
<br>
2) For the PCI ATS scenario, its IO ASID/VMID encoding space can support<br>
   a very large number of address spaces. We use two reverse mapping<br>
   tables to let the hardware translate S1/S2.PGD.PPN into IO ASID/VMID.<br>
<br>
ASYNC BROADCAST SFENCE.VMA<br>
===========================<br>
<br>
To support the high latency broadcast sfence.vma operation in the PCI ATS<br>
usage scenario, we modify the sfence.vma from synchronous mode to<br>
asynchronous mode. (For simpler implementation, if hardware only implement<br>
synchronous mode and software still work in asynchronous mode)<br>
<br>
To implement the asynchronous mode, 3 features are added:<br>
 1) sstatus:TLBI<br>
    A &quot;status bit - TLBI&quot; is added to the sstatus register. The TLBI status<br>
    bit indicates if there are still outstanding sfence.vma requests on the<br>
    current hart.<br>
    Value:<br>
      1: sfence.vma requests are not completed.<br>
      0: all sfece.vma requests completed, request queue is empty.<br>
<br>
 2) sstatus:TLBIC<br>
    A &quot;control bits - TLBIC&quot; is added to sstatus register. The TLBIC control<br>
    bits are controlled by software.<br>
    &quot;Write 1&quot; will trigger the current hart check to see if there are still<br>
    outstanding sfence.vma requests. If there are unfinished requests, an<br>
    interrupt will be generated when the request is completed, notifying the<br>
    software that all of the current sfence.vma requests have been completed.<br>
    &quot;Write 0&quot; will cause nothing.<br>
<br>
 3) supervisor interrupt register (sip &amp; sie):TLBI finish interrupt<br>
    A per-hart interrupt is added to supervisor interrupt registers.<br>
    When all sfence.vma requests are completed and sstatus:TLBIC has been<br>
    triggered, hart will receive a TLBI finish interrupt. Just like timer,<br>
    software and external interrupt&#39;s definition in sip &amp; sie.<br>
<br>
Fake code:<br>
<br>
flush_tlb_page(vma, addr) {<br>
    asid = cpu_asid(vma-&gt;vm_mm);<br>
    ppn = PFN_DOWN(vma-&gt;vm_mm-&gt;pgd);<br>
<br>
    sfence.vma (addr, 1|PPN_OFFSET(ppn)|asid); //1. start request<br>
<br>
    while(sstatus:TLBI) if (time_out() &gt; 1ms) break; //2. loop check<br>
<br>
    while (sstatus:TLBI) {<br>
        ...<br>
        set sstatus:TLBIC;<br>
        wait_TLBI_finish_interrupt(); //3. wait irq, io_schedule<br>
    }<br>
}<br>
<br>
Here we give 2 level check:<br>
 1) loop check sstatus:TLBI, CPU could response Interrupt.<br>
 2) set sstatus:TLBIC and wait for irq, CPU schedule out for other task.<br>
<br>
ACE-DVM Example<br>
===============<br>
<br>
Honestly, &quot;broadcasting addr, asid, vmid, S1/S2.PGD.PPN to interconnects&quot;<br>
and &quot;ASYNC SFENCE.VMA&quot; could be implemented by ACE-DVM protocol ref [2].<br>
<br>
There are 3 types of transactions in DVM:<br>
<br>
 - DVM operation<br>
   Send all information to the interconnect, including addr, asid,<br>
   S1.PGD.PPN, vmid, S2.PGD.PPN.<br>
<br>
 - DVM synchronization<br>
   Check that all DVM operations have been completed. If not, it will use<br>
   state machine to wait DVM complete requests.<br>
<br>
 - DVM complete<br>
   Return transaction from components, eg: IOMMU. If hart has received all<br>
   DVM completes which are triggered by sfence.vma instructions and<br>
   &quot;sstatus:TLBIC&quot; has been set, a TLBI finish interrupt is triggered.<br>
<br>
(Actually, we do not need to implement the above functions strictly<br>
 according to the ACE specification :P )<br>
<br>
 1: <a href="https://www.linuxplumbersconf.org/event/4/contributions/307/" rel="noreferrer" target="_blank">https://www.linuxplumbersconf.org/event/4/contributions/307/</a><br>
 2: AMBA AXI and ACE Protocol Specification - Distributed Virtual Memory<br>
    Transactions&quot;<br>
<br>
Signed-off-by: Guo Ren &lt;<a href="mailto:ren_guo@c-sky.com" target="_blank">ren_guo@c-sky.com</a>&gt;<br>
Reviewed-by: Li Feiteng &lt;<a href="mailto:feiteng_li@c-sky.com" target="_blank">feiteng_li@c-sky.com</a>&gt;<br>
---<br>
 src/hypervisor.tex |  43 ++++++++-------<br>
 src/supervisor.tex | 155 +++++++++++++++++++++++++++++++++++++++++------------<br>
 2 files changed, 143 insertions(+), 55 deletions(-)<br>
<br>
diff --git a/src/hypervisor.tex b/src/hypervisor.tex<br>
index 47b90b2..3718819 100644<br>
--- a/src/hypervisor.tex<br>
+++ b/src/hypervisor.tex<br>
@@ -1094,15 +1094,15 @@ The hypervisor extension adds two new privileged fence instructions.<br>
 \multicolumn{1}{c|}{opcode} \\<br>
 \hline<br>
 7 &amp; 5 &amp; 5 &amp; 3 &amp; 5 &amp; 7 \\<br>
-HFENCE.GVMA &amp; vmid &amp; gaddr &amp; PRIV &amp; 0 &amp; SYSTEM \\<br>
-HFENCE.VVMA &amp; asid &amp; vaddr &amp; PRIV &amp; 0 &amp; SYSTEM \\<br>
+HFENCE.GVMA &amp; mode:ppn:vmid &amp; gaddr &amp; PRIV &amp; 0 &amp; SYSTEM \\<br>
+HFENCE.VVMA &amp; mode:ppn:asid &amp; vaddr &amp; PRIV &amp; 0 &amp; SYSTEM \\<br>
 \end{tabular}<br>
 \end{center}<br>
<br>
 The hypervisor memory-management fence instructions, HFENCE.GVMA and<br>
 HFENCE.VVMA, are valid only in HS-mode when {\tt mstatus}.TVM=0, or in M-mode<br>
 (irrespective of {\tt mstatus}.TVM).<br>
-These instructions perform a function similar to SFENCE.VMA<br>
+These instructions perform a function similar to SFENCE.VMA (broadcast/local)<br>
 (Section~\ref{sec:sfence.vma}), except applying to the guest-physical<br>
 memory-management data structures controlled by CSR {\tt hgatp} (HFENCE.GVMA)<br>
 or the VS-level memory-management data structures controlled by CSR {\tt vsatp}<br>
@@ -1136,11 +1136,10 @@ An HFENCE.VVMA instruction applies only to a single virtual machine, identified<br>
 by the setting of {\tt hgatp}.VMID when HFENCE.VVMA executes.<br>
 \end{commentary}<br>
<br>
-When {\em rs2}$\neq${\tt x0}, bits XLEN-1:ASIDMAX of the value held in {\em<br>
-rs2} are reserved for future use and should be zeroed by software and ignored<br>
-by current implementations.<br>
-Furthermore, if ASIDLEN~$&lt;$~ASIDMAX, the implementation shall ignore bits<br>
-ASIDMAX-1:ASIDLEN of the value held in {\em rs2}.<br>
+When {\em rs2}$\neq${\tt x0}, its value contains three fields: mode, ppn, and asid.<br>
+1) mode controls whether HFENCE.VVMA is broadcast or local.<br>
+2) ppn is the root page table&#39;s PPN of the asid address space.<br>
+3) asid is the identifier of the process in the virtual machine.<br>
<br>
 \begin{commentary}<br>
 Simpler implementations of HFENCE.VVMA can ignore the guest virtual address in<br>
@@ -1168,11 +1167,10 @@ physical addresses in PMP address registers (Section~\ref{sec:pmp}) and in page<br>
 table entries (Sections \ref{sec:sv32}, \ref{sec:sv39}, and~\ref{sec:sv48}).<br>
 \end{commentary}<br>
<br>
-When {\em rs2}$\neq${\tt x0}, bits XLEN-1:VMIDMAX of the value held in {\em<br>
-rs2} are reserved for future use and should be zeroed by software and ignored<br>
-by current implementations.<br>
-Furthermore, if VMIDLEN~$&lt;$~VMIDMAX, the implementation shall ignore bits<br>
-VMIDMAX-1:VMIDLEN of the value held in {\em rs2}.<br>
+When {\em rs2}$\neq${\tt x0}, its value contains three fields: mode, ppn, and vmid.<br>
+1) mode controls whether HFENCE.GVMA is broadcast or local.<br>
+2) ppn is the root page table&#39;s PPN of the vmid address space.<br>
+3) vmid is the identifier of the virtual machine.<br>
<br>
 \begin{commentary}<br>
 Simpler implementations of HFENCE.GVMA can ignore the guest physical address in<br>
@@ -1567,21 +1565,22 @@ register.<br>
 \subsection{Memory-Management Fences}<br>
<br>
 The behavior of the SFENCE.VMA instruction is affected by the current<br>
-virtualization mode V.  When V=0, the virtual-address argument is an HS-level<br>
-virtual address, and the ASID argument is an HS-level ASID.<br>
+virtualization mode V.  When V=0, the rs1 argument is an HS-level<br>
+virtual address, and the rs2 argument is an HS-level ASID and root page table&#39;s PPN.<br>
 The instruction orders stores only to HS-level address-translation structures<br>
 with subsequent HS-level address translations.<br>
<br>
-When V=1, the virtual-address argument to SFENCE.VMA is a guest virtual<br>
-address within the current virtual machine, and the ASID argument is a VS-level<br>
-ASID within the current virtual machine.<br>
+When V=1, the rs1 argument to SFENCE.VMA is a guest virtual<br>
+address within the current virtual machine, and the rs2 argument is a VS-level<br>
+ASID and root page table&#39;s PPN within the current virtual machine.<br>
 The current virtual machine is identified by the VMID field of CSR {\tt hgatp},<br>
-and the effective ASID can be considered to be the combination of this VMID<br>
-with the VS-level ASID.<br>
+and the effective ASID and root page table&#39;s PPN can be considered to be the<br>
+combination of this VMID and root page table&#39;s PPN with the VS-level ASID and<br>
+root page table&#39;s PPN.<br>
 The SFENCE.VMA instruction orders stores only to the VS-level<br>
 address-translation structures with subsequent VS-level address translations<br>
-for the same virtual machine, i.e., only when {\tt hgatp}.VMID is the same as<br>
-when the SFENCE.VMA executed.<br>
+for the same virtual machine, i.e., only when {\tt hgatp}.VMID and {\tt hgatp}.PPN are<br>
+the same as when the SFENCE.VMA executed.<br>
<br>
 Hypervisor instructions HFENCE.GVMA and HFENCE.VVMA provide additional<br>
 memory-management fences to complement SFENCE.VMA.<br>
diff --git a/src/supervisor.tex b/src/supervisor.tex<br>
index ba3ced5..2877b7a 100644<br>
--- a/src/supervisor.tex<br>
+++ b/src/supervisor.tex<br>
@@ -47,10 +47,12 @@ register keeps track of the processor&#39;s current operating state.<br>
 \begin{center}<br>
 \setlength{\tabcolsep}{4pt}<br>
 \scalebox{0.95}{<br>
-\begin{tabular}{cWcccccWccccWcc}<br>
+\begin{tabular}{cccWcccccWccccWcc}<br>
 \\<br>
 \instbit{31} &amp;<br>
-\instbitrange{30}{20} &amp;<br>
+\instbit{30} &amp;<br>
+\instbit{29} &amp;<br>
+\instbitrange{28}{20} &amp;<br>
 \instbit{19} &amp;<br>
 \instbit{18} &amp;<br>
 \instbit{17} &amp;<br>
@@ -66,6 +68,8 @@ register keeps track of the processor&#39;s current operating state.<br>
 \instbit{0} \\<br>
 \hline<br>
 \multicolumn{1}{|c|}{SD} &amp;<br>
+\multicolumn{1}{|c|}{TLBI} &amp;<br>
+\multicolumn{1}{|c|}{TLBIC} &amp;<br>
 \multicolumn{1}{c|}{\wpri} &amp;<br>
 \multicolumn{1}{c|}{MXR} &amp;<br>
 \multicolumn{1}{c|}{SUM} &amp;<br>
@@ -82,7 +86,7 @@ register keeps track of the processor&#39;s current operating state.<br>
 \multicolumn{1}{c|}{\wpri}<br>
 \\<br>
 \hline<br>
-1 &amp; 11 &amp; 1 &amp; 1 &amp; 1 &amp; 2 &amp; 2 &amp; 4 &amp; 1 &amp; 1 &amp; 1 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
+1 &amp; 1 &amp; 1 &amp; 10 &amp; 1 &amp; 1 &amp; 1 &amp; 2 &amp; 2 &amp; 4 &amp; 1 &amp; 1 &amp; 1 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
 \end{tabular}}<br>
 \end{center}<br>
 }<br>
@@ -95,10 +99,12 @@ register keeps track of the processor&#39;s current operating state.<br>
 {\footnotesize<br>
 \begin{center}<br>
 \setlength{\tabcolsep}{4pt}<br>
-\begin{tabular}{cMFScccc}<br>
+\begin{tabular}{cccMFScccc}<br>
 \\<br>
 \instbit{SXLEN-1} &amp;<br>
-\instbitrange{SXLEN-2}{34} &amp;<br>
+\instbit{SXLEN-2} &amp;<br>
+\instbit{SXLEN-3} &amp;<br>
+\instbitrange{SXLEN-4}{34} &amp;<br>
 \instbitrange{33}{32} &amp;<br>
 \instbitrange{31}{20} &amp;<br>
 \instbit{19} &amp;<br>
@@ -107,6 +113,8 @@ register keeps track of the processor&#39;s current operating state.<br>
  \\<br>
 \hline<br>
 \multicolumn{1}{|c|}{SD} &amp;<br>
+\multicolumn{1}{|c|}{TLBI} &amp;<br>
+\multicolumn{1}{|c|}{TLBIC} &amp;<br>
 \multicolumn{1}{c|}{\wpri} &amp;<br>
 \multicolumn{1}{c|}{UXL[1:0]} &amp;<br>
 \multicolumn{1}{c|}{\wpri} &amp;<br>
@@ -115,7 +123,7 @@ register keeps track of the processor&#39;s current operating state.<br>
 \multicolumn{1}{c|}{\wpri} &amp;<br>
  \\<br>
 \hline<br>
-1 &amp; SXLEN-35 &amp; 2 &amp; 12 &amp; 1 &amp; 1 &amp; 1 &amp; \\<br>
+1 &amp; 1 &amp; 1 &amp; SXLEN-37 &amp; 2 &amp; 12 &amp; 1 &amp; 1 &amp; 1 &amp; \\<br>
 \end{tabular}<br>
 \begin{tabular}{cWWFccccWcc}<br>
 \\<br>
@@ -152,6 +160,17 @@ register keeps track of the processor&#39;s current operating state.<br>
 \label{sstatusreg}<br>
 \end{figure*}<br>
<br>
+The TLBI (read-only) bit indicates whether any asynchronous sfence.vma operations are<br>
+still pending on the hart. A value of 0 means that no sfence.vma<br>
+operations are pending, and a value of 1 means that sfence.vma operations are still<br>
+pending on the hart.<br>
+<br>
+When the sstatus:TLBIC bit is written with 1, it triggers the hardware to check whether<br>
+any TLB invalidate operations are still pending. When all operations have<br>
+finished, a TLB Invalidate finish interrupt is triggered<br>
+(see Section~\ref{sipreg}). Writing 0 to the sstatus:TLBIC bit has<br>
+no effect. Reading the sstatus:TLBIC bit always returns 0.<br>
+<br>
 The SPP bit indicates the privilege level at which a hart was executing before<br>
 entering supervisor mode.  When a trap is taken, SPP is set to 0 if the trap<br>
 originated from user mode, or 1 otherwise.  When an SRET instruction<br>
@@ -329,8 +348,10 @@ SXLEN-bit read/write register containing interrupt enable bits.<br>
 {\footnotesize<br>
 \begin{center}<br>
 \setlength{\tabcolsep}{4pt}<br>
-\begin{tabular}{KcFcFcc}<br>
-\instbitrange{SXLEN-1}{10} &amp;<br>
+\begin{tabular}{KcFcFcFcc}<br>
+\instbitrange{SXLEN-1}{14} &amp;<br>
+\instbit{13} &amp;<br>
+\instbitrange{12}{10} &amp;<br>
 \instbit{9} &amp;<br>
 \instbitrange{8}{6} &amp;<br>
 \instbit{5} &amp;<br>
@@ -339,6 +360,8 @@ SXLEN-bit read/write register containing interrupt enable bits.<br>
 \instbit{0} \\<br>
 \hline<br>
 \multicolumn{1}{|c|}{\wpri} &amp;<br>
+\multicolumn{1}{c|}{STLBIP} &amp;<br>
+\multicolumn{1}{|c|}{\wpri} &amp;<br>
 \multicolumn{1}{c|}{SEIP} &amp;<br>
 \multicolumn{1}{c|}{\wpri} &amp;<br>
 \multicolumn{1}{c|}{STIP} &amp;<br>
@@ -346,7 +369,7 @@ SXLEN-bit read/write register containing interrupt enable bits.<br>
 \multicolumn{1}{c|}{SSIP} &amp;<br>
 \multicolumn{1}{c|}{\wpri} \\<br>
 \hline<br>
-SXLEN-10 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
+SXLEN-14 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
 \end{tabular}<br>
 \end{center}<br>
 }<br>
@@ -359,8 +382,10 @@ SXLEN-10 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
 {\footnotesize<br>
 \begin{center}<br>
 \setlength{\tabcolsep}{4pt}<br>
-\begin{tabular}{KcFcFcc}<br>
-\instbitrange{SXLEN-1}{10} &amp;<br>
+\begin{tabular}{KcFcFcFcc}<br>
+\instbitrange{SXLEN-1}{14} &amp;<br>
+\instbit{13} &amp;<br>
+\instbitrange{12}{10} &amp;<br>
 \instbit{9} &amp;<br>
 \instbitrange{8}{6} &amp;<br>
 \instbit{5} &amp;<br>
@@ -369,6 +394,8 @@ SXLEN-10 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
 \instbit{0} \\<br>
 \hline<br>
 \multicolumn{1}{|c|}{\wpri} &amp;<br>
+\multicolumn{1}{c|}{STLBIE} &amp;<br>
+\multicolumn{1}{|c|}{\wpri} &amp;<br>
 \multicolumn{1}{c|}{SEIE} &amp;<br>
 \multicolumn{1}{c|}{\wpri} &amp;<br>
 \multicolumn{1}{c|}{STIE} &amp;<br>
@@ -376,7 +403,7 @@ SXLEN-10 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
 \multicolumn{1}{c|}{SSIE} &amp;<br>
 \multicolumn{1}{c|}{\wpri} \\<br>
 \hline<br>
-SXLEN-10 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
+SXLEN-14 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 3 &amp; 1 &amp; 1 \\<br>
 \end{tabular}<br>
 \end{center}<br>
 }<br>
@@ -410,6 +437,12 @@ when the SEIE bit in the {\tt sie} register is clear.  The implementation<br>
 should provide facilities to mask, unmask, and query the cause of external<br>
 interrupts.<br>
<br>
+A supervisor-level TLB Invalidate finish interrupt is pending if the STLBIP bit<br>
+in the {\tt sip} register is set.  Supervisor-level TLB Invalidate finish<br>
+interrupts are disabled when the STLBIE bit in the {\tt sie} register is clear.<br>
+When the hart&#39;s TLB invalidate operations have finished, hardware changes the sstatus:TLBI<br>
+bit from 1 to 0 and triggers the TLB Invalidate finish interrupt.<br>
+<br>
 \begin{commentary}<br>
 The {\tt sip} and {\tt sie} registers are subsets of the {\tt mip} and {\tt<br>
 mie} registers.  Reading any field, or writing any writable field, of {\tt<br>
@@ -598,7 +631,9 @@ so is only guaranteed to hold supported exception codes.<br>
   1         &amp; 5               &amp; Supervisor timer interrupt \\<br>
   1         &amp; 6--8            &amp; {\em Reserved} \\<br>
   1         &amp; 9               &amp; Supervisor external interrupt \\<br>
-  1         &amp; 10--15          &amp; {\em Reserved} \\<br>
+  1         &amp; 10--11          &amp; {\em Reserved} \\<br>
+  1         &amp; 12              &amp; Supervisor TLBI finish interrupt \\<br>
+  1         &amp; 13--15          &amp; {\em Reserved} \\<br>
   1         &amp; $\ge$16         &amp; {\em Available for platform use} \\ \hline<br>
   0         &amp; 0               &amp; Instruction address misaligned \\<br>
   0         &amp; 1               &amp; Instruction access fault \\<br>
@@ -884,7 +919,7 @@ provided.<br>
 \multicolumn{1}{c|}{opcode} \\<br>
 \hline<br>
 7 &amp; 5 &amp; 5 &amp; 3 &amp; 5 &amp; 7 \\<br>
-SFENCE.VMA &amp; asid &amp; vaddr &amp; PRIV &amp; 0 &amp; SYSTEM \\<br>
+SFENCE.VMA &amp; mode:ppn:asid &amp; vaddr &amp; LOCAL &amp; 0 &amp; SYSTEM \\<br>
 \end{tabular}<br>
 \end{center}<br>
<br>
@@ -899,21 +934,70 @@ from that hart to the memory-management data structures.<br>
 Further details on the behavior of this instruction are<br>
 described in Section~\ref{virt-control} and Section~\ref{pmp-vmem}.<br>
<br>
+SFENCE.VMA is defined as an asynchronous completion instruction, which means<br>
+that the TLB operation is not guaranteed to complete when the instruction retires.<br>
+Software must check sstatus:TLBI to determine whether all TLB operations have completed.<br>
+The sstatus:TLBI bit is described in Section~\ref{sstatus}. When hardware changes<br>
+the sstatus:TLBI bit from 1 to 0, the TLB Invalidate finish interrupt is<br>
+triggered.<br>
+<br>
 \begin{commentary}<br>
-The SFENCE.VMA is used to flush any local hardware caches related to<br>
+The SFENCE.VMA is used to flush any local/remote hardware caches related to<br>
 address translation.  It is specified as a fence rather than a TLB<br>
 flush to provide cleaner semantics with respect to which instructions<br>
 are affected by the flush operation and to support a wider variety of<br>
 dynamic caching structures and memory-management schemes.  SFENCE.VMA<br>
 is also used by higher privilege levels to synchronize page table<br>
-writes and the address translation hardware.<br>
+writes and the address translation hardware. A mode bit determines whether<br>
+sfence.vma is broadcast on the interconnect or not.<br>
 \end{commentary}<br>
<br>
-SFENCE.VMA orders only the local hart&#39;s implicit references to the<br>
-memory-management data structures.<br>
+\begin{figure}[h!]<br>
+{\footnotesize<br>
+\begin{center}<br>
+\begin{tabular}{c@{}E@{}K}<br>
+\instbit{31} &amp;<br>
+\instbitrange{30}{9} &amp;<br>
+\instbitrange{8}{0} \\<br>
+\hline<br>
+\multicolumn{1}{|c|}{{\tt MODE}} &amp;<br>
+\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &amp;<br>
+\multicolumn{1}{|c|}{{\tt ASID}} \\<br>
+\hline<br>
+1 &amp; 22 &amp; 9 \\<br>
+\end{tabular}<br>
+\end{center}<br>
+}<br>
+\vspace{-0.1in}<br>
+\caption{RV32 sfence.vma rs2 format.}<br>
+\label{rv32satp}<br>
+\end{figure}<br>
+<br>
+\begin{figure}[h!]<br>
+{\footnotesize<br>
+\begin{center}<br>
+\begin{tabular}{@{}S@{}T@{}U}<br>
+\instbitrange{63}{60} &amp;<br>
+\instbitrange{59}{16} &amp;<br>
+\instbitrange{15}{0} \\<br>
+\hline<br>
+\multicolumn{1}{|c|}{{\tt MODE}} &amp;<br>
+\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &amp;<br>
+\multicolumn{1}{|c|}{{\tt ASID}} \\<br>
+\hline<br>
+4 &amp; 44 &amp; 16 \\<br>
+\end{tabular}<br>
+\end{center}<br>
+}<br>
+\vspace{-0.1in}<br>
+\caption{RV64 sfence.vma rs2 format, for MODE values, only highest bit:63 is<br>
+valid and others are reserved.}<br>
+\label{rv64satp}<br>
+\end{figure}<br>
<br>
 \begin{commentary}<br>
-Consequently, other harts must be notified separately when the<br>
+The highest bit of mode controls sfence.vma behavior: 1 for broadcast, 0 for local.<br>
+If only local mode is available, other harts must be notified separately when the<br>
 memory-management data structures have been modified.<br>
 One approach is to use 1)<br>
 a local data fence to ensure local writes are visible globally, then<br>
@@ -928,8 +1012,17 @@ modified for a single address mapping (i.e., one page or superpage), {\em rs1}<br>
 can specify a virtual address within that mapping to effect a translation<br>
 fence for that mapping only.  Furthermore, for the common case that the<br>
 translation data structures have only been modified for a single address-space<br>
-identifier, {\em rs2} can specify the address space.  The behavior of<br>
-SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:<br>
+identifier, {\em rs2} can specify the address space in a {\tt satp}-like format<br>
+that includes the ASID and the root page table&#39;s PPN.<br>
+<br>
+\begin{commentary}<br>
+We use the ASID and the root page table&#39;s PPN to identify the address space, and the format<br>
+stored in rs2 is similar to {\tt satp}, described in Section~\ref{sec:satp}.<br>
+The ASID is used by local harts, and the root page table&#39;s PPN of that ASID is used by<br>
+other TLB systems, e.g., an IOMMU.<br>
+\end{commentary}<br>
+<br>
+The behavior of SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:<br>
<br>
 \begin{itemize}<br>
 \item If {\em rs1}={\tt x0} and {\em rs2}={\tt x0}, the fence orders all<br>
@@ -939,23 +1032,18 @@ SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:<br>
       all reads and writes made to any level of the page tables, but only<br>
       for the address space identified by integer register {\em rs2}.<br>
       Accesses to {\em global} mappings (see Section~\ref{sec:translation})<br>
-      are not ordered.<br>
+      are not ordered. The mode field in rs2 selects broadcast or local.<br>
 \item If {\em rs1}$\neq${\tt x0} and {\em rs2}={\tt x0}, the fence orders<br>
       only reads and writes made to the leaf page table entry corresponding<br>
       to the virtual address in {\em rs1}, for all address spaces.<br>
 \item If {\em rs1}$\neq${\tt x0} and {\em rs2}$\neq${\tt x0}, the fence<br>
       orders only reads and writes made to the leaf page table entry<br>
       corresponding to the virtual address in {\em rs1}, for the address<br>
-      space identified by integer register {\em rs2}.<br>
+      space identified by integer register {\em rs2}. The mode field in rs2<br>
+      selects broadcast or local.<br>
       Accesses to global mappings are not ordered.<br>
 \end{itemize}<br>
<br>
-When {\em rs2}$\neq${\tt x0}, bits SXLEN-1:ASIDMAX of the value held in {\em<br>
-rs2} are reserved for future use and should be zeroed by software and ignored<br>
-by current implementations.  Furthermore, if ASIDLEN~$&lt;$~ASIDMAX, the<br>
-implementation shall ignore bits ASIDMAX-1:ASIDLEN of the value held in {\em<br>
-rs2}.<br>
-<br>
 \begin{commentary}<br>
 Simpler implementations can ignore the virtual address in {\em rs1} and<br>
 the ASID value in {\em rs2} and always perform a global fence.<br>
@@ -994,7 +1082,7 @@ can execute the same SFENCE.VMA instruction while a different ASID is loaded<br>
 into {\tt satp}, provided the next time {\tt satp} is loaded with the recycled<br>
 ASID, it is simultaneously loaded with the new page table.<br>
<br>
-\item If the implementation does not provide ASIDs, or software chooses to<br>
+\item If the implementation does not provide ASIDs and PPNs, or software chooses to<br>
 always use ASID 0, then after every {\tt satp} write, software should execute<br>
 SFENCE.VMA with {\em rs1}={\tt x0}.  In the common case that no global<br>
 translations have been modified, {\em rs2} should be set to a register other than<br>
@@ -1003,13 +1091,14 @@ not flushed.<br>
<br>
 \item If software modifies a non-leaf PTE, it should execute SFENCE.VMA with<br>
 {\em rs1}={\tt x0}.  If any PTE along the traversal path had its G bit set,<br>
-{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID for<br>
-which the translation is being modified.<br>
+{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID and<br>
+root page table&#39;s PPN for which the translation is being modified.<br>
<br>
 \item If software modifies a leaf PTE, it should execute SFENCE.VMA with {\em<br>
 rs1} set to a virtual address within the page.  If any PTE along the traversal<br>
 path had its G bit set, {\em rs2} must be {\tt x0}; otherwise, {\em rs2}<br>
-should be set to the ASID for which the translation is being modified.<br>
+should be set to the ASID and root page table&#39;s PPN for which the translation<br>
+is being modified.<br>
<br>
 \item For the special cases of increasing the permissions on a leaf PTE and<br>
 changing an invalid PTE to a valid leaf, software may choose to execute<br>
-- <br>
2.7.4<br>
<br>
<br>
</blockquote></div></div>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [tech-privileged] [RFC PATCH V1] riscv-privileged: Add broadcast mode to sfence.vma
  2019-09-19 16:04 ` [tech-privileged] " Andrew Waterman
@ 2019-09-20  0:13   ` Guo Ren
  2019-09-20  2:27   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 4+ messages in thread
From: Guo Ren @ 2019-09-20  0:13 UTC (permalink / raw)
  To: Andrew Waterman
  Cc: julien.thierry, Catalin Marinas, Palmer Dabbelt, Will Deacon,
	Atish Patra, Julien Grall, gary, linux-riscv, kvmarm,
	Jean-Philippe Brucker, linux-csky, Mike Rapoport, Guo Ren, benh,
	tech-privileged, Marc Zyngier, linux-arm-kernel, feiteng_li,
	Anup Patel, Linux Kernel Mailing List, iommu, dwmw2

Hi,

On Fri, Sep 20, 2019 at 12:10 AM Andrew Waterman <andrew@sifive.com> wrote:
>
> This needs to be discussed and debated at length; proposing edits to the spec at this stage is putting the cart before the horse!
Agree :)

>
> We shouldn't change the definition of the existing SFENCE.VMA instruction to accomplish this. It's also not abundantly clear to me that this should be an instruction:
If you implement sfence.vma as currently defined, it would also work
with the new mechanism; the two are compatible.

> TLB shootdown looks more like MMIO.
Per-CPU MMIO? In the proposal, every hart only takes care of its own requests.




>
> On Thu, Sep 19, 2019 at 5:36 AM Guo Ren <guoren@kernel.org> wrote:
>>
>> From: Guo Ren <ren_guo@c-sky.com>
>>
>> The patch is for https://github.com/riscv/riscv-isa-manual
>>
>> The proposal was discussed at the LPC 2019 RISC-V MC (ref [1]). Here is the
>> formal patch.
>>
>> Introduction
>> ============
>>
>> Using the Hardware TLB broadcast invalidation instruction to maintain the
>> system TLB is a good choice and it'll simplify the system software design.
>> The proposal hopes to add a broadcast mode to the sfence.vma in the
>> riscv-privilege specification. To support the sfence.vma broadcast mode,
>> two modifications are introduced below:
>>
>>  1) Add PGD.PPN (root page table's PPN) as the unique identifier of the
>>     address space in addition to asid/vmid. Compared to the dynamically
>>     changed asid/vmid, PGD.PPN is fixed throughout the address space life
>>     cycle. This feature enables uniform address space identification
>>     between different TLB systems (actually, it's difficult to unify the
>>     asid/vmid between the CPU system and the IOMMU system, because their
>>     mechanisms are different)
>>
>>  2) Modify the definition of the sfence.vma instruction from synchronous
>>     mode to asynchronous mode, which means that the completion of the TLB
>>     operation is not guaranteed when the sfence.vma instruction retires.
>>     Completion must be confirmed by checking a flag bit on the hart. The
>>     completion of sfence.vma requests can be signaled to software by an
>>     interrupt. This function alleviates the large delay of TLB invalidation
>>     in the PCI ATS system.
>>
>> Add S1/S2.PGD.PPN for ASID/VMID
>> ===============================
>>
>> PGD is the page global directory (a Linux term) and PPN is the physical page
>> number (defined in the RISC-V spec). PGD.PPN corresponds to the root page
>> table pointer of the address space, i.e. mm->pgd (Linux concept).
>>
>> In the CPU/IOMMU TLB, we use asid/vmid to distinguish the address space of a
>> process or virtual machine. Because the ID encoding space is limited,
>> asid/vmid can only represent a part (window) of the address spaces.
>> S1/S2.PGD.PPN are the root page table PPNs of the address spaces and serve
>> as their unique identifiers.
>>
>> For the CPU SMP system, you can use context switch to perform the necessary
>> software mechanism to ensure that the asid/vmid on all harts is consistent
>> (please refer to the arm64 asid mechanism). In this way, the TLB broadcast
>> invalidation instruction can determine the address space processed on all
>> harts by asid/vmid.
>>
>> Different from the CPU SMP system, there is no context switch for the
>> DMA-IOMMU system, so the unification with the CPU asid/vmid cannot be
>> guaranteed. So we need a unique identifier for the address space to
>> establish a communication bridge between the TLBs of different systems.
>>
>> That is PGD.PPN (for virtualization scenarios: S1/S2.PGD.PPN)
>>
>> current:
>>  sfence.vma  rs1 = vaddr, rs2 = asid
>>  hfence.vvma rs1 = vaddr, rs2 = asid
>>  hfence.gvma rs1 = gaddr, rs2 = vmid
>>
>> proposed:
>>  sfence.vma  rs1 = vaddr, rs2 = mode:ppn:asid
>>  hfence.vvma rs1 = vaddr, rs2 = mode:ppn:asid
>>  hfence.gvma rs1 = gaddr, rs2 = mode:ppn:vmid
>>
>>  mode      - broadcast | local
>>  ppn       - the PPN of the address space of the root page table
>>  vmid/asid - the window identifier of the address space
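
The proposed rs2 layout can be illustrated in C. The sketch below packs and
unpacks mode:ppn:asid using the RV64 field positions from the rs2 format
figure in the patch (MODE in bits 63:60 with only bit 63 defined, PPN in bits
59:16, ASID in bits 15:0). The macro and helper names are hypothetical and
are not part of any existing kernel or spec interface.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions from the proposed RV64 sfence.vma rs2 format. */
#define SFENCE_RS2_BROADCAST  (UINT64_C(1) << 63)          /* MODE, bit 63  */
#define SFENCE_RS2_PPN_SHIFT  16                           /* PPN, 59:16    */
#define SFENCE_RS2_PPN_MASK   ((UINT64_C(1) << 44) - 1)    /* 44-bit PPN    */
#define SFENCE_RS2_ASID_MASK  UINT64_C(0xffff)             /* ASID, 15:0    */

/* Hypothetical helper: build the value that would be passed in rs2. */
static inline uint64_t sfence_rs2_pack(int broadcast, uint64_t ppn,
                                       uint16_t asid)
{
    /* The root page table PPN must fit in the 44-bit field. */
    assert((ppn & ~SFENCE_RS2_PPN_MASK) == 0);

    uint64_t v = (ppn << SFENCE_RS2_PPN_SHIFT) | asid;
    if (broadcast)
        v |= SFENCE_RS2_BROADCAST;
    return v;
}

/* Hypothetical accessors, e.g. for a model of the receiving TLB/IOMMU. */
static inline uint64_t sfence_rs2_ppn(uint64_t rs2)
{
    return (rs2 >> SFENCE_RS2_PPN_SHIFT) & SFENCE_RS2_PPN_MASK;
}

static inline uint16_t sfence_rs2_asid(uint64_t rs2)
{
    return (uint16_t)(rs2 & SFENCE_RS2_ASID_MASK);
}

static inline int sfence_rs2_is_broadcast(uint64_t rs2)
{
    return (int)((rs2 >> 63) & 1);
}
```

For example, sfence_rs2_pack(1, 0x12345, 0x42) would describe a broadcast
fence for the address space whose root page table PPN is 0x12345 and whose
local ASID is 0x42.
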
>>
>> At the Linux Plumbers Conference 2019 RISC-V MC (ref [1]), we showed two
>> IOMMU examples to explain how it works with hardware.
>>
>> 1) In a lightweight IOMMU system (up to 64 address spaces), the hardware
>>    could directly convert PGD.PPN into DID (IOMMU ASID)
>>
>> 2) For the PCI ATS scenario, its IO ASID/VMID encoding space can support
>>    a very large number of address spaces. We use two reverse mapping
>>    tables to let the hardware translate S1/S2.PGD.PPN into IO ASID/VMID.
>>
>> ASYNC BROADCAST SFENCE.VMA
>> ===========================
>>
>> To support the high latency broadcast sfence.vma operation in the PCI ATS
>> usage scenario, we modify the sfence.vma from synchronous mode to
>> asynchronous mode. (For a simpler implementation, hardware may implement only
>> synchronous mode while software still works in asynchronous mode.)
>>
>> To implement the asynchronous mode, 3 features are added:
>>  1) sstatus:TLBI
>>     A "status bit - TLBI" is added to the sstatus register. The TLBI status
>>     bit indicates if there are still outstanding sfence.vma requests on the
>>     current hart.
>>     Value:
>>       1: sfence.vma requests are not completed.
>>       0: all sfence.vma requests completed, request queue is empty.
>>
>>  2) sstatus:TLBIC
>>     A "control bit - TLBIC" is added to the sstatus register. The TLBIC
>>     control bit is written by software.
>>     "Write 1" will trigger the current hart to check whether there are still
>>     outstanding sfence.vma requests. If there are unfinished requests, an
>>     interrupt will be generated when the requests complete, notifying the
>>     software that all of the current sfence.vma requests have been completed.
>>     "Write 0" has no effect.
>>
>>  3) supervisor interrupt register (sip & sie):TLBI finish interrupt
>>     A per-hart interrupt is added to supervisor interrupt registers.
>>     When all sfence.vma requests are completed and sstatus:TLBIC has been
>>     triggered, the hart will receive a TLBI finish interrupt, defined in
>>     sip & sie just like the timer, software and external interrupts.
>>
>> Fake code:
>>
>> flush_tlb_page(vma, addr) {
>>     asid = cpu_asid(vma->vm_mm);
>>     ppn = PFN_DOWN(vma->vm_mm->pgd);
>>
>>     sfence.vma (addr, 1|PPN_OFFSET(ppn)|asid); //1. start request
>>
>>     while(sstatus:TLBI) if (time_out() > 1ms) break; //2. loop check
>>
>>     while (sstatus:TLBI) {
>>         ...
>>         set sstatus:TLBIC;
>>         wait_TLBI_finish_interrupt(); //3. wait irq, io_schedule
>>     }
>> }
>>
>> Here we give a two-level check:
>>  1) Poll sstatus:TLBI in a loop; the CPU can still respond to interrupts.
>>  2) Set sstatus:TLBIC and wait for the irq; the CPU schedules out to run other tasks.
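
A minimal sketch of the two-level wait, written as a host-side C model rather
than real CSR accesses. struct tlbi_hw, tlbi_pending() and tlbi_wait_done()
are hypothetical names; the model simply clears the TLBI flag after a fixed
number of reads so that the control flow can be exercised.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-hart state the proposal describes. */
struct tlbi_hw {
    bool tlbi;           /* models sstatus:TLBI (true = requests pending)  */
    bool irq_armed;      /* set once software "writes 1" to sstatus:TLBIC  */
    unsigned remaining;  /* reads that still report pending in this model  */
};

/* Models one read of sstatus:TLBI; each read advances the fake hardware. */
static bool tlbi_pending(struct tlbi_hw *hw)
{
    if (hw->remaining == 0)
        hw->tlbi = false;
    else
        hw->remaining--;
    return hw->tlbi;
}

enum tlbi_wait { TLBI_DONE_POLLING, TLBI_DONE_IRQ };

/* Level 1: bounded polling. Level 2: arm TLBIC and wait for completion. */
static enum tlbi_wait tlbi_wait_done(struct tlbi_hw *hw, unsigned spin_budget)
{
    bool waited_irq = false;

    while (spin_budget--) {
        if (!tlbi_pending(hw))
            return TLBI_DONE_POLLING;
    }

    while (tlbi_pending(hw)) {
        hw->irq_armed = true;  /* models "set sstatus:TLBIC" */
        waited_irq = true;     /* a kernel would sleep here until the irq */
    }
    return waited_irq ? TLBI_DONE_IRQ : TLBI_DONE_POLLING;
}
```

The spin budget plays the role of the 1ms timeout in the fake code: short
invalidations finish in the polling loop, and only long ones (e.g. PCI ATS)
pay for arming the interrupt.
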
>>
>> ACE-DVM Example
>> ===============
>>
>> Honestly, "broadcasting addr, asid, vmid, S1/S2.PGD.PPN to interconnects"
>> and "ASYNC SFENCE.VMA" could be implemented by ACE-DVM protocol ref [2].
>>
>> There are 3 types of transactions in DVM:
>>
>>  - DVM operation
>>    Send all information to the interconnect, including addr, asid,
>>    S1.PGD.PPN, vmid, S2.PGD.PPN.
>>
>>  - DVM synchronization
>>    Check that all DVM operations have been completed. If not, it will use
>>    a state machine to wait for the DVM complete requests.
>>
>>  - DVM complete
>>    Return transaction from components, eg: IOMMU. If hart has received all
>>    DVM completes which are triggered by sfence.vma instructions and
>>    "sstatus:TLBIC" has been set, a TLBI finish interrupt is triggered.
>>
>> (Actually, we do not need to implement the above functions strictly
>>  according to the ACE specification :P )
>>
>>  1: https://www.linuxplumbersconf.org/event/4/contributions/307/
>>  2: AMBA AXI and ACE Protocol Specification - Distributed Virtual Memory
>>     Transactions
>>
>> Signed-off-by: Guo Ren <ren_guo@c-sky.com>
>> Reviewed-by: Li Feiteng <feiteng_li@c-sky.com>
>> ---
>>  src/hypervisor.tex |  43 ++++++++-------
>>  src/supervisor.tex | 155 +++++++++++++++++++++++++++++++++++++++++------------
>>  2 files changed, 143 insertions(+), 55 deletions(-)
>>
>> diff --git a/src/hypervisor.tex b/src/hypervisor.tex
>> index 47b90b2..3718819 100644
>> --- a/src/hypervisor.tex
>> +++ b/src/hypervisor.tex
>> @@ -1094,15 +1094,15 @@ The hypervisor extension adds two new privileged fence instructions.
>>  \multicolumn{1}{c|}{opcode} \\
>>  \hline
>>  7 & 5 & 5 & 3 & 5 & 7 \\
>> -HFENCE.GVMA & vmid & gaddr & PRIV & 0 & SYSTEM \\
>> -HFENCE.VVMA & asid & vaddr & PRIV & 0 & SYSTEM \\
>> +HFENCE.GVMA & mode:ppn:vmid & gaddr & PRIV & 0 & SYSTEM \\
>> +HFENCE.VVMA & mode:ppn:asid & vaddr & PRIV & 0 & SYSTEM \\
>>  \end{tabular}
>>  \end{center}
>>
>>  The hypervisor memory-management fence instructions, HFENCE.GVMA and
>>  HFENCE.VVMA, are valid only in HS-mode when {\tt mstatus}.TVM=0, or in M-mode
>>  (irrespective of {\tt mstatus}.TVM).
>> -These instructions perform a function similar to SFENCE.VMA
>> +These instructions perform a function similar to SFENCE.VMA (broadcast/local)
>>  (Section~\ref{sec:sfence.vma}), except applying to the guest-physical
>>  memory-management data structures controlled by CSR {\tt hgatp} (HFENCE.GVMA)
>>  or the VS-level memory-management data structures controlled by CSR {\tt vsatp}
>> @@ -1136,11 +1136,10 @@ An HFENCE.VVMA instruction applies only to a single virtual machine, identified
>>  by the setting of {\tt hgatp}.VMID when HFENCE.VVMA executes.
>>  \end{commentary}
>>
>> -When {\em rs2}$\neq${\tt x0}, bits XLEN-1:ASIDMAX of the value held in {\em
>> -rs2} are reserved for future use and should be zeroed by software and ignored
>> -by current implementations.
>> -Furthermore, if ASIDLEN~$<$~ASIDMAX, the implementation shall ignore bits
>> -ASIDMAX-1:ASIDLEN of the value held in {\em rs2}.
>> +When {\em rs2}$\neq${\tt x0}, its value contains three fields: mode, ppn, and asid.
>> +1) mode controls whether HFENCE.VVMA is broadcast or local.
>> +2) ppn is the root page table's PPN of the asid address space.
>> +3) asid is the identifier of the process in the virtual machine.
>>
>>  \begin{commentary}
>>  Simpler implementations of HFENCE.VVMA can ignore the guest virtual address in
>> @@ -1168,11 +1167,10 @@ physical addresses in PMP address registers (Section~\ref{sec:pmp}) and in page
>>  table entries (Sections \ref{sec:sv32}, \ref{sec:sv39}, and~\ref{sec:sv48}).
>>  \end{commentary}
>>
>> -When {\em rs2}$\neq${\tt x0}, bits XLEN-1:VMIDMAX of the value held in {\em
>> -rs2} are reserved for future use and should be zeroed by software and ignored
>> -by current implementations.
>> -Furthermore, if VMIDLEN~$<$~VMIDMAX, the implementation shall ignore bits
>> -VMIDMAX-1:VMIDLEN of the value held in {\em rs2}.
>> +When {\em rs2}$\neq${\tt x0}, its value contains three fields: mode, ppn, and vmid.
>> +1) mode controls whether HFENCE.GVMA is broadcast or local.
>> +2) ppn is the root page table's PPN of the vmid address space.
>> +3) vmid is the identifier of the virtual machine.
>>
>>  \begin{commentary}
>>  Simpler implementations of HFENCE.GVMA can ignore the guest physical address in
>> @@ -1567,21 +1565,22 @@ register.
>>  \subsection{Memory-Management Fences}
>>
>>  The behavior of the SFENCE.VMA instruction is affected by the current
>> -virtualization mode V.  When V=0, the virtual-address argument is an HS-level
>> -virtual address, and the ASID argument is an HS-level ASID.
>> +virtualization mode V.  When V=0, the rs1 argument is an HS-level
>> +virtual address, and the rs2 argument is an HS-level ASID and root page table's PPN.
>>  The instruction orders stores only to HS-level address-translation structures
>>  with subsequent HS-level address translations.
>>
>> -When V=1, the virtual-address argument to SFENCE.VMA is a guest virtual
>> -address within the current virtual machine, and the ASID argument is a VS-level
>> -ASID within the current virtual machine.
>> +When V=1, the rs1 argument to SFENCE.VMA is a guest virtual
>> +address within the current virtual machine, and the rs2 argument is a VS-level
>> +ASID and root page table's PPN within the current virtual machine.
>>  The current virtual machine is identified by the VMID field of CSR {\tt hgatp},
>> -and the effective ASID can be considered to be the combination of this VMID
>> -with the VS-level ASID.
>> +and the effective ASID and root page table's PPN can be considered to be the
>> +combination of this VMID and root page table's PPN with the VS-level ASID and
>> +root page table's PPN.
>>  The SFENCE.VMA instruction orders stores only to the VS-level
>>  address-translation structures with subsequent VS-level address translations
>> -for the same virtual machine, i.e., only when {\tt hgatp}.VMID is the same as
>> -when the SFENCE.VMA executed.
>> +for the same virtual machine, i.e., only when {\tt hgatp}.VMID and {\tt hgatp}.PPN are
>> +the same as when the SFENCE.VMA executed.
>>
>>  Hypervisor instructions HFENCE.GVMA and HFENCE.VVMA provide additional
>>  memory-management fences to complement SFENCE.VMA.
>> diff --git a/src/supervisor.tex b/src/supervisor.tex
>> index ba3ced5..2877b7a 100644
>> --- a/src/supervisor.tex
>> +++ b/src/supervisor.tex
>> @@ -47,10 +47,12 @@ register keeps track of the processor's current operating state.
>>  \begin{center}
>>  \setlength{\tabcolsep}{4pt}
>>  \scalebox{0.95}{
>> -\begin{tabular}{cWcccccWccccWcc}
>> +\begin{tabular}{cccWcccccWccccWcc}
>>  \\
>>  \instbit{31} &
>> -\instbitrange{30}{20} &
>> +\instbit{30} &
>> +\instbit{29} &
>> +\instbitrange{28}{20} &
>>  \instbit{19} &
>>  \instbit{18} &
>>  \instbit{17} &
>> @@ -66,6 +68,8 @@ register keeps track of the processor's current operating state.
>>  \instbit{0} \\
>>  \hline
>>  \multicolumn{1}{|c|}{SD} &
>> +\multicolumn{1}{|c|}{TLBI} &
>> +\multicolumn{1}{|c|}{TLBIC} &
>>  \multicolumn{1}{c|}{\wpri} &
>>  \multicolumn{1}{c|}{MXR} &
>>  \multicolumn{1}{c|}{SUM} &
>> @@ -82,7 +86,7 @@ register keeps track of the processor's current operating state.
>>  \multicolumn{1}{c|}{\wpri}
>>  \\
>>  \hline
>> -1 & 11 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1 \\
>> +1 & 1 & 1 & 10 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1 \\
>>  \end{tabular}}
>>  \end{center}
>>  }
>> @@ -95,10 +99,12 @@ register keeps track of the processor's current operating state.
>>  {\footnotesize
>>  \begin{center}
>>  \setlength{\tabcolsep}{4pt}
>> -\begin{tabular}{cMFScccc}
>> +\begin{tabular}{cccMFScccc}
>>  \\
>>  \instbit{SXLEN-1} &
>> -\instbitrange{SXLEN-2}{34} &
>> +\instbit{SXLEN-2} &
>> +\instbit{SXLEN-3} &
>> +\instbitrange{SXLEN-4}{34} &
>>  \instbitrange{33}{32} &
>>  \instbitrange{31}{20} &
>>  \instbit{19} &
>> @@ -107,6 +113,8 @@ register keeps track of the processor's current operating state.
>>   \\
>>  \hline
>>  \multicolumn{1}{|c|}{SD} &
>> +\multicolumn{1}{|c|}{TLBI} &
>> +\multicolumn{1}{|c|}{TLBIC} &
>>  \multicolumn{1}{c|}{\wpri} &
>>  \multicolumn{1}{c|}{UXL[1:0]} &
>>  \multicolumn{1}{c|}{\wpri} &
>> @@ -115,7 +123,7 @@ register keeps track of the processor's current operating state.
>>  \multicolumn{1}{c|}{\wpri} &
>>   \\
>>  \hline
>> -1 & SXLEN-35 & 2 & 12 & 1 & 1 & 1 & \\
>> +1 & 1 & 1 & SXLEN-37 & 2 & 12 & 1 & 1 & 1 & \\
>>  \end{tabular}
>>  \begin{tabular}{cWWFccccWcc}
>>  \\
>> @@ -152,6 +160,17 @@ register keeps track of the processor's current operating state.
>>  \label{sstatusreg}
>>  \end{figure*}
>>
>> +The TLBI (read-only) bit indicates whether any asynchronous sfence.vma
>> +operations are still pending on the hart. A value of 0 means that no sfence.vma
>> +operations are pending; a value of 1 means that sfence.vma operations are
>> +still pending on the hart.
>> +
>> +When the sstatus:TLBIC bit is written with 1, it triggers the hardware to check
>> +whether any TLB invalidate operations are pending. When all operations have
>> +finished, a TLB Invalidate finish interrupt will be triggered
>> +(see Section~\ref{sipreg}). Writing 0 to the sstatus:TLBIC bit has no
>> +effect. Reading the sstatus:TLBIC bit will always return 0.
>> +
>>  The SPP bit indicates the privilege level at which a hart was executing before
>>  entering supervisor mode.  When a trap is taken, SPP is set to 0 if the trap
>>  originated from user mode, or 1 otherwise.  When an SRET instruction
>> @@ -329,8 +348,10 @@ SXLEN-bit read/write register containing interrupt enable bits.
>>  {\footnotesize
>>  \begin{center}
>>  \setlength{\tabcolsep}{4pt}
>> -\begin{tabular}{KcFcFcc}
>> -\instbitrange{SXLEN-1}{10} &
>> +\begin{tabular}{KcFcFcFcc}
>> +\instbitrange{SXLEN-1}{14} &
>> +\instbit{13} &
>> +\instbitrange{12}{10} &
>>  \instbit{9} &
>>  \instbitrange{8}{6} &
>>  \instbit{5} &
>> @@ -339,6 +360,8 @@ SXLEN-bit read/write register containing interrupt enable bits.
>>  \instbit{0} \\
>>  \hline
>>  \multicolumn{1}{|c|}{\wpri} &
>> +\multicolumn{1}{c|}{STLBIP} &
>> +\multicolumn{1}{|c|}{\wpri} &
>>  \multicolumn{1}{c|}{SEIP} &
>>  \multicolumn{1}{c|}{\wpri} &
>>  \multicolumn{1}{c|}{STIP} &
>> @@ -346,7 +369,7 @@ SXLEN-bit read/write register containing interrupt enable bits.
>>  \multicolumn{1}{c|}{SSIP} &
>>  \multicolumn{1}{c|}{\wpri} \\
>>  \hline
>> -SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>> +SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
>>  \end{tabular}
>>  \end{center}
>>  }
>> @@ -359,8 +382,10 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>>  {\footnotesize
>>  \begin{center}
>>  \setlength{\tabcolsep}{4pt}
>> -\begin{tabular}{KcFcFcc}
>> -\instbitrange{SXLEN-1}{10} &
>> +\begin{tabular}{KcFcFcFcc}
>> +\instbitrange{SXLEN-1}{14} &
>> +\instbit{13} &
>> +\instbitrange{12}{10} &
>>  \instbit{9} &
>>  \instbitrange{8}{6} &
>>  \instbit{5} &
>> @@ -369,6 +394,8 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>>  \instbit{0} \\
>>  \hline
>>  \multicolumn{1}{|c|}{\wpri} &
>> +\multicolumn{1}{c|}{STLBIE} &
>> +\multicolumn{1}{|c|}{\wpri} &
>>  \multicolumn{1}{c|}{SEIE} &
>>  \multicolumn{1}{c|}{\wpri} &
>>  \multicolumn{1}{c|}{STIE} &
>> @@ -376,7 +403,7 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>>  \multicolumn{1}{c|}{SSIE} &
>>  \multicolumn{1}{c|}{\wpri} \\
>>  \hline
>> -SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
>> +SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
>>  \end{tabular}
>>  \end{center}
>>  }
>> @@ -410,6 +437,12 @@ when the SEIE bit in the {\tt sie} register is clear.  The implementation
>>  should provide facilities to mask, unmask, and query the cause of external
>>  interrupts.
>>
>> +A supervisor-level TLB Invalidate finish interrupt is pending if the STLBIP bit
>> +in the {\tt sip} register is set.  Supervisor-level TLB Invalidate finish
>> +interrupts are disabled when the STLBIE bit in the {\tt sie} register is clear.
>> +When the hart's TLB invalidate operations have finished, hardware changes the sstatus:TLBI
>> +bit from 1 to 0 and triggers a TLB Invalidate finish interrupt.
>> +
>>  \begin{commentary}
>>  The {\tt sip} and {\tt sie} registers are subsets of the {\tt mip} and {\tt
>>  mie} registers.  Reading any field, or writing any writable field, of {\tt
>> @@ -598,7 +631,9 @@ so is only guaranteed to hold supported exception codes.
>>    1         & 5               & Supervisor timer interrupt \\
>>    1         & 6--8            & {\em Reserved} \\
>>    1         & 9               & Supervisor external interrupt \\
>> -  1         & 10--15          & {\em Reserved} \\
>> +  1         & 10--11          & {\em Reserved} \\
>> +  1         & 12              & Supervisor TLBI finish interrupt \\
>> +  1         & 13--15          & {\em Reserved} \\
>>    1         & $\ge$16         & {\em Available for platform use} \\ \hline
>>    0         & 0               & Instruction address misaligned \\
>>    0         & 1               & Instruction access fault \\
>> @@ -884,7 +919,7 @@ provided.
>>  \multicolumn{1}{c|}{opcode} \\
>>  \hline
>>  7 & 5 & 5 & 3 & 5 & 7 \\
>> -SFENCE.VMA & asid & vaddr & PRIV & 0 & SYSTEM \\
>> +SFENCE.VMA & mode:ppn:asid & vaddr & LOCAL & 0 & SYSTEM \\
>>  \end{tabular}
>>  \end{center}
>>
>> @@ -899,21 +934,70 @@ from that hart to the memory-management data structures.
>>  Further details on the behavior of this instruction are
>>  described in Section~\ref{virt-control} and Section~\ref{pmp-vmem}.
>>
>> +SFENCE.VMA is defined as an asynchronous completion instruction, which means
>> +that the TLB operation is not guaranteed to complete when the instruction retires.
>> +Software needs to check sstatus:TLBI to determine whether all TLB operations have completed.
>> +The sstatus:TLBI bit is described in Section~\ref{sstatus}. When hardware changes
>> +the sstatus:TLBI bit from 1 to 0, the TLB Invalidate finish interrupt will be
>> +triggered.
>> +
>>  \begin{commentary}
>> -The SFENCE.VMA is used to flush any local hardware caches related to
>> +The SFENCE.VMA is used to flush any local/remote hardware caches related to
>>  address translation.  It is specified as a fence rather than a TLB
>>  flush to provide cleaner semantics with respect to which instructions
>>  are affected by the flush operation and to support a wider variety of
>>  dynamic caching structures and memory-management schemes.  SFENCE.VMA
>>  is also used by higher privilege levels to synchronize page table
>> -writes and the address translation hardware.
>> +writes and the address translation hardware. A mode bit determines whether
>> +sfence.vma is broadcast on the interconnect or not.
>>  \end{commentary}
>>
>> -SFENCE.VMA orders only the local hart's implicit references to the
>> -memory-management data structures.
>> +\begin{figure}[h!]
>> +{\footnotesize
>> +\begin{center}
>> +\begin{tabular}{c@{}E@{}K}
>> +\instbit{31} &
>> +\instbitrange{30}{9} &
>> +\instbitrange{8}{0} \\
>> +\hline
>> +\multicolumn{1}{|c|}{{\tt MODE}} &
>> +\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
>> +\multicolumn{1}{|c|}{{\tt ASID}} \\
>> +\hline
>> +1 & 22 & 9 \\
>> +\end{tabular}
>> +\end{center}
>> +}
>> +\vspace{-0.1in}
>> +\caption{RV32 sfence.vma rs2 format.}
>> +\label{rv32satp}
>> +\end{figure}
>> +
>> +\begin{figure}[h!]
>> +{\footnotesize
>> +\begin{center}
>> +\begin{tabular}{@{}S@{}T@{}U}
>> +\instbitrange{63}{60} &
>> +\instbitrange{59}{16} &
>> +\instbitrange{15}{0} \\
>> +\hline
>> +\multicolumn{1}{|c|}{{\tt MODE}} &
>> +\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
>> +\multicolumn{1}{|c|}{{\tt ASID}} \\
>> +\hline
>> +4 & 44 & 16 \\
>> +\end{tabular}
>> +\end{center}
>> +}
>> +\vspace{-0.1in}
>> +\caption{RV64 sfence.vma rs2 format, for MODE values, only highest bit:63 is
>> +valid and others are reserved.}
>> +\label{rv64satp}
>> +\end{figure}
>>
>>  \begin{commentary}
>> -Consequently, other harts must be notified separately when the
>> +The highest bit of mode controls sfence.vma behavior: 1 for broadcast, 0 for local.
>> +If only local mode is available, other harts must be notified separately when the
>>  memory-management data structures have been modified.
>>  One approach is to use 1)
>>  a local data fence to ensure local writes are visible globally, then
>> @@ -928,8 +1012,17 @@ modified for a single address mapping (i.e., one page or superpage), {\em rs1}
>>  can specify a virtual address within that mapping to effect a translation
>>  fence for that mapping only.  Furthermore, for the common case that the
>>  translation data structures have only been modified for a single address-space
>> -identifier, {\em rs2} can specify the address space.  The behavior of
>> -SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
>> +identifier, {\em rs2} can specify the address space using the {\tt satp} format,
>> +which includes the ASID and the root page table's PPN.
>> +
>> +\begin{commentary}
>> +We use the ASID and the root page table's PPN to identify the address space, and the
>> +format stored in rs2 is similar to {\tt satp}, described in Section~\ref{sec:satp}.
>> +The ASID is used by local harts, and the root page table's PPN is used by
>> +other TLB systems, e.g., an IOMMU.
>> +\end{commentary}
>> +
>> +The behavior of SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
>>
>>  \begin{itemize}
>>  \item If {\em rs1}={\tt x0} and {\em rs2}={\tt x0}, the fence orders all
>> @@ -939,23 +1032,18 @@ SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
>>        all reads and writes made to any level of the page tables, but only
>>        for the address space identified by integer register {\em rs2}.
>>        Accesses to {\em global} mappings (see Section~\ref{sec:translation})
>> -      are not ordered.
>> +      are not ordered. The mode field in rs2 selects broadcast or local.
>>  \item If {\em rs1}$\neq${\tt x0} and {\em rs2}={\tt x0}, the fence orders
>>        only reads and writes made to the leaf page table entry corresponding
>>        to the virtual address in {\em rs1}, for all address spaces.
>>  \item If {\em rs1}$\neq${\tt x0} and {\em rs2}$\neq${\tt x0}, the fence
>>        orders only reads and writes made to the leaf page table entry
>>        corresponding to the virtual address in {\em rs1}, for the address
>> -      space identified by integer register {\em rs2}.
>> +      space identified by integer register {\em rs2}. The mode field in rs2
>> +      selects broadcast or local.
>>        Accesses to global mappings are not ordered.
>>  \end{itemize}
>>
>> -When {\em rs2}$\neq${\tt x0}, bits SXLEN-1:ASIDMAX of the value held in {\em
>> -rs2} are reserved for future use and should be zeroed by software and ignored
>> -by current implementations.  Furthermore, if ASIDLEN~$<$~ASIDMAX, the
>> -implementation shall ignore bits ASIDMAX-1:ASIDLEN of the value held in {\em
>> -rs2}.
>> -
>>  \begin{commentary}
>>  Simpler implementations can ignore the virtual address in {\em rs1} and
>>  the ASID value in {\em rs2} and always perform a global fence.
>> @@ -994,7 +1082,7 @@ can execute the same SFENCE.VMA instruction while a different ASID is loaded
>>  into {\tt satp}, provided the next time {\tt satp} is loaded with the recycled
>>  ASID, it is simultaneously loaded with the new page table.
>>
>> -\item If the implementation does not provide ASIDs, or software chooses to
>> +\item If the implementation does not provide ASIDs and PPNs, or software chooses to
>>  always use ASID 0, then after every {\tt satp} write, software should execute
>>  SFENCE.VMA with {\em rs1}={\tt x0}.  In the common case that no global
>>  translations have been modified, {\em rs2} should be set to a register other than
>> @@ -1003,13 +1091,14 @@ not flushed.
>>
>>  \item If software modifies a non-leaf PTE, it should execute SFENCE.VMA with
>>  {\em rs1}={\tt x0}.  If any PTE along the traversal path had its G bit set,
>> -{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID for
>> -which the translation is being modified.
>> +{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to the ASID and
>> +root page table's PPN for which the translation is being modified.
>>
>>  \item If software modifies a leaf PTE, it should execute SFENCE.VMA with {\em
>>  rs1} set to a virtual address within the page.  If any PTE along the traversal
>>  path had its G bit set, {\em rs2} must be {\tt x0}; otherwise, {\em rs2}
>> -should be set to the ASID for which the translation is being modified.
>> +should be set to the ASID and root page table's PPN for which the translation
>> +is being modified.
>>
>>  \item For the special cases of increasing the permissions on a leaf PTE and
>>  changing an invalid PTE to a valid leaf, software may choose to execute
>> --
>> 2.7.4
>>
>>
>>


-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/


* Re: [tech-privileged] [RFC PATCH V1] riscv-privileged: Add broadcast mode to sfence.vma
  2019-09-19 16:04 ` [tech-privileged] " Andrew Waterman
  2019-09-20  0:13   ` Guo Ren
@ 2019-09-20  2:27   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2019-09-20  2:27 UTC (permalink / raw)
  To: Andrew Waterman, Guo Ren
  Cc: julien.thierry, catalin.marinas, palmer, will.deacon,
	Atish.Patra, julien.grall, gary, linux-riscv, kvmarm,
	jean-philippe, linux-csky, rppt, Guo Ren, tech-privileged,
	marc.zyngier, linux-arm-kernel, feiteng_li, Anup.Patel,
	linux-kernel, iommu, dwmw2

On Thu, 2019-09-19 at 09:04 -0700, Andrew Waterman wrote:
> This needs to be discussed and debated at length; proposing edits to
> the spec at this stage is putting the cart before the horse!
> 
> We shouldn’t change the definition of the existing SFENCE.VMA
> instruction to accomplish this. It’s also not abundantly clear to me
> that this should be an instruction: TLB shootdown looks more like
> MMIO.

Quite a few points to make here:

 - TLB shootdown as MMIO is a problem when you start trying to do it
directly from guests (which is very desirable). I can elaborate if you
want, but it generally boils down to having a pile of copies of the
MMIO resources to assign to guests and having to constantly change
mappings which is unrealistic.

 - I generally have very serious doubts as to the value of doing
broadcast TLB shootdowns in HW. I went on at length about it during the Linux
Plumbers Conference, but I'll try to summarise here, from my experience
back at IBM working on POWER. In no special order:

   * It doesn't scale well. You have to drain all the target CPU queues
of already translated load stores, keep ordering, etc... it causes a
giant fabric traffic jam. Experience has shown that on some POWER
systems, it becomes extremely expensive.

   * Some OSes, such as Linux, track which CPUs have seen a given context.
That allows them to "target" TLB invalidations in a more sensible way.
Broadcast instructions tend to lose that ability (it's hard to do in HW,
esp. in a way that can be virtualized properly).

   * Because those instructions can take a very long time (for the
above reasons and some of the ones below), or at least whatever context
synchronizing instruction follows and waits for completion of the
invalidations, you end up with a CPU effectively "stuck" for a long
time, not taking interrupts, including those routed to higher priority
levels (ie. hypervisor etc...). This is problematic. A completion
polling mechanism is preferable so that one can still handle such work
while waiting, but it is hard to architect/implement properly when done as
"instructions" since they can happen concurrently from multiple
contexts. It's easier with MMIO but that has other issues.

   * It introduces races with anything that does SW walk of page
tables. For example MMIO emulation by a hypervisor cannot be done race-
free if the guest can do its own broadcast invalidations and the
hypervisor has to walk the guest page tables to translate. I can
elaborate if requested.

   * Those invalidations need to also target nest agents that can hold
translations, such as IOMMUs that can operate in user contexts etc...
Such IOMMUs can take a VERY LONG time to process invalidations,
especially if translations have been checked out by devices such as
PCIe devices using ATS. Some GPUs for example can hit a worst case of
hundreds of *milliseconds* to process a TLB invalidation.

 - Now for the proposed scheme. I really don't like introducing a *new*
way of tagging an address space using the PPN. It's a hack. The right
way is to ensure that the existing context tags are big enough to not
require re-use and thus can be treated as global context tags by the
hypervisor and OS. I.e., have big enough VMIDs and ASIDs that each
running VM can have a global stable VMID and each process within a
given VM can have a global (within that VM) ASID instead of playing
ASID reallocation tricks at context switch (that's very inefficient
anyway, so that should be fixed for anything that claims performance
and scalability).
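
For reference, the tagging scheme under discussion packs three fields into
rs2. A minimal sketch in C of the RV64 layout shown in the patch's "RV64
sfence.vma rs2 format" figure (MODE in bits 63:60 with only bit 63 valid, PPN
in bits 59:16, ASID in bits 15:0). The macro and function names here are
hypothetical and only illustrate the bit layout; nothing below comes from an
existing header or from the patch itself:

```c
#include <assert.h>
#include <stdint.h>

/* Field widths taken from the proposed RV64 rs2 figure in the patch. */
#define RS2_ASID_BITS   16                /* ASID occupies bits 15:0  */
#define RS2_PPN_BITS    44                /* PPN occupies bits 59:16  */
#define RS2_PPN_SHIFT   RS2_ASID_BITS
#define RS2_BCAST_SHIFT 63                /* top MODE bit: 1 = broadcast, 0 = local */

#define RS2_ASID_MASK ((UINT64_C(1) << RS2_ASID_BITS) - 1)
#define RS2_PPN_MASK  ((UINT64_C(1) << RS2_PPN_BITS) - 1)

/* Hypothetical helper: pack mode:ppn:asid into a 64-bit rs2 value.
 * MODE bits 62:60 are reserved in the proposal and are left as zero. */
static uint64_t rs2_pack(int broadcast, uint64_t root_ppn, uint64_t asid)
{
    return ((uint64_t)(broadcast != 0) << RS2_BCAST_SHIFT) |
           ((root_ppn & RS2_PPN_MASK) << RS2_PPN_SHIFT) |
           (asid & RS2_ASID_MASK);
}

/* Hypothetical accessors for the three fields. */
static int rs2_broadcast(uint64_t rs2)
{
    return (int)(rs2 >> RS2_BCAST_SHIFT);
}

static uint64_t rs2_ppn(uint64_t rs2)
{
    return (rs2 >> RS2_PPN_SHIFT) & RS2_PPN_MASK;
}

static uint64_t rs2_asid(uint64_t rs2)
{
    return rs2 & RS2_ASID_MASK;
}
```

Out-of-range inputs are masked rather than rejected here; that is a choice made
for this sketch, not something the proposal specifies.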

Now all of these things can probably have solutions but experience
doesn't seem to indicate that it's really worthwhile. We are better off
making sure we have a really fast IPI path to perform those via
interrupts and locally to the targeted CPUs IMHO.
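
To illustrate the software-targeted approach mentioned above, here is a
minimal sketch in C of tracking which CPUs have seen an address space, so that
invalidation requests go only to those CPUs. The struct and function names are
hypothetical, the 64-CPU limit is an assumption made to keep the mask in one
word, and the actual IPI delivery is left out:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64   /* assumption: one 64-bit word is enough for the mask */

/* Hypothetical per-address-space record. */
struct addr_space {
    uint64_t seen_cpus;   /* bit n set: CPU n may hold translations */
};

/* Record that the address space was switched in on `cpu`. */
static void as_mark_cpu(struct addr_space *as, unsigned cpu)
{
    assert(cpu < MAX_CPUS);
    as->seen_cpus |= UINT64_C(1) << cpu;
}

/* CPUs that need a remote request (e.g. an IPI) when `self` invalidates;
 * `self` is excluded because it can flush its own TLB locally. */
static uint64_t as_remote_targets(const struct addr_space *as, unsigned self)
{
    assert(self < MAX_CPUS);
    return as->seen_cpus & ~(UINT64_C(1) << self);
}

/* Number of CPUs set in a target mask. */
static unsigned count_targets(uint64_t mask)
{
    unsigned n = 0;
    for (; mask; mask &= mask - 1)   /* clear the lowest set bit each pass */
        n++;
    return n;
}
```

With a context seen on CPUs 0, 3 and 5, an invalidation issued from CPU 3
only has to interrupt CPUs 0 and 5; a hardware broadcast would reach every
hart regardless.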

Cheers,
Ben.

> 
> On Thu, Sep 19, 2019 at 5:36 AM Guo Ren <guoren@kernel.org> wrote:
> > From: Guo Ren <ren_guo@c-sky.com>
> > 
> > The patch is for https://github.com/riscv/riscv-isa-manual
> > 
> > The proposal has been talked in LPC-2019 RISC-V MC ref [1]. Here is
> > the
> > formal patch.
> > 
> > Introduction
> > ============
> > 
> > Using the Hardware TLB broadcast invalidation instruction to
> > maintain the
> > system TLB is a good choice and it'll simplify the system software
> > design.
> > The proposal hopes to add a broadcast mode to the sfence.vma in the
> > riscv-privilege specification. To support the sfence.vma broadcast
> > mode,
> > there are two modification introduced below:
> > 
> >  1) Add PGD.PPN (root page table's PPN) as the unique identifier of
> > the
> >     address space in addition to asid/vmid. Compared to the
> > dynamically
> >     changed asid/vmid, PGD.PPN is fixed throughout the address
> > space life
> >     cycle. This feature enables uniform address space
> > identification
> >     between different TLB systems (actually, it's difficult to
> > unify the
> >     asid/vmid between the CPU system and the IOMMU system, because
> > their
> >     mechanisms are different)
> > 
> >  2) Modify the definition of the sfence.vma instruction from
> > synchronous
> >     mode to asynchronous mode, which means that the completion of
> > the TLB
> >     operation is not guaranteed when the sfence.vma instruction
> > retires.
> >     It needs to be completed by checking the flag bit on the hart.
> > The
> >     sfence.vma request finish can notify the software by generating
> > an
> >     interrupt. This function alleviates the large delay of TLB
> > invalidation
> >     in the PCI ATS system.
> > 
> > Add S1/S2.PGD.PPN for ASID/VMID
> > ===============================
> > 
> > PGD is the page global directory (defined in Linux) and PPN is the physical
> > page number
> > (defined in riscv-spec). PGD.PPN corresponds to the root page table
> > pointer
> > of the address space, i.e. mm->pgd (linux concept).
> > 
> > In CPU/IOMMU TLB, we use asid/vmid to distinguish the address space
> > of
> > process or virtual machine. Due to the limitation of id encoding,
> > it can
> > only represent a part(window) of the address space. S1/S2.PGD.PPN
> > are the
> > root page table's PPNs of the address spaces and S1/S2.PGD.PPN are
> > the
> > unique identifier of the address spaces.
> > 
> > For the CPU SMP system, you can use context switch to perform the
> > necessary
> > software mechanism to ensure that the asid/vmid on all harts is
> > consistent
> > (please refer to the arm64 asid mechanism). In this way, the TLB
> > broadcast
> > invalidation instruction can determine the address space processed
> > on all
> > harts by asid/vmid.
> > 
> > Different from the CPU SMP system, there is no context switch for
> > the
> > DMA-IOMMU system, so the unification with the CPU asid/vmid cannot
> > be
> > guaranteed. So we need a unique identifier for the address space to
> > establish a communication bridge between the TLBs of different
> > systems.
> > 
> > That is PGD.PPN (for virtualization scenarios: S1/S2.PGD.PPN)
> > 
> > current:
> >  sfence.vma  rs1 = vaddr, rs2 = asid
> >  hfence.vvma rs1 = vaddr, rs2 = asid
> >  hfence.gvma rs1 = gaddr, rs2 = vmid
> > 
> > proposed:
> >  sfence.vma  rs1 = vaddr, rs2 = mode:ppn:asid
> >  hfence.vvma rs1 = vaddr, rs2 = mode:ppn:asid
> >  hfence.gvma rs1 = gaddr, rs2 = mode:ppn:vmid
> > 
> >  mode      - broadcast | local
> >  ppn       - the PPN of the address space of the root page table
> >  vmid/asid - the window identifier of the address space
> > 
> > At the Linux Plumber Conference 2019 RISCV-MC, ref:[1], we've
> > showed two
> > IOMMU examples to explain how it work with hardware.
> > 
> > 1) In a lightweight IOMMU system (up to 64 address spaces), the
> > hardware
> >    could directly convert PGD.PPN into DID (IOMMU ASID)
> > 
> > 2) For the PCI ATS scenario, its IO ASID/VMID encoding space can
> > support
> >    a very large number of address spaces. We use two reverse
> > mapping
> >    tables to let the hardware translate S1/S2.PGD.PPN into IO
> > ASID/VMID.
> > 
> > ASYNC BROADCAST SFENCE.VMA
> > ===========================
> > 
> > To support the high latency broadcast sfence.vma operation in the
> > PCI ATS
> > usage scenario, we modify the sfence.vma from synchronous mode to
> > asynchronous mode. (For simpler implementation, if hardware only
> > implement
> > synchronous mode and software still work in asynchronous mode)
> > 
> > To implement the asynchronous mode, 3 features are added:
> >  1) sstatus:TLBI
> >     A "status bit - TLBI" is added to the sstatus register. The
> > TLBI status
> >     bit indicates if there are still outstanding sfence.vma
> > requests on the
> >     current hart.
> >     Value:
> >       1: sfence.vma requests are not completed.
> >       0: all sfence.vma requests completed, request queue is empty.
> > 
> >  2) sstatus:TLBIC
> >     A "control bits - TLBIC" is added to sstatus register. The
> > TLBIC control
> >     bits are controlled by software.
> >     "Write 1" will trigger the current hart check to see if there
> > are still
> >     outstanding sfence.vma requests. If there are unfinished
> > requests, an
> >     interrupt will be generated when the request is completed,
> > notifying the
> >     software that all of the current sfence.vma requests have been
> > completed.
> >     "Write 0" will cause nothing.
> > 
> >  3) supervisor interrupt register (sip & sie):TLBI finish interrupt
> >     A per-hart interrupt is added to supervisor interrupt
> > registers.
> >     When all sfence.vma requests are completed and sstatus:TLBIC
> > has been
> >     triggered, hart will receive a TLBI finish interrupt. Just like
> > timer,
> >     software and external interrupt's definition in sip & sie.
> > 
> > Fake code:
> > 
> > flush_tlb_page(vma, addr) {
> >     asid = cpu_asid(vma->vm_mm);
> >     ppn = PFN_DOWN(vma->vm_mm->pgd);
> > 
> >     sfence.vma (addr, 1|PPN_OFFSET(ppn)|asid); //1. start request
> > 
> >     while(sstatus:TLBI) if (time_out() > 1ms) break; //2. loop check
> > 
> >     while (sstatus:TLBI) {
> >         ...
> >         set sstatus:TLBIC;
> >         wait_TLBI_finish_interrupt(); //3. wait irq, io_schedule
> >     }
> > }
> > 
> > Here we give 2 level check:
> >  1) loop check sstatus:TLBI, CPU could response Interrupt.
> >  2) set sstatus:TLBIC and wait for irq, CPU schedule out for other
> > task.
> > 
> > ACE-DVM Example
> > ===============
> > 
> > Honestly, "broadcasting addr, asid, vmid, S1/S2.PGD.PPN to
> > interconnects"
> > and "ASYNC SFENCE.VMA" could be implemented by ACE-DVM protocol ref
> > [2].
> > 
> > There are 3 types of transactions in DVM:
> > 
> >  - DVM operation
> >    Send all information to the interconnect, including addr, asid,
> >    S1.PGD.PPN, vmid, S2.PGD.PPN.
> > 
> >  - DVM synchronization
> >    Check that all DVM operations have been completed. If not, it
> > will use
> >    state machine to wait DVM complete requests.
> > 
> >  - DVM complete
> >    Return transaction from components, eg: IOMMU. If hart has
> > received all
> >    DVM completes which are triggered by sfence.vma instructions and
> >    "sstatus:TLBIC" has been set, a TLBI finish interrupt is
> > triggered.
> > 
> > (Actually, we do not need to implement the above functions strictly
> >  according to the ACE specification :P )
> > 
> >  1: https://www.linuxplumbersconf.org/event/4/contributions/307/
> >  2: AMBA AXI and ACE Protocol Specification - Distributed Virtual
> > Memory
> >     Transactions
> > 
> > Signed-off-by: Guo Ren <ren_guo@c-sky.com>
> > Reviewed-by: Li Feiteng <feiteng_li@c-sky.com>
> > ---
> >  src/hypervisor.tex |  43 ++++++++-------
> >  src/supervisor.tex | 155
> > +++++++++++++++++++++++++++++++++++++++++------------
> >  2 files changed, 143 insertions(+), 55 deletions(-)
> > 
> > diff --git a/src/hypervisor.tex b/src/hypervisor.tex
> > index 47b90b2..3718819 100644
> > --- a/src/hypervisor.tex
> > +++ b/src/hypervisor.tex
> > @@ -1094,15 +1094,15 @@ The hypervisor extension adds two new
> > privileged fence instructions.
> >  \multicolumn{1}{c|}{opcode} \\
> >  \hline
> >  7 & 5 & 5 & 3 & 5 & 7 \\
> > -HFENCE.GVMA & vmid & gaddr & PRIV & 0 & SYSTEM \\
> > -HFENCE.VVMA & asid & vaddr & PRIV & 0 & SYSTEM \\
> > +HFENCE.GVMA & mode:ppn:vmid & gaddr & PRIV & 0 & SYSTEM \\
> > +HFENCE.VVMA & mode:ppn:asid & vaddr & PRIV & 0 & SYSTEM \\
> >  \end{tabular}
> >  \end{center}
> > 
> >  The hypervisor memory-management fence instructions, HFENCE.GVMA
> > and
> >  HFENCE.VVMA, are valid only in HS-mode when {\tt mstatus}.TVM=0,
> > or in M-mode
> >  (irrespective of {\tt mstatus}.TVM).
> > -These instructions perform a function similar to SFENCE.VMA
> > +These instructions perform a function similar to SFENCE.VMA
> > (broadcast/local)
> >  (Section~\ref{sec:sfence.vma}), except applying to the guest-
> > physical
> >  memory-management data structures controlled by CSR {\tt hgatp}
> > (HFENCE.GVMA)
> >  or the VS-level memory-management data structures controlled by
> > CSR {\tt vsatp}
> > @@ -1136,11 +1136,10 @@ An HFENCE.VVMA instruction applies only to
> > a single virtual machine, identified
> >  by the setting of {\tt hgatp}.VMID when HFENCE.VVMA executes.
> >  \end{commentary}
> > 
> > -When {\em rs2}$\neq${\tt x0}, bits XLEN-1:ASIDMAX of the value
> > held in {\em
> > -rs2} are reserved for future use and should be zeroed by software
> > and ignored
> > -by current implementations.
> > -Furthermore, if ASIDLEN~$<$~ASIDMAX, the implementation shall
> > ignore bits
> > -ASIDMAX-1:ASIDLEN of the value held in {\em rs2}.
> > +When {\em rs2}$\neq${\tt x0}, its bits contain three fields: mode,
> > ppn, asid.
> > +1) mode controls whether HFENCE.VVMA is broadcast or not.
> > +2) ppn is the root page table's PPN of the asid address space.
> > +3) asid is the identifier of the process in the virtual machine.
> > 
> >  \begin{commentary}
> >  Simpler implementations of HFENCE.VVMA can ignore the guest
> > virtual address in
> > @@ -1168,11 +1167,10 @@ physical addresses in PMP address registers
> > (Section~\ref{sec:pmp}) and in page
> >  table entries (Sections \ref{sec:sv32}, \ref{sec:sv39},
> > and~\ref{sec:sv48}).
> >  \end{commentary}
> > 
> > -When {\em rs2}$\neq${\tt x0}, bits XLEN-1:VMIDMAX of the value
> > held in {\em
> > -rs2} are reserved for future use and should be zeroed by software
> > and ignored
> > -by current implementations.
> > -Furthermore, if VMIDLEN~$<$~VMIDMAX, the implementation shall
> > ignore bits
> > -VMIDMAX-1:VMIDLEN of the value held in {\em rs2}.
> > +When {\em rs2}$\neq${\tt x0}, its bits contain three fields: mode,
> > vmid, and ppn.
> > +1) mode controls whether HFENCE.GVMA is broadcast or not.
> > +2) ppn is the root page table's PPN of the vmid address space.
> > +3) vmid is the identifier of the virtual machine.
> > 
> >  \begin{commentary}
> >  Simpler implementations of HFENCE.GVMA can ignore the guest
> > physical address in
> > @@ -1567,21 +1565,22 @@ register.
> >  \subsection{Memory-Management Fences}
> > 
> >  The behavior of the SFENCE.VMA instruction is affected by the
> > current
> > -virtualization mode V.  When V=0, the virtual-address argument is
> > an HS-level
> > -virtual address, and the ASID argument is an HS-level ASID.
> > +virtualization mode V.  When V=0, the rs1 argument is an HS-level
> > +virtual address, and the rs2 argument is an HS-level ASID and root
> > page table's PPN.
> >  The instruction orders stores only to HS-level address-translation 
> > structures
> >  with subsequent HS-level address translations.
> > 
> > -When V=1, the virtual-address argument to SFENCE.VMA is a guest
> > virtual
> > -address within the current virtual machine, and the ASID argument
> > is a VS-level
> > -ASID within the current virtual machine.
> > +When V=1, the rs1 argument to SFENCE.VMA is a guest virtual
> > +address within the current virtual machine, and the rs2 argument
> > is a VS-level
> > +ASID and root page table's PPN within the current virtual machine.
> >  The current virtual machine is identified by the VMID field of CSR
> > {\tt hgatp},
> > -and the effective ASID can be considered to be the combination of
> > this VMID
> > -with the VS-level ASID.
> > +and the effective ASID and root page table's PPN can be considered
> > to be the
> > +combination of this VMID and root page table's PPN with the VS-
> > level ASID and
> > +root page table's PPN.
> >  The SFENCE.VMA instruction orders stores only to the VS-level
> >  address-translation structures with subsequent VS-level address
> > translations
> > -for the same virtual machine, i.e., only when {\tt hgatp}.VMID is
> > the same as
> > -when the SFENCE.VMA executed.
> > +for the same virtual machine, i.e., only when {\tt hgatp}.VMID and
> > {\tt hgatp}.PPN are
> > +the same as when the SFENCE.VMA executed.
> > 
> >  Hypervisor instructions HFENCE.GVMA and HFENCE.VVMA provide
> > additional
> >  memory-management fences to complement SFENCE.VMA.
> > diff --git a/src/supervisor.tex b/src/supervisor.tex
> > index ba3ced5..2877b7a 100644
> > --- a/src/supervisor.tex
> > +++ b/src/supervisor.tex
> > @@ -47,10 +47,12 @@ register keeps track of the processor's current
> > operating state.
> >  \begin{center}
> >  \setlength{\tabcolsep}{4pt}
> >  \scalebox{0.95}{
> > -\begin{tabular}{cWcccccWccccWcc}
> > +\begin{tabular}{cccWcccccWccccWcc}
> >  \\
> >  \instbit{31} &
> > -\instbitrange{30}{20} &
> > +\instbit{30} &
> > +\instbit{29} &
> > +\instbitrange{28}{20} &
> >  \instbit{19} &
> >  \instbit{18} &
> >  \instbit{17} &
> > @@ -66,6 +68,8 @@ register keeps track of the processor's current
> > operating state.
> >  \instbit{0} \\
> >  \hline
> >  \multicolumn{1}{|c|}{SD} &
> > +\multicolumn{1}{|c|}{TLBI} &
> > +\multicolumn{1}{|c|}{TLBIC} &
> >  \multicolumn{1}{c|}{\wpri} &
> >  \multicolumn{1}{c|}{MXR} &
> >  \multicolumn{1}{c|}{SUM} &
> > @@ -82,7 +86,7 @@ register keeps track of the processor's current
> > operating state.
> >  \multicolumn{1}{c|}{\wpri}
> >  \\
> >  \hline
> > -1 & 11 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1 \\
> > +1 & 1 & 1 & 9 & 1 & 1 & 1 & 2 & 2 & 4 & 1 & 1 & 1 & 1 & 3 & 1 & 1
> > \\
> >  \end{tabular}}
> >  \end{center}
> >  }
> > @@ -95,10 +99,12 @@ register keeps track of the processor's current
> > operating state.
> >  {\footnotesize
> >  \begin{center}
> >  \setlength{\tabcolsep}{4pt}
> > -\begin{tabular}{cMFScccc}
> > +\begin{tabular}{cccMFScccc}
> >  \\
> >  \instbit{SXLEN-1} &
> > -\instbitrange{SXLEN-2}{34} &
> > +\instbit{SXLEN-2} &
> > +\instbit{SXLEN-3} &
> > +\instbitrange{SXLEN-4}{34} &
> >  \instbitrange{33}{32} &
> >  \instbitrange{31}{20} &
> >  \instbit{19} &
> > @@ -107,6 +113,8 @@ register keeps track of the processor's current
> > operating state.
> >   \\
> >  \hline
> >  \multicolumn{1}{|c|}{SD} &
> > +\multicolumn{1}{|c|}{TLBI} &
> > +\multicolumn{1}{|c|}{TLBIC} &
> >  \multicolumn{1}{c|}{\wpri} &
> >  \multicolumn{1}{c|}{UXL[1:0]} &
> >  \multicolumn{1}{c|}{\wpri} &
> > @@ -115,7 +123,7 @@ register keeps track of the processor's current
> > operating state.
> >  \multicolumn{1}{c|}{\wpri} &
> >   \\
> >  \hline
> > -1 & SXLEN-35 & 2 & 12 & 1 & 1 & 1 & \\
> > +1 & 1 & 1 & SXLEN-37 & 2 & 12 & 1 & 1 & 1 & \\
> >  \end{tabular}
> >  \begin{tabular}{cWWFccccWcc}
> >  \\
> > @@ -152,6 +160,17 @@ register keeps track of the processor's
> > current operating state.
> >  \label{sstatusreg}
> >  \end{figure*}
> > 
> > +The TLBI (read-only) bit indicates whether any asynchronous sfence.vma
> > operations are
> > +still pending on the hart. A value of 0 means that no
> > sfence.vma
> > +operations are pending; a value of 1 means that
> > sfence.vma operations
> > +are still pending on the hart.
> > +
> > +Writing 1 to the sstatus:TLBIC bit triggers the hardware
> > to check whether
> > +any TLB invalidate operations are still pending. When all
> > operations have
> > +finished, a TLB Invalidate finish interrupt is triggered
> > +(see Section~\ref{sipreg}). Writing 0 to the sstatus:TLBIC bit
> > has no
> > +effect. Reading the sstatus:TLBIC bit always returns 0.
> > +
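
To make the TLBI/TLBIC semantics above concrete, here is a minimal software model in C. It is only a sketch, not hardware or kernel code: `struct hart_model` and the `model_*` helpers are hypothetical names, the bit positions follow the RV64 sstatus figure in this patch (SXLEN-2 and SXLEN-3 with SXLEN = 64), and raising the interrupt immediately when TLBIC is written with nothing pending is an assumption, since the text does not cover that case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the RV64 sstatus figure in the patch (SXLEN = 64). */
#define SXLEN          64
#define SSTATUS_TLBI   (UINT64_C(1) << (SXLEN - 2)) /* read-only: operations pending */
#define SSTATUS_TLBIC  (UINT64_C(1) << (SXLEN - 3)) /* write 1: ask hardware to check */

/* Hypothetical model of one hart; not a real kernel or hardware structure. */
struct hart_model {
    unsigned pending_ops; /* outstanding asynchronous sfence.vma operations */
    bool stlbip;          /* models the sip.STLBIP pending bit */
};

/* Reading sstatus: TLBI reflects pending work, TLBIC always reads as 0. */
uint64_t model_read_sstatus(const struct hart_model *h)
{
    return h->pending_ops ? SSTATUS_TLBI : 0;
}

/* Writing sstatus: only a 1 written to TLBIC has any effect. */
void model_write_sstatus(struct hart_model *h, uint64_t value)
{
    if (!(value & SSTATUS_TLBIC))
        return; /* writing 0 does nothing */
    /* Assumption: if nothing is pending, the finish interrupt is raised at once. */
    if (h->pending_ops == 0)
        h->stlbip = true;
}

/* One asynchronous operation completes; TLBI going 1 -> 0 raises the interrupt. */
void model_complete_op(struct hart_model *h)
{
    if (h->pending_ops > 0 && --h->pending_ops == 0)
        h->stlbip = true;
}
```

In this model, software that issued two fences sees TLBI set until the second `model_complete_op()` call, at which point the modelled STLBIP bit becomes pending.
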
> >  The SPP bit indicates the privilege level at which a hart was
> > executing before
> >  entering supervisor mode.  When a trap is taken, SPP is set to 0
> > if the trap
> >  originated from user mode, or 1 otherwise.  When an SRET
> > instruction
> > @@ -329,8 +348,10 @@ SXLEN-bit read/write register containing
> > interrupt enable bits.
> >  {\footnotesize
> >  \begin{center}
> >  \setlength{\tabcolsep}{4pt}
> > -\begin{tabular}{KcFcFcc}
> > -\instbitrange{SXLEN-1}{10} &
> > +\begin{tabular}{KcFcFcFcc}
> > +\instbitrange{SXLEN-1}{14} &
> > +\instbit{13} &
> > +\instbitrange{12}{10} &
> >  \instbit{9} &
> >  \instbitrange{8}{6} &
> >  \instbit{5} &
> > @@ -339,6 +360,8 @@ SXLEN-bit read/write register containing
> > interrupt enable bits.
> >  \instbit{0} \\
> >  \hline
> >  \multicolumn{1}{|c|}{\wpri} &
> > +\multicolumn{1}{c|}{STLBIP} &
> > +\multicolumn{1}{|c|}{\wpri} &
> >  \multicolumn{1}{c|}{SEIP} &
> >  \multicolumn{1}{c|}{\wpri} &
> >  \multicolumn{1}{c|}{STIP} &
> > @@ -346,7 +369,7 @@ SXLEN-bit read/write register containing
> > interrupt enable bits.
> >  \multicolumn{1}{c|}{SSIP} &
> >  \multicolumn{1}{c|}{\wpri} \\
> >  \hline
> > -SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
> > +SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
> >  \end{tabular}
> >  \end{center}
> >  }
> > @@ -359,8 +382,10 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
> >  {\footnotesize
> >  \begin{center}
> >  \setlength{\tabcolsep}{4pt}
> > -\begin{tabular}{KcFcFcc}
> > -\instbitrange{SXLEN-1}{10} &
> > +\begin{tabular}{KcFcFcFcc}
> > +\instbitrange{SXLEN-1}{14} &
> > +\instbit{13} &
> > +\instbitrange{12}{10} &
> >  \instbit{9} &
> >  \instbitrange{8}{6} &
> >  \instbit{5} &
> > @@ -369,6 +394,8 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
> >  \instbit{0} \\
> >  \hline
> >  \multicolumn{1}{|c|}{\wpri} &
> > +\multicolumn{1}{c|}{STLBIE} &
> > +\multicolumn{1}{|c|}{\wpri} &
> >  \multicolumn{1}{c|}{SEIE} &
> >  \multicolumn{1}{c|}{\wpri} &
> >  \multicolumn{1}{c|}{STIE} &
> > @@ -376,7 +403,7 @@ SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
> >  \multicolumn{1}{c|}{SSIE} &
> >  \multicolumn{1}{c|}{\wpri} \\
> >  \hline
> > -SXLEN-10 & 1 & 3 & 1 & 3 & 1 & 1 \\
> > +SXLEN-14 & 1 & 3 & 1 & 3 & 1 & 3 & 1 & 1 \\
> >  \end{tabular}
> >  \end{center}
> >  }
> > @@ -410,6 +437,12 @@ when the SEIE bit in the {\tt sie} register is
> > clear.  The implementation
> >  should provide facilities to mask, unmask, and query the cause of
> > external
> >  interrupts.
> > 
> > +A supervisor-level TLB Invalidate finish interrupt is pending if
> > the STLBIP bit
> > +in the {\tt sip} register is set.  Supervisor-level TLB Invalidate
> > finish
> > +interrupts are disabled when the STLBIE bit in the {\tt sie}
> > register is clear.
> > +When the hart's TLB invalidate operations have finished, hardware
> > changes the sstatus:TLBI
> > +bit from 1 to 0 and triggers a TLB Invalidate finish interrupt.
> > +
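
A small C sketch of the sip/sie gating described above, using bit 13 as drawn in the two register figures. The macro and function names are hypothetical. Note that the scause table later in this patch assigns exception code 12 to the same interrupt, whereas the existing interrupts use matching numbers (SEIP is bit 9 and code 9), so one of the two values may need adjusting; the sketch simply records both as written.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 13 as shown in the sip and sie figures of the patch. */
#define SIP_STLBIP          (UINT64_C(1) << 13)
#define SIE_STLBIE          (UINT64_C(1) << 13)
/* Exception code listed in the scause table of the patch. */
#define SCAUSE_STLBI_FINISH 12

/*
 * True when the TLB Invalidate finish interrupt is both pending and enabled.
 * Actual delivery additionally depends on the global interrupt enable
 * in sstatus, which this sketch does not model.
 */
bool stlbi_pending_and_enabled(uint64_t sip, uint64_t sie)
{
    return (sip & SIP_STLBIP) != 0 && (sie & SIE_STLBIE) != 0;
}
```
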
> >  \begin{commentary}
> >  The {\tt sip} and {\tt sie} registers are subsets of the {\tt mip}
> > and {\tt
> >  mie} registers.  Reading any field, or writing any writable field,
> > of {\tt
> > @@ -598,7 +631,9 @@ so is only guaranteed to hold supported
> > exception codes.
> >    1         & 5               & Supervisor timer interrupt \\
> >    1         & 6--8            & {\em Reserved} \\
> >    1         & 9               & Supervisor external interrupt \\
> > -  1         & 10--15          & {\em Reserved} \\
> > +  1         & 10--11          & {\em Reserved} \\
> > +  1         & 12              & Supervisor TLBI finish interrupt
> > \\
> > +  1         & 13--15          & {\em Reserved} \\
> >    1         & $\ge$16         & {\em Available for platform use}
> > \\ \hline
> >    0         & 0               & Instruction address misaligned \\
> >    0         & 1               & Instruction access fault \\
> > @@ -884,7 +919,7 @@ provided.
> >  \multicolumn{1}{c|}{opcode} \\
> >  \hline
> >  7 & 5 & 5 & 3 & 5 & 7 \\
> > -SFENCE.VMA & asid & vaddr & PRIV & 0 & SYSTEM \\
> > +SFENCE.VMA & mode:ppn:asid & vaddr & LOCAL & 0 & SYSTEM \\
> >  \end{tabular}
> >  \end{center}
> > 
> > @@ -899,21 +934,70 @@ from that hart to the memory-management data
> > structures.
> >  Further details on the behavior of this instruction are
> >  described in Section~\ref{virt-control} and Section~\ref{pmp-
> > vmem}.
> > 
> > +SFENCE.VMA is defined as an asynchronously completing instruction,
> > which means
> > +that the TLB operation is not guaranteed to have completed when the
> > instruction retires.
> > +Software needs to check sstatus:TLBI to determine whether all TLB operations
> > have completed.
> > +The sstatus:TLBI bit is described in Section~\ref{sstatus}. When hardware
> > changes the
> > +sstatus:TLBI bit from 1 to 0, the TLB Invalidate finish interrupt
> > is
> > +triggered.
> > +
> >  \begin{commentary}
> > -The SFENCE.VMA is used to flush any local hardware caches related
> > to
> > +The SFENCE.VMA is used to flush any local/remote hardware caches
> > related to
> >  address translation.  It is specified as a fence rather than a TLB
> >  flush to provide cleaner semantics with respect to which
> > instructions
> >  are affected by the flush operation and to support a wider variety
> > of
> >  dynamic caching structures and memory-management schemes. 
> > SFENCE.VMA
> >  is also used by higher privilege levels to synchronize page table
> > -writes and the address translation hardware.
> > +writes and the address translation hardware. A mode bit
> > determines whether
> > +sfence.vma is broadcast on the interconnect or not.
> >  \end{commentary}
> > 
> > -SFENCE.VMA orders only the local hart's implicit references to the
> > -memory-management data structures.
> > +\begin{figure}[h!]
> > +{\footnotesize
> > +\begin{center}
> > +\begin{tabular}{c@{}E@{}K}
> > +\instbit{31} &
> > +\instbitrange{30}{9} &
> > +\instbitrange{8}{0} \\
> > +\hline
> > +\multicolumn{1}{|c|}{{\tt MODE}} &
> > +\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
> > +\multicolumn{1}{|c|}{{\tt ASID}} \\
> > +\hline
> > +1 & 22 & 9 \\
> > +\end{tabular}
> > +\end{center}
> > +}
> > +\vspace{-0.1in}
> > +\caption{RV32 sfence.vma rs2 format.}
> > +\label{rv32satp}
> > +\end{figure}
> > +
> > +\begin{figure}[h!]
> > +{\footnotesize
> > +\begin{center}
> > +\begin{tabular}{@{}S@{}T@{}U}
> > +\instbitrange{63}{60} &
> > +\instbitrange{59}{16} &
> > +\instbitrange{15}{0} \\
> > +\hline
> > +\multicolumn{1}{|c|}{{\tt MODE}} &
> > +\multicolumn{1}{|c|}{{\tt PPN (root page table)}} &
> > +\multicolumn{1}{|c|}{{\tt ASID}} \\
> > +\hline
> > +4 & 44 & 16 \\
> > +\end{tabular}
> > +\end{center}
> > +}
> > +\vspace{-0.1in}
> > +\caption{RV64 sfence.vma rs2 format. Of the MODE field, only the highest
> > bit (bit 63) is
> > +valid; the others are reserved.}
> > +\label{rv64satp}
> > +\end{figure}
> > 
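
The two rs2 layouts in the figures above can be sketched as C helpers that pack the mode, root page table PPN and ASID into one register value. The function names are hypothetical; the field widths (1/22/9 for RV32, 4/44/16 for RV64 with only bit 63 of MODE used) come from the figures, and the remaining RV64 MODE bits are left as zero because the caption marks them reserved.

```c
#include <stdint.h>

/* RV64: MODE[63:60] (only bit 63 used), PPN[59:16], ASID[15:0]. */
uint64_t sfence_rs2_rv64(int broadcast, uint64_t root_ppn, uint16_t asid)
{
    const uint64_t ppn_mask = (UINT64_C(1) << 44) - 1;

    return (broadcast ? UINT64_C(1) << 63 : 0) /* 1: broadcast, 0: local */
         | ((root_ppn & ppn_mask) << 16)
         | (uint64_t)asid;
}

/* RV32: MODE[31], PPN[30:9], ASID[8:0]. */
uint32_t sfence_rs2_rv32(int broadcast, uint32_t root_ppn, uint32_t asid)
{
    const uint32_t ppn_mask  = (UINT32_C(1) << 22) - 1;
    const uint32_t asid_mask = (UINT32_C(1) << 9) - 1;

    return (broadcast ? UINT32_C(1) << 31 : 0)
         | ((root_ppn & ppn_mask) << 9)
         | (asid & asid_mask);
}
```

For example, a broadcast fence for ASID 5 whose root page table has PPN 0x80000 would use `sfence_rs2_rv64(1, 0x80000, 5)`.
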
> >  \begin{commentary}
> > -Consequently, other harts must be notified separately when the
> > +The highest bit of the mode field controls sfence.vma behavior:
> > 1 for broadcast, 0 for local.
> > +In local mode, other harts must be notified separately
> > when the
> >  memory-management data structures have been modified.
> >  One approach is to use 1)
> >  a local data fence to ensure local writes are visible globally,
> > then
> > @@ -928,8 +1012,17 @@ modified for a single address mapping (i.e.,
> > one page or superpage), {\em rs1}
> >  can specify a virtual address within that mapping to effect a
> > translation
> >  fence for that mapping only.  Furthermore, for the common case
> > that the
> >  translation data structures have only been modified for a single
> > address-space
> > -identifier, {\em rs2} can specify the address space.  The behavior
> > of
> > -SFENCE.VMA depends on {\em rs1} and {\em rs2} as follows:
> > +identifier, {\em rs2} can specify the address space in {\tt
> > satp} format,
> > +which includes the ASID and the root page table's PPN.
> > +
> > +\begin{commentary}
> > +We use the ASID and the root page table's PPN to identify the address space,
> > and the format
> > +stored in rs2 is similar to that of {\tt satp}, described in
> > Section~\ref{sec:satp}.
> > +The ASID is used by local harts, and the root page table's PPN of that ASID
> > is used by
> > +other TLB systems, e.g., an IOMMU.
> > +\end{commentary}
> > +
> > +The behavior of SFENCE.VMA depends on {\em rs1} and {\em rs2} as
> > follows:
> > 
> >  \begin{itemize}
> >  \item If {\em rs1}={\tt x0} and {\em rs2}={\tt x0}, the fence
> > orders all
> > @@ -939,23 +1032,18 @@ SFENCE.VMA depends on {\em rs1} and {\em
> > rs2} as follows:
> >        all reads and writes made to any level of the page tables,
> > but only
> >        for the address space identified by integer register {\em
> > rs2}.
> >        Accesses to {\em global} mappings (see
> > Section~\ref{sec:translation})
> > -      are not ordered.
> > +      are not ordered. The mode field in rs2 determines whether the
> > fence is broadcast or local.
> >  \item If {\em rs1}$\neq${\tt x0} and {\em rs2}={\tt x0}, the fence
> > orders
> >        only reads and writes made to the leaf page table entry
> > corresponding
> >        to the virtual address in {\em rs1}, for all address spaces.
> >  \item If {\em rs1}$\neq${\tt x0} and {\em rs2}$\neq${\tt x0}, the
> > fence
> >        orders only reads and writes made to the leaf page table
> > entry
> >        corresponding to the virtual address in {\em rs1}, for the
> > address
> > -      space identified by integer register {\em rs2}.
> > +      space identified by integer register {\em rs2}. The mode
> > field in rs2
> > +      determines whether the fence is broadcast or local.
> >        Accesses to global mappings are not ordered.
> >  \end{itemize}
> > 
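
The four rs1/rs2 cases in the list above can be summarised by a small C classifier. This is a sketch with hypothetical names (`struct fence_scope`, `sfence_scope`), using the RV64 rs2 layout from this patch. Treating a fence with rs2 = x0 as local is an assumption: the patch only mentions the mode field for the rs2 != x0 cases.

```c
#include <stdbool.h>
#include <stdint.h>

#define RS2_MODE_BROADCAST (UINT64_C(1) << 63) /* RV64 layout from the patch */

struct fence_scope {
    bool all_addresses; /* rs1 = x0: any level of the page tables */
    bool all_spaces;    /* rs2 = x0: all address spaces */
    bool broadcast;     /* mode bit set in a non-x0 rs2 */
};

/* rs1_reg and rs2_reg are register numbers (0 = x0); rs2_value is the rs2 content. */
struct fence_scope sfence_scope(unsigned rs1_reg, unsigned rs2_reg,
                                uint64_t rs2_value)
{
    struct fence_scope s;

    s.all_addresses = (rs1_reg == 0);
    s.all_spaces    = (rs2_reg == 0);
    /* Assumption: with rs2 = x0 there is no mode field, so the fence is local. */
    s.broadcast     = !s.all_spaces && (rs2_value & RS2_MODE_BROADCAST) != 0;
    return s;
}
```
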
> > -When {\em rs2}$\neq${\tt x0}, bits SXLEN-1:ASIDMAX of the value
> > held in {\em
> > -rs2} are reserved for future use and should be zeroed by software
> > and ignored
> > -by current implementations.  Furthermore, if ASIDLEN~$<$~ASIDMAX,
> > the
> > -implementation shall ignore bits ASIDMAX-1:ASIDLEN of the value
> > held in {\em
> > -rs2}.
> > -
> >  \begin{commentary}
> >  Simpler implementations can ignore the virtual address in {\em
> > rs1} and
> >  the ASID value in {\em rs2} and always perform a global fence.
> > @@ -994,7 +1082,7 @@ can execute the same SFENCE.VMA instruction
> > while a different ASID is loaded
> >  into {\tt satp}, provided the next time {\tt satp} is loaded with
> > the recycled
> >  ASID, it is simultaneously loaded with the new page table.
> > 
> > -\item If the implementation does not provide ASIDs, or software
> > chooses to
> > +\item If the implementation does not provide ASIDs and PPNs, or
> > software chooses to
> >  always use ASID 0, then after every {\tt satp} write, software
> > should execute
> >  SFENCE.VMA with {\em rs1}={\tt x0}.  In the common case that no
> > global
> >  translations have been modified, {\em rs2} should be set to a
> > register other than
> > @@ -1003,13 +1091,14 @@ not flushed.
> > 
> >  \item If software modifies a non-leaf PTE, it should execute
> > SFENCE.VMA with
> >  {\em rs1}={\tt x0}.  If any PTE along the traversal path had its G
> > bit set,
> > -{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to
> > the ASID for
> > -which the translation is being modified.
> > +{\em rs2} must be {\tt x0}; otherwise, {\em rs2} should be set to
> > the ASID and
> > +root page table's PPN for which the translation is being modified.
> > 
> >  \item If software modifies a leaf PTE, it should execute
> > SFENCE.VMA with {\em
> >  rs1} set to a virtual address within the page.  If any PTE along
> > the traversal
> >  path had its G bit set, {\em rs2} must be {\tt x0}; otherwise,
> > {\em rs2}
> > -should be set to the ASID for which the translation is being
> > modified.
> > +should be set to the ASID and root page table's PPN for which the
> > translation
> > +is being modified.
> > 
> >  \item For the special cases of increasing the permissions on a
> > leaf PTE and
> >  changing an invalid PTE to a valid leaf, software may choose to
> > execute

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-19 12:35 [RFC PATCH V1] riscv-privileged: Add broadcast mode to sfence.vma guoren
2019-09-19 16:04 ` [tech-privileged] " Andrew Waterman
2019-09-20  0:13   ` Guo Ren
2019-09-20  2:27   ` Benjamin Herrenschmidt