* RFC: New API for PPC for vcpu mmu access
From: Yoder Stuart-B08248 @ 2011-02-02 20:33 UTC
  To: kvm-ppc, kvm, qemu-devel

Below is a proposal for a new API for PPC to allow KVM clients
to set MMU state in a vcpu.

BookE processors have one or more software managed TLBs and
currently there is no mechanism for Qemu to initialize
or access them.  This is needed for normal initialization
as well as debug.

There are 4 APIs:
   
-KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type
 of MMU with KVM-- the type determines the size and format
 of the data in the other APIs

-KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all
 TLBs in the vcpu

-KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture
 specifies the format of the MMU data passed in

-KVM_PPC_GET_TLB allows searching, reading a specific TLB entry,
 or iterating over an entire TLB.  Some TLBs have an unspecified
 geometry, hence the need to iterate in order to dump the TLB.
 The Power architecture specifies the format of the MMU data

Feedback welcome.

Thanks,
Stuart Yoder

------------------------------------------------------------------

KVM PPC MMU API
---------------

User space can query whether the APIs to access the vcpu MMU
are available with the KVM_CHECK_EXTENSION ioctl using the
KVM_CAP_PPC_MMU argument.

If the KVM_CAP_PPC_MMU return value is non-zero, the following
APIs are available (a capability-check sketch follows the list):

   KVM_PPC_SET_MMU_TYPE
   KVM_PPC_INVALIDATE_TLB
   KVM_PPC_SET_TLBE
   KVM_PPC_GET_TLB
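
A minimal sketch of this capability check (KVM_CHECK_EXTENSION is the
standard KVM ioctl; KVM_CAP_PPC_MMU is only proposed here and is
assumed to be defined, e.g. in a local header):

      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>

      static int ppc_mmu_api_available(void)
      {
            int ret, kvm_fd = open("/dev/kvm", O_RDWR);

            if (kvm_fd < 0)
                  return 0;
            /* capability checks are issued on the /dev/kvm fd */
            ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_MMU);
            close(kvm_fd);
            return ret > 0;   /* non-zero: the four APIs are available */
      }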


KVM_PPC_SET_MMU_TYPE
--------------------

Capability: KVM_CAP_PPC_SET_MMU_TYPE
Architectures: powerpc
Type: vcpu ioctl
Parameters: __u32 mmu_type (in)
Returns: 0 if specified MMU type is supported, else -1

Sets the MMU type.  Valid input values are:
   BOOKE_NOHV   0x1
   BOOKE_HV     0x2

A return value of 0x0 indicates that KVM supports
the specified MMU type.
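
As an illustration, negotiation might look like the sketch below.  The
ioctl and the BOOKE_* constants are the proposal's own; vcpu_fd is an
assumed open vcpu file descriptor, passing the argument by pointer is
an assumption (the proposal does not spell out the calling convention),
and includes are omitted:

      __u32 mmu_type = BOOKE_HV;

      if (ioctl(vcpu_fd, KVM_PPC_SET_MMU_TYPE, &mmu_type) < 0) {
            /* fall back to the non-hypervisor MMU format */
            mmu_type = BOOKE_NOHV;
            if (ioctl(vcpu_fd, KVM_PPC_SET_MMU_TYPE, &mmu_type) < 0)
                  abort();   /* no mutually supported MMU type */
      }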

KVM_PPC_INVALIDATE_TLB
----------------------

Capability: KVM_CAP_PPC_MMU
Architectures: powerpc
Type: vcpu ioctl
Parameters: none
Returns: 0 on success, -1 on error

Invalidates all TLB entries in all TLBs of the vcpu.
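
A one-line sketch, under the same assumptions as above (vcpu_fd is an
assumed open vcpu file descriptor):

      if (ioctl(vcpu_fd, KVM_PPC_INVALIDATE_TLB, 0) < 0)
            perror("KVM_PPC_INVALIDATE_TLB");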

KVM_PPC_SET_TLBE
----------------

Capability: KVM_CAP_PPC_MMU
Architectures: powerpc
Type: vcpu ioctl
Parameters:
        For mmu types BOOKE_NOHV and BOOKE_HV: struct kvm_ppc_booke_mmu (in)
Returns: 0 on success, -1 on error

Sets an MMU entry in a virtual CPU.

For mmu types BOOKE_NOHV and BOOKE_HV:

      To write a TLB entry, set the mas fields of kvm_ppc_booke_mmu 
      as per the Power architecture.

      struct kvm_ppc_booke_mmu {
            union {
                  __u64 mas0_1;
                  struct {
                        __u32 mas0;
                        __u32 mas1;
                  };
            };
            __u64 mas2;
            union {
                  __u64 mas7_3;
                  struct {
                        __u32 mas7;
                        __u32 mas3;
                  };
            };
            union {
                  __u64 mas5_6;
                  struct {
                        __u32 mas5;
                        __u32 mas6;
                  };
            };
            __u32 mas8;
      };

      For an MMU type of BOOKE_NOHV, the mas5 and mas8 fields
      in kvm_ppc_booke_mmu are present but not supported.
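
A sketch of writing one TLB entry with the proposed ioctl.  The
MAS0_*/MAS1_*/MAS3_* field-packing helpers are assumed for readability
and are not part of this proposal; the encodings themselves come from
the Power architecture.  This example maps effective address 0 to
physical address 0 as a valid 4KB supervisor RWX page in TLB1:

      struct kvm_ppc_booke_mmu tlbe = {0};

      tlbe.mas0 = MAS0_TLBSEL(1) | MAS0_ESEL(0);  /* TLB1, entry 0 */
      tlbe.mas1 = MAS1_VALID | MAS1_TSIZE(TSIZE_4K);
      tlbe.mas2 = 0;                     /* EPN 0, no W/I/M/G/E bits */
      tlbe.mas3 = MAS3_SR | MAS3_SW | MAS3_SX;    /* RPN 0, RWX perms */
      tlbe.mas7 = 0;                     /* physical bits above 32 */

      if (ioctl(vcpu_fd, KVM_PPC_SET_TLBE, &tlbe) < 0)
            perror("KVM_PPC_SET_TLBE");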


KVM_PPC_GET_TLB
---------------

Capability: KVM_CAP_PPC_MMU
Architectures: powerpc
Type: vcpu ioctl
Parameters: struct kvm_ppc_get_mmu (in/out)
Returns: 0 on success
         -1 on error
         errno = ENOENT when iterating and there are no more entries to read

Reads an MMU entry from a virtual CPU.

      struct kvm_ppc_get_mmu {
            /* in */
            void *mmu;
            __u32 flags;
                  /* a bitmask of flags to the API:
                   *     TLB_READ_FIRST   0x1
                   *     TLB_SEARCH       0x2
                   */
            /* out */
            __u32 max_entries;
      };

For mmu types BOOKE_NOHV and BOOKE_HV:

      The "void *mmu" field of kvm_ppc_get_mmu points to 
        a struct of type "struct kvm_ppc_booke_mmu".

      If TLBnCFG[NENTRY] > 0 and TLBnCFG[ASSOC] > 0, the TLB has
      a known number of entries and associativity.  The mas0[ESEL]
      and mas2[EPN] fields specify which entry to read.
      
      If TLBnCFG[NENTRY] == 0, the number of TLB entries is
      undefined, and this API can be used to iterate over
      the entire TLB selected with TLBSEL in mas0.
      
      -To read a TLB entry:
      
         set the following fields in the mmu struct (struct kvm_ppc_booke_mmu):
            flags=0
            mas0[TLBSEL] // select which TLB is being read
            mas0[ESEL]   // select which entry is being read
            mas2[EPN]    // effective address 
      
         On return the following fields are updated as per the Power architecture:
            mas0
            mas1 
            mas2 
            mas3 
            mas7 
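
         For example (a sketch; ea is an assumed effective address,
         dump_tlbe an assumed pretty-printer, and the MAS helpers as in
         the KVM_PPC_SET_TLBE example above):

            struct kvm_ppc_booke_mmu tlbe = {0};
            struct kvm_ppc_get_mmu req = { .mmu = &tlbe, .flags = 0 };

            tlbe.mas0 = MAS0_TLBSEL(0) | MAS0_ESEL(3); /* TLB0, way 3 */
            tlbe.mas2 = ea & MAS2_EPN;  /* selects the set, per the arch */

            if (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &req) == 0)
                  dump_tlbe(&tlbe);     /* mas0..mas7 hold the entry */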
      
      -To iterate over a TLB (read all entries):
      
        To start an iteration sequence, set the following fields in
        the mmu struct (struct kvm_ppc_booke_mmu):
            flags=TLB_READ_FIRST
            mas0[TLBSEL]  // select which TLB is being read
      
        On return the following fields are updated:
            mas0           // set as per Power arch
            mas1           // set as per Power arch
            mas2           // set as per Power arch
            mas3           // set as per Power arch
            mas7           // set as per Power arch
            max_entries    // Contains upper limit on number of entries that may
                           // be returned. A value of 0xffffffff means there is
                           // no meaningful upper bound.
         
        For subsequent calls to the API the following output fields must
        be passed back into the API unmodified:
            flags
            mas0
            mas2
      
        A return value of -1 with errno set to ENOENT indicates
        that there are no more entries to be read.
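
        A sketch of the iteration sequence just described (assumptions
        as in the examples above; <errno.h> and error handling trimmed):

            struct kvm_ppc_booke_mmu tlbe = {0};
            struct kvm_ppc_get_mmu req = {
                  .mmu   = &tlbe,
                  .flags = TLB_READ_FIRST,
            };

            tlbe.mas0 = MAS0_TLBSEL(1);       /* walk TLB1 */

            for (;;) {
                  if (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &req) < 0) {
                        if (errno != ENOENT)
                              perror("KVM_PPC_GET_TLB");
                        break;            /* ENOENT: no more entries */
                  }
                  dump_tlbe(&tlbe);
                  /* flags, mas0 and mas2 now carry the iteration
                   * state; leave them untouched for the next call */
            }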
      
      -To search for a TLB entry:

         To search for a TLB entry, set the following fields in
         the mmu struct (struct kvm_ppc_booke_mmu):
            flags=TLB_SEARCH
            mas2[EPN]    // effective address to search for
            mas6         // set as per the Power arch
            mas5         // set as per the Power arch
      
         On return, the following fields are updated as per the Power architecture:
            mas0
            mas1 
            mas2 
            mas3 
            mas7 
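
         A sketch of the search case (pid, the MAS2_EPN mask and the
         MAS6_* helpers are assumed for illustration; whether a match
         was found is reported in the returned mas fields per the
         Power architecture, e.g. MAS1[V]):

            struct kvm_ppc_booke_mmu tlbe = {0};
            struct kvm_ppc_get_mmu req = {
                  .mmu   = &tlbe,
                  .flags = TLB_SEARCH,
            };

            tlbe.mas2 = ea & MAS2_EPN;        /* address to look up */
            tlbe.mas6 = MAS6_SPID(pid) | MAS6_SAS(0);

            if (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &req) == 0)
                  dump_tlbe(&tlbe);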

* Re: RFC: New API for PPC for vcpu mmu access
From: Alexander Graf @ 2011-02-02 21:33 UTC
  To: Yoder Stuart-B08248; +Cc: kvm-ppc, kvm, qemu-devel


On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:

> Below is a proposal for a new API for PPC to allow KVM clients
> to set MMU state in a vcpu.
> 
> BookE processors have one or more software managed TLBs and
> currently there is no mechanism for Qemu to initialize
> or access them.  This is needed for normal initialization
> as well as debug.
> 
> There are 4 APIs:
> 
> -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type
> of MMU with KVM-- the type determines the size and format
> of the data in the other APIs

This should be done through the PVR hint in sregs, no? Usually a single CPU type only has a single MMU type.

> -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all
> TLBs in the vcpu
> 
> -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture
> specifies the format of the MMU data passed in

This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?

Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.


> -KVM_PPC_GET_TLB allows searching, reading a specific TLB entry,
> or iterating over an entire TLB.  Some TLBs may have an unspecified
> geometry and thus the need to be able to iterate in order
> to dump the TLB.  The Power architecture specifies the format
> of the MMU data
> 
> Feedback welcome.
> 
> Thanks,
> Stuart Yoder
> 
> ------------------------------------------------------------------
> 
> KVM PPC MMU API
> ---------------
> 
> User space can query whether the APIs to access the vcpu mmu
> is available with the KVM_CHECK_EXTENSION API using
> the KVM_CAP_PPC_MMU argument.
> 
> If the KVM_CAP_PPC_MMU return value is non-zero it specifies that
> the following APIs are available:
> 
>   KVM_PPC_SET_MMU_TYPE
>   KVM_PPC_INVALIDATE_TLB
>   KVM_PPC_SET_TLBE
>   KVM_PPC_GET_MMU
> 
> 
> KVM_PPC_SET_MMU_TYPE
> --------------------
> 
> Capability: KVM_CAP_PPC_SET_MMU_TYPE
> Architectures: powerpc
> Type: vcpu ioctl
> Parameters: __u32 mmu_type (in)
> Returns: 0 if specified MMU type is supported, else -1
> 
> Sets the MMU type.  Valid input values are:
>   BOOKE_NOHV   0x1
>   BOOKE_HV     0x2
> 
> A return value of 0x0 indicates that KVM supports
> the specified MMU type.

We should probably return some failure code when a PVR gets set that KVM doesn't understand. That would automatically give us that functionality.

> 
> KVM_PPC_INVALIDATE_TLB
> ----------------------
> 
> Capability: KVM_CAP_PPC_MMU
> Architectures: powerpc
> Type: vcpu ioctl
> Parameters: none
> Returns: 0 on success, -1 on error
> 
> Invalidates all TLB entries in all TLBs of the vcpu.

The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.

> 
> KVM_PPC_SET_TLBE
> ----------------
> 
> Capability: KVM_CAP_PPC_MMU
> Architectures: powerpc
> Type: vcpu ioctl
> Parameters:
>        For mmu types BOOKE_NOHV and BOOKE_HV : struct kvm_ppc_booke_mmu (in)
> Returns: 0 on success, -1 on error
> 
> Sets an MMU entry in a virtual CPU.
> 
> For mmu types BOOKE_NOHV and BOOKE_HV:
> 
>      To write a TLB entry, set the mas fields of kvm_ppc_booke_mmu 
>      as per the Power architecture.
> 
>      struct kvm_ppc_booke_mmu {
>            union {
>                  __u64 mas0_1;
>                  struct {
>                        __u32 mas0;
>                        __u32 mas1;
>                  };
>            };
>            __u64 mas2;
>            union {
>                  __u64 mas7_3      
>                  struct {
>                        __u32 mas7;
>                        __u32 mas3;
>                  };
>            };
>            union {
>                  __u64 mas5_6      
>                  struct {
>                        __u64 mas5;
>                        __u64 mas6;
>                  };
>            }
>            __u32 mas8;
>      };
> 
>      For a mmu type of BOOKE_NOHV, the mas5 and mas8 fields
>      in kvm_ppc_booke_mmu are present but not supported.

Haven't fully made up my mind on the tlb entry structure yet. Maybe something like

struct kvm_ppc_booke_tlbe {
    __u64 data[8];
};

would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).

> 
> 
> KVM_PPC_GET_TLB
> ---------------
> 
> Capability: KVM_CAP_PPC_MMU
> Architectures: powerpc
> Type: vcpu ioctl
> Parameters: struct kvm_ppc_get_mmu (in/out)
> Returns: 0 on success
>         -1 on error
>         errno = ENOENT when iterating and there are no more entries to read
> 
> Reads an MMU entry from a virtual CPU.
> 
>      struct kvm_ppc_get_mmu {
>            /* in */
>                void *mmu;
>            __u32 flags;
>                  /* a bitmask of flags to the API */
>                    /*     TLB_READ_FIRST   0x1      */
>                    /*     TLB_SEARCH       0x2      */
>            /* out */
>            __u32 max_entries;
>      };
> 
> For mmu types BOOKE_NOHV and BOOKE_HV :
> 
>      The "void *mmu" field of kvm_ppc_get_mmu points to 
>        a struct of type "struct kvm_ppc_booke_mmu".
> 
>      If TLBnCFG[NENTRY] > 0 and TLBnCFG[ASSOC] > 0, the TLB has
>      of known number of entries and associativity.  The mas0[ESEL]
>      and mas2[EPN] fields specify which entry to read.
> 
>      If TLBnCFG[NENTRY] == 0 the number of TLB entries is 
>      undefined and this API can be used to iterate over
>      the entire TLB selected with TLBSEL in mas0.
> 
>      -To read a TLB entry:
> 
>         set the following fields in the mmu struct (struct kvm_ppc_booke_mmu):
>            flags=0
>            mas0[TLBSEL] // select which TLB is being read
>            mas0[ESEL]   // select which entry is being read
>            mas2[EPN]    // effective address 
> 
>         On return the following fields are updated as per the Power architecture:
>            mas0
>            mas1 
>            mas2 
>            mas3 
>            mas7 
> 
>      -To iterate over a TLB (read all entries):
> 
>        To start an interation sequence, set the following fields in
>        the mmu struct (struct kvm_ppc_booke_mmu)
>            flags=TLB_READ_FIRST
>            mas0[TLBSEL]  // select which TLB is being read
> 
>        On return the following fields are updated:
>            mas0           // set as per Power arch
>            mas1           // set as per Power arch
>            mas2           // set as per Power arch
>            mas3           // set as per Power arch
>            mas7           // set as per Power arch
>            max_entries    // Contains upper limit on number of entries that may
>                           // be returned. A value of 0xffffffff means there is
>                           // no meaningful upper bound.
> 
>        For subsequent calls to the API the following output fields must
>        be passed back into the API unmodified:
>            flags
>            mas0
>            mas2
> 
>        A return value of -ENOENT indicates that there are no more
>        entries to be read.
> 
>      -To search for TLB entry
> 
>         To search for TLB entry, set the following fields in
>         the mmu struct (struct kvm_ppc_booke_mmu):
>            flags=TLB_SEARCH
>            mas2[EPN]    // effective address to search for
>            mas6         // set as per the Power arch
>            mas5         // set as per the Power arch
> 
>         On return, the following fields are updated as per the Power architecture:
>            mas0
>            mas1 
>            mas2 
>            mas3 
>            mas7 

Userspace should only really need the TLB entries for

  1) Debugging
  2) Migration

So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?


Alex

* Re: RFC: New API for PPC for vcpu mmu access
From: Scott Wood @ 2011-02-02 22:08 UTC
  To: Alexander Graf; +Cc: Yoder Stuart-B08248, kvm, kvm-ppc, qemu-devel

On Wed, 2 Feb 2011 22:33:41 +0100
Alexander Graf <agraf@suse.de> wrote:

> 
> On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
> 
> > Below is a proposal for a new API for PPC to allow KVM clients
> > to set MMU state in a vcpu.
> > 
> > BookE processors have one or more software managed TLBs and
> > currently there is no mechanism for Qemu to initialize
> > or access them.  This is needed for normal initialization
> > as well as debug.
> > 
> > There are 4 APIs:
> > 
> > -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type
> > of MMU with KVM-- the type determines the size and format
> > of the data in the other APIs
> 
> This should be done through the PVR hint in sregs, no? Usually a single CPU type only has a single MMU type.

Well, for one, we don't have sregs or a PVR hint on Book E yet. :-)

But also, there could be differing levels of support -- e.g. on e500mc,
we have no plans to support exposing the hardware virtualization
features in a nested manner (nor am I sure that it's reasonably
possible).  But if someone does it, that would be a change in the
interface between Qemu and KVM to allow the extra fields to be set,
with no change in PVR.

Likewise, a new chip could introduce new capabilities, but still be
capable of working the old way.

Plus, basing it on PVR means Qemu needs to be updated every time
there's a new chip with a new PVR.

> > -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all
> > TLBs in the vcpu
> > 
> > -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture
> > specifies the format of the MMU data passed in
> 
> This seems to fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?

I have no idea what things will look like 10 years down the road, but
currently e500mc has 576 entries (512 TLB0, 64 TLB1).

> Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.

No, we don't know the size (or necessarily even the structure) of the
TLB.  KVM may provide a guest TLB that is larger than what hardware has,
as a cache to reduce the number of TLB misses that have to go to the
guest (we do this now in another hypervisor).

Plus sometimes it's just simpler -- why bother halving the size of the
guest TLB when running on e500v1?

> > KVM_PPC_INVALIDATE_TLB
> > ----------------------
> > 
> > Capability: KVM_CAP_PPC_MMU
> > Architectures: powerpc
> > Type: vcpu ioctl
> > Parameters: none
> > Returns: 0 on success, -1 on error
> > 
> > Invalidates all TLB entries in all TLBs of the vcpu.
> 
> The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.

The other way to look at it is that there's no need for a reset
function if all the state is properly settable. :-)

Which we want anyway for debugging (and migration, though I wonder if
anyone would actually use that with embedded hardware).

> Haven't fully made up my mind on the tlb entry structure yet. Maybe something like
> 
> struct kvm_ppc_booke_tlbe {
>     __u64 data[8];
> };
> 
> would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).

If it's not exposed to userspace, how is userspace going to
interpret/fill in the data?

As for kernel headers, I think qemu needs to provide its own copy, like
qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
for programs which rely on recent kernel APIs (which Qemu+KVM tends
to do already).

> Userspace should only really need the TLB entries for
> 
>   1) Debugging
>   2) Migration
> 
> So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?

The third case is reset/init, which can be performance sensitive
(especially in failover setups).

And debugging can require single translations, and can be a
performance issue if you need to toss around several kilobytes of data
per translation while a debugger is doing, e.g., an automated pattern
of single step plus inspect memory.

-Scott

* RE: RFC: New API for PPC for vcpu mmu access
  2011-02-02 21:33   ` [Qemu-devel] " Alexander Graf
@ 2011-02-02 22:34     ` Yoder Stuart-B08248
  -1 siblings, 0 replies; 112+ messages in thread
From: Yoder Stuart-B08248 @ 2011-02-02 22:34 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, qemu-devel



> -----Original Message-----
> From: Alexander Graf [mailto:agraf@suse.de]
> Sent: Wednesday, February 02, 2011 3:34 PM
> To: Yoder Stuart-B08248
> Cc: kvm-ppc@vger.kernel.org; kvm@vger.kernel.org; qemu-devel@nongnu.org
> Subject: Re: RFC: New API for PPC for vcpu mmu access
> 
> 
> On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
> 
> > Below is a proposal for a new API for PPC to allow KVM clients to set
> > MMU state in a vcpu.
> >
> > BookE processors have one or more software managed TLBs and currently
> > there is no mechanism for Qemu to initialize or access them.  This is
> > needed for normal initialization as well as debug.
> >
> > There are 4 APIs:
> >
> > -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type of MMU
> > with KVM-- the type determines the size and format of the data in the
> > other APIs
> 
> This should be done through the PVR hint in sregs, no? Usually a single CPU
> type only has a single MMU type.
> 
> > -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all TLBs in the
> > vcpu
> >
> > -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture specifies
> > the format of the MMU data passed in
> 
>> This seems too fine-grained. I'd prefer a list of all TLB entries to be
> pushed in either direction. What's the foreseeable number of TLB entries
> within the next 10 years?
> 
> Having the whole stack available would make the sync with qemu easier and
> also allows us to only do a single ioctl for all the TLB management. Thanks
> to the PVR we know the size of the TLB, so we don't have to shove that
> around.

Yes, we thought about that approach but the idea here, as Scott 
described, was to provide an API that could work if user space
is unaware of the geometry of the TLB.

Take a look at Power ISA Version 2.06.1 (on power.org) at the definition
of TLBnCFG in Book E.  The NENTRY and ASSOC fields now have defined
values that allow TLB geometries that cannot be described by the
TLBnCFG registers.
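
For illustration, this is roughly the decode userspace does today for
the classic geometry (a sketch only; the masks below are the e500
ones, and it is exactly this decode that stops working once NENTRY
and ASSOC take on the new meanings):

      /* Classic TLBnCFG decode -- e500-style field layout:
       * NENTRY in the low 12 bits, ASSOC in the top byte. */
      #define TLBnCFG_N_ENTRY   0x00000fff
      #define TLBnCFG_ASSOC     0xff000000

      static void decode_tlbncfg(__u32 cfg, __u32 *nentry, __u32 *assoc)
      {
            *nentry = cfg & TLBnCFG_N_ENTRY;
            *assoc  = (cfg & TLBnCFG_ASSOC) >> 24;
      }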

I think the use case where this API would be used the most
would be from a gdb stub that needed to look up an effective
address.
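
For example, the stub could do a targeted lookup without pulling the
whole TLB.  This is purely illustrative -- the proposal doesn't pin
down the search semantics yet, so tlbsx-like behavior and e500 MAS
bit positions are assumed, and ea/pid/vcpu_fd come from the stub:

      struct kvm_ppc_booke_mmu mmu = {
            .mas2 = ea & ~0xfffULL,        /* EPN to search for */
            .mas6 = (pid & 0x3fff) << 16,  /* SPID to search under */
      };

      /* ask KVM to search the guest TLB for one EA instead of
       * copying every entry out */
      if (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &mmu) == 0 &&
          (mmu.mas1 & 0x80000000))         /* MAS1[V] -- match found */
            pa = (mmu.mas3 & 0xfffff000) | (ea & 0xfff);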

Stuart

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-02 22:08     ` [Qemu-devel] " Scott Wood
@ 2011-02-03  9:19       ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-03  9:19 UTC (permalink / raw)
  To: Scott Wood; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel


On 02.02.2011, at 23:08, Scott Wood wrote:

> On Wed, 2 Feb 2011 22:33:41 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>> 
>> On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
>> 
>>> Below is a proposal for a new API for PPC to allow KVM clients
>>> to set MMU state in a vcpu.
>>> 
>>> BookE processors have one or more software managed TLBs and
>>> currently there is no mechanism for Qemu to initialize
>>> or access them.  This is needed for normal initialization
>>> as well as debug.
>>> 
>>> There are 4 APIs:
>>> 
>>> -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type
>>> of MMU with KVM-- the type determines the size and format
>>> of the data in the other APIs
>> 
>> This should be done through the PVR hint in sregs, no? Usually a single CPU type only has a single MMU type.
> 
> Well, for one, we don't have sregs or a PVR hint on Book E yet. :-)

Ah, right. The BookE code just passes its host PVR to the guest. :)

> But also, there could be differing levels of support -- e.g. on e500mc,
> we have no plans to support exposing the hardware virtualization
> features in a nested manner (nor am I sure that it's reasonably
> possible).  But if someone does it, that would be a change in the
> interface between Qemu and KVM to allow the extra fields to be set,
> with no change in PVR.
> 
> Likewise, a new chip could introduce new capabilities, but still be
> capable of working the old way.
> 
> Plus, basing it on PVR means Qemu needs to be updated every time
> there's a new chip with a new PVR.

Ok, convinced. We need a way to choose an mmu model.

> 
>>> -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all
>>> TLBs in the vcpu
>>> 
>>> -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture
>>> specifies the format of the MMU data passed in
>> 
>> This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?
> 
> I have no idea what things will look like 10 years down the road, but
> currently e500mc has 576 entries (512 TLB0, 64 TLB1).

That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA.

> 
>> Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.
> 
> No, we don't know the size (or necessarily even the structure) of the
> TLB.  KVM may provide a guest TLB that is larger than what hardware has,
> as a cache to reduce the number of TLB misses that have to go to the
> guest (we do this now in another hypervisor).
> 
> Plus sometimes it's just simpler -- why bother halving the size of the
> guest TLB when running on e500v1?

Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :). Maybe this even needs to be runtime switchable, in case you boot with u-boot in the guest, load a kernel and the kernel activates some PV extensions.
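
Something like this maybe (completely made up, just to show the shape):

struct kvm_ppc_mmu_config {
    __u32 mmu_type;         /* BOOKE_NOHV, BOOKE_HV, ... */
    __u32 num_tlbs;
    __u32 tlb_nentries[4];  /* e.g. 512 for TLB0, 64 for TLB1 */
    __u32 tlb_assoc[4];     /* associativity, 0 = fully assoc */
};

/* vcpu ioctl; calling it again at runtime would renegotiate,
 * e.g. when the guest kernel enables PV extensions */
struct kvm_ppc_mmu_config cfg = {
    .mmu_type     = BOOKE_NOHV,
    .num_tlbs     = 2,
    .tlb_nentries = { 512, 64 },
    .tlb_assoc    = { 4, 0 },   /* TLB0 4-way, TLB1 fully assoc */
};
ioctl(vcpu_fd, KVM_PPC_SET_MMU_CONFIG, &cfg);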

> 
>>> KVM_PPC_INVALIDATE_TLB
>>> ----------------------
>>> 
>>> Capability: KVM_CAP_PPC_MMU
>>> Architectures: powerpc
>>> Type: vcpu ioctl
>>> Parameters: none
>>> Returns: 0 on success, -1 on error
>>> 
>>> Invalidates all TLB entries in all TLBs of the vcpu.
>> 
>> The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.
> 
> The other way to look at it is that there's no need for a reset
> function if all the state is properly settable. :-)

You make it sound as if it were hard to implement a reset function in qemu :). Really, that's where it belongs.

> 
> Which we want anyway for debugging (and migration, though I wonder if
> anyone would actually use that with embedded hardware).

We certainly should not close the door on migration either way. So all the state has to be 100% user space receivable.

> 
>> Haven't fully made up my mind on the tlb entry structure yet. Maybe something like
>> 
>> struct kvm_ppc_booke_tlbe {
>>    __u64 data[8];
>> };
>> 
>> would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).
> 
> If it's not exposed to userspace, how is userspace going to
> interpret/fill in the data?

It can overlay cast according to the MMU type. So userspace still has to know the layout of the tlbe, but it doesn't have to be defined with a huge number of anonymous unions. An alternative would be to explicitly define each mmu type's entries:

struct kvm_ppc_booke_tlbe {
    union {
        struct {
            ...
        } tlbe_e500;
        struct {
            ...
        } tlbe_e500mc;
        struct {
            ...
        } tlbe_e500mc_hv;
        __u64 pad[x];
    };
};
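
The overlay cast itself would then be something like (the e500 view
below is made up, only the mechanics matter):

struct tlbe_e500_view {
    __u32 mas1;
    __u32 mas2;
    __u32 mas3;
    __u32 mas7;
};

static __u32 tlbe_epn(struct kvm_ppc_booke_tlbe *e, int mmu_type)
{
    /* reinterpret the opaque words per the negotiated MMU type */
    if (mmu_type == BOOKE_NOHV) {
        struct tlbe_e500_view *v = (struct tlbe_e500_view *)e->data;
        return v->mas2 & 0xfffff000;   /* EPN lives in MAS2 */
    }
    return 0;
}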

> As for kernel headers, I think qemu needs to provide its own copy, like
> qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
> for programs which rely on recent kernel APIs (which Qemu+KVM tends
> to do already).

Yeah, tedious story...

> 
>> Userspace should only really need the TLB entries for
>> 
>>  1) Debugging
>>  2) Migration
>> 
>> So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?
> 
> The third case is reset/init, which can be performance sensitive
> (especially in failover setups).

This is an acceleration. The generic approach needs to come first (generic set of the full TLB). Then we can measure if it really does take too long and add another flush call.

> And debugging can require single translations, and can be a
> performance issue if you need to toss around several kilobytes of data
> per translation, and a debugger is doing e.g. an automated pattern of
> single step plus inspect memory.

Yeah, that one's tricky. Usually the way the memory resolver in qemu works is as follows:

 * kvm goes to qemu
 * qemu fetches all mmu and register data from kvm
 * qemu runs its mmu resolution function as if the target was emulated

So the "normal" way would be to fetch _all_ TLB entries from KVM, shove them into env and implement the MMU in qemu (at least enough of it to enable debugging). No other target modifies this code path. But no other target needs to copy > 30kb of data only to get the mmu data either :).
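
In rough pseudo-qemu -- assuming an iterate flavour of KVM_PPC_GET_TLB
with the index passed in mas0 and an error return past the last entry,
none of which the RFC defines yet:

int i;

for (i = 0; ; i++) {
    struct kvm_ppc_booke_mmu mmu = { .mas0 = i };

    if (kvm_vcpu_ioctl(env, KVM_PPC_GET_TLB, &mmu) < 0)
        break;                  /* ran past the last entry */
    env->tlb_entries[i] = mmu;  /* hypothetical storage in env */
}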

So I guess we need both. We need a full get call to facilitate migration and savevm and we can then accelerate it using a direct lookup call. Here too, I'd prefer to see the generic one first. But I do agree that it's a lot of data with high frequency, so it might make sense to expose both on the same CAP.


Alex


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-02 22:34     ` [Qemu-devel] " Yoder Stuart-B08248
@ 2011-02-03  9:29       ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-03  9:29 UTC (permalink / raw)
  To: Yoder Stuart-B08248; +Cc: kvm-ppc, kvm, qemu-devel


On 02.02.2011, at 23:34, Yoder Stuart-B08248 wrote:

> 
> 
>> -----Original Message-----
>> From: Alexander Graf [mailto:agraf@suse.de]
>> Sent: Wednesday, February 02, 2011 3:34 PM
>> To: Yoder Stuart-B08248
>> Cc: kvm-ppc@vger.kernel.org; kvm@vger.kernel.org; qemu-devel@nongnu.org
>> Subject: Re: RFC: New API for PPC for vcpu mmu access
>> 
>> 
>> On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
>> 
>>> Below is a proposal for a new API for PPC to allow KVM clients to set
>>> MMU state in a vcpu.
>>> 
>>> BookE processors have one or more software managed TLBs and currently
>>> there is no mechanism for Qemu to initialize or access them.  This is
>>> needed for normal initialization as well as debug.
>>> 
>>> There are 4 APIs:
>>> 
>>> -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type of MMU
>>> with KVM-- the type determines the size and format of the data in the
>>> other APIs
>> 
>> This should be done through the PVR hint in sregs, no? Usually a single CPU
>> type only has a single MMU type.
>> 
>>> -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all TLBs in the
>>> vcpu
>>> 
>>> -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture specifies
>>> the format of the MMU data passed in
>> 
>> This seems too fine-grained. I'd prefer a list of all TLB entries to be
>> pushed in either direction. What's the foreseeable number of TLB entries
>> within the next 10 years?
>> 
>> Having the whole stack available would make the sync with qemu easier and
>> also allows us to only do a single ioctl for all the TLB management. Thanks
>> to the PVR we know the size of the TLB, so we don't have to shove that
>> around.
> 
> Yes, we thought about that approach but the idea here, as Scott 
> described, was to provide an API that could work if user space
> is unaware of the geometry of the TLB.

Userspace shouldn't be unaware of the TLB, that's the whole point :). In principle, all state should be fetchable from userspace - so it has to know the geometry.

> Take a look at Power ISA Version 2.06.1 (on power.org) at the definition
> of TLBnCFG in Book E.  The NENTRY and ASSOC fields now have defined
> values that allow TLB geometries that cannot be described by the
> TLBnCFG registers.

It's certainly not easy to assemble all the required information in userspace, but we need to do so nevertheless - having the state available simply buys us a lot in terms of flexibility.

> I think the use case where this API would be used the most
> would be from a gdb stub that needed to look up an effective
> address.

I agree. As I stated in my previous mail, there's probably good rationale to explicitly tune that path. That doesn't mean that we have to leave out the generic one. It should only be an acceleration.

After all, this whole flexibility thing with all the potential possibilities is KVM's strong point. We should not close the doors on those :).


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
@ 2011-02-04 22:33         ` Scott Wood
  0 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-04 22:33 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

On Thu, 3 Feb 2011 10:19:06 +0100
Alexander Graf <agraf@suse.de> wrote:

> On 02.02.2011, at 23:08, Scott Wood wrote:
> > On Wed, 2 Feb 2011 22:33:41 +0100
> > Alexander Graf <agraf@suse.de> wrote:
> >> This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?
> > 
> > I have no idea what things will look like 10 years down the road, but
> > currently e500mc has 576 entries (512 TLB0, 64 TLB1).
> 
> That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA.

And that's only with the standard hardware TLB size.  On Topaz (our
standalone hypervisor) we increased the guest's TLB0 to 16384 entries.
It speeds up some workloads nicely, but invalidation-heavy loads get
hurt.

> >> Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.
> > 
> > No, we don't know the size (or necessarily even the structure) of the
> > TLB.  KVM may provide a guest TLB that is larger than what hardware has,
> > as a cache to reduce the number of TLB misses that have to go to the
> > guest (we do this now in another hypervisor).
> > 
> > Plus sometimes it's just simpler -- why bother halving the size of the
> > guest TLB when running on e500v1?
> 
> Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :).

Maybe, though keeping it in KVM means we can change it whenever we want
without having to sync up Qemu and worry about backward compatibility.

Same-as-hardware TLB geometry with a Qemu-specified number of sets is
probably good enough for the forseeable future, though.  There were
some other schemes we considered a while back for Topaz, but we ended
up just going with a larger version of what's in hardware.

> Maybe this even needs to be potentially runtime switchable, in case
> you boot off with u-boot in the guest, load a kernel and the kernel
> activates some PV extensions.

U-Boot should be OK with it -- the increased TLB size is
architecturally valid, and U-boot doesn't do much with TLB0 anyway.

If we later have true PV extensions such as a page table, that'd be
another matter.

> >> The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.
> > 
> > The other way to look at it is that there's no need for a reset
> > function if all the state is properly settable. :-)
> 
> You make it sound as if it was hard to implement a reset function in qemu :). Really, that's where it belongs.

Sorry, I misread "reset function in qemu" as "reset ioctl in KVM".

This is meant to be used by a qemu reset function.  If there's a 
full-tlb set, that could be used instead, though it'd be slower.  With
the API as proposed it's needed to clear the slate before you set the
individual entries you want.

> > Which we want anyway for debugging (and migration, though I wonder if
> > anyone would actually use that with embedded hardware).
> 
> We certainly should not close the door on migration either way. So all the state has to be 100% user space receivable.

Oh, I agree -- or I wouldn't have even mentioned it. :-)

I just wouldn't make it the primary optimization concern at this point.

> >> Haven't fully made up my mind on the tlb entry structure yet. Maybe something like
> >> 
> >> struct kvm_ppc_booke_tlbe {
> >>    __u64 data[8];
> >> };
> >> 
> >> would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).
> > 
> > If it's not exposed to userspace, how is userspace going to
> > interpret/fill in the data?
> 
> It can overlay cast according to the MMU type.

How's that different from backing the void pointer up with a different
struct depending on the MMU type?  We weren't proposing unions.

A fixed array does mean you wouldn't have to worry about whether qemu
supports the more advanced struct format if fields are added --
you can just unconditionally write it, as long as it's backwards
compatible.  Unless you hit the limit of the pre-determined array size,
that is.  And if that gets made higher for future expansion, that's
even more data that has to get transferred, before it's really needed.

> > As for kernel headers, I think qemu needs to provide its own copy, like
> > qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
> > for programs which rely on recent kernel APIs (which Qemu+KVM tends
> > to do already).
> 
> Yeah, tedious story...

I guess it's come up before?

> >> Userspace should only really need the TLB entries for
> >> 
> >>  1) Debugging
> >>  2) Migration
> >> 
> >> So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?
> > 
> > The third case is reset/init, which can be performance sensitive
> > (especially in failover setups).
> 
> This is an acceleration. The generic approach needs to come first (generic set of the full TLB). Then we can measure if it really does take too long and add another flush call.

The API as proposed can do a full TLB set (if you start with
invalidate), and a full TLB get (by iterating).  So it's an
optimization decision in either direction.

If we do decide to mandate a standard geometry TLB, just with
settable size, then doing a full get/set has a simplicity advantage.
The iteration approach was intended to preserve flexibility of
implementation.  And then for optimization, we could add an interface
to get/set a single entry, that doesn't need to support iteration.

-Scott


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-04 22:33         ` [Qemu-devel] " Scott Wood
@ 2011-02-07 15:43           ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-07 15:43 UTC (permalink / raw)
  To: Scott Wood; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel


On 04.02.2011, at 23:33, Scott Wood wrote:

> On Thu, 3 Feb 2011 10:19:06 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>> On 02.02.2011, at 23:08, Scott Wood wrote:
>>> On Wed, 2 Feb 2011 22:33:41 +0100
>>> Alexander Graf <agraf@suse.de> wrote:
>>>> This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?
>>> 
>>> I have no idea what things will look like 10 years down the road, but
>>> currently e500mc has 576 entries (512 TLB0, 64 TLB1).
>> 
>> That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA.
> 
> And that's only with the standard hardware TLB size.  On Topaz (our
> standalone hypervisor) we increased the guest's TLB0 to 16384 entries.
> It speeds up some workloads nicely, but invalidation-heavy loads get
> hurt.

Yup - I do a similar trick for Book3S. Just that the TLB is implementation dependent anyways and mostly hidden from the OS :).

> 
>>>> Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.
>>> 
>>> No, we don't know the size (or necessarily even the structure) of the
>>> TLB.  KVM may provide a guest TLB that is larger than what hardware has,
>>> as a cache to reduce the number of TLB misses that have to go to the
>>> guest (we do this now in another hypervisor).
>>> 
>>> Plus sometimes it's just simpler -- why bother halving the size of the
>>> guest TLB when running on e500v1?
>> 
>> Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :).
> 
> Maybe, though keeping it in KVM means we can change it whenever we want
> without having to sync up Qemu and worry about backward compatibility.

Quite the contrary - you have to worry more about backward compatibility. If we implement a new feature that doesn't work on old kernels, we can just tell qemu to not work on those old versions. For the kernel interfaces, we have to keep supporting old userspace.

> Same-as-hardware TLB geometry with a Qemu-specified number of sets is
> probably good enough for the foreseeable future, though.  There were
> some other schemes we considered a while back for Topaz, but we ended
> up just going with a larger version of what's in hardware.

Sounds good. Not sure how we'd handle page tables in this whole scheme yet, but I guess we really should take one step at a time.

> 
>> Maybe this even needs to be potentially runtime switchable, in case
>> you boot off with u-boot in the guest, load a kernel and the kernel
>> activates some PV extensions.
> 
> U-Boot should be OK with it -- the increased TLB size is
> architecturally valid, and U-Boot doesn't do much with TLB0 anyway.
> 
> If we later have true PV extensions such as a page table, that'd be
> another matter.

Yeah. Or imagine we realize that we should make the TLB size dynamic, so that non-PV aware guest OSs get the exact same behavior as on real hardware (because they're buggy for example) while PV aware guests can bump up the TLB size. But really, let's worry about that later.

> 
>>>> The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.
>>> 
>>> The other way to look at it is that there's no need for a reset
>>> function if all the state is properly settable. :-)
>> 
>> You make it sound as if it was hard to implement a reset function in qemu :). Really, that's where it belongs.
> 
> Sorry, I misread "reset function in qemu" as "reset ioctl in KVM".
> 
> This is meant to be used by a qemu reset function.  If there's a 
> full-tlb set, that could be used instead, though it'd be slower.  With
> the API as proposed it's needed to clear the slate before you set the
> individual entries you want.

Yeah, let's just go for a full TLB get/set and optimize specific slow parts later. It's easier to fine-tune stuff that's slow than to miss out on functionality because we were trying to be too clever from the start.

> 
>>> Which we want anyway for debugging (and migration, though I wonder if
>>> anyone would actually use that with embedded hardware).
>> 
>> We certainly should not close the door on migration either way. So all the state has to be 100% user space receivable.
> 
> Oh, I agree -- or I wouldn't have even mentioned it. :-)
> 
> I just wouldn't make it the primary optimization concern at this point.

No, not at all - no optimization for it. But keeping the door open :).

> 
>>>> Haven't fully made up my mind on the tlb entry structure yet. Maybe something like
>>>> 
>>>> struct kvm_ppc_booke_tlbe {
>>>>   __u64 data[8];
>>>> };
>>>> 
>>>> would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).
>>> 
>>> If it's not exposed to userspace, how is userspace going to
>>> interpret/fill in the data?
>> 
>> It can overlay cast according to the MMU type.
> 
> How's that different from backing the void pointer up with a different
> struct depending on the MMU type?  We weren't proposing unions.
> 
> A fixed array does mean you wouldn't have to worry about whether qemu
> supports the more advanced struct format if fields are added --
> you can just unconditionally write it, as long as it's backwards
> compatible.  Unless you hit the limit of the pre-determined array size,
> that is.  And if that gets made higher for future expansion, that's
> even more data that has to get transferred, before it's really needed.

Yes, it is. And I don't see how we could easily avoid it. Maybe just pass in a random __user pointer that we directly write to from kernel space and tell qemu how big and what type a tlb entry is?

struct request_ppc_tlb {
    int tlb_type;                /* MMU/TLB entry format */
    int tlb_entries;             /* number of entries in tlb_data */
    uint64_t __user *tlb_data;   /* userspace buffer holding the entries */
};

in qemu:

struct request_ppc_tlb req;

req.tlb_data = qemu_malloc(PPC_TLB_SIZE_MAX);
r = do_ioctl(REQUEST_PPC_TLB, &req);
if (r == -ENOMEM) {
    cpu_abort(env, "TLB too big");
}

switch (req.tlb_type) {
case PPC_TLB_xxx:
    copy_reg_to_tlb_for_xxx(env, req.tlb_data);
}

something like this. Then we should be flexible enough for the foreseeable future and make it possible for kernel space to switch MMU modes in case we need that.
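
A sketch of the matching kernel side of such a "get" (purely illustrative -- the vcpu->arch fields and the capacity check are assumptions, not existing KVM code; it presumes userspace passes its buffer capacity in tlb_entries):

    static long kvmppc_get_tlb(struct kvm_vcpu *vcpu,
                               struct request_ppc_tlb __user *argp)
    {
        struct request_ppc_tlb req;

        if (copy_from_user(&req, argp, sizeof(req)))
            return -EFAULT;

        /* Userspace buffer too small for this vcpu's TLB? */
        if (req.tlb_entries < vcpu->arch.tlb_entries)
            return -ENOMEM;

        if (copy_to_user(req.tlb_data, vcpu->arch.tlb_backing,
                         vcpu->arch.tlb_entries * sizeof(u64)))
            return -EFAULT;

        /* Tell qemu what it actually got. */
        req.tlb_type = vcpu->arch.tlb_type;
        req.tlb_entries = vcpu->arch.tlb_entries;
        return copy_to_user(argp, &req, sizeof(req)) ? -EFAULT : 0;
    }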

> 
>>> As for kernel headers, I think qemu needs to provide its own copy, like
>>> qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
>>> for programs which rely on recent kernel APIs (which Qemu+KVM tends
>>> to do already).
>> 
>> Yeah, tedious story...
> 
> I guess it's come up before?

It has and has never been really resolved I think. Hits me every other day on book3s.

> 
>>>> Userspace should only really need the TLB entries for
>>>> 
>>>> 1) Debugging
>>>> 2) Migration
>>>> 
>>>> So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?
>>> 
>>> The third case is reset/init, which can be performance sensitive
>>> (especially in failover setups).
>> 
>> This is an acceleration. The generic approach needs to come first (generic set of the full TLB). Then we can measure if it really does take too long and add another flush call.
> 
> The API as proposed can do a full TLB set (if you start with
> invalidate), and a full TLB get (by iterating).  So it's an
> optimization decision in either direction.

Would you really want to loop through 16k entries, doing an ioctl for each? Then performance really would always be an issue.

> If we do decide to mandate a standard geometry TLB, just with
> settable size, then doing a full get/set has a simplicity advantage.
> The iteration approach was intended to preserve flexibility of
> implementation.  And then for optimization, we could add an interface
> to get/set a single entry, that doesn't need to support iteration.

It's more than simplicity. It's also speed, because you save the individual ioctl on each entry. I would really prefer we tackle it with a full-on tlb get/set first and then put the very flexible one on top, because to me the full-on approach feels like the more generic one. I'm very open to adding an individual tlb get/set and maybe even a "kvm, please translate EA x to RA y" ioctl. But those should come after we cover the big hammer that just copies everything.


Alex


^ permalink raw reply	[flat|nested] 112+ messages in thread

* RE: RFC: New API for PPC for vcpu mmu access
  2011-02-07 15:43           ` [Qemu-devel] " Alexander Graf
@ 2011-02-07 16:40             ` Yoder Stuart-B08248
  -1 siblings, 0 replies; 112+ messages in thread
From: Yoder Stuart-B08248 @ 2011-02-07 16:40 UTC (permalink / raw)
  To: Alexander Graf, Wood Scott-B07421; +Cc: kvm-ppc, kvm, qemu-devel


> > A fixed array does mean you wouldn't have to worry about whether qemu
> > supports the more advanced struct format if fields are added -- you
> > can just unconditionally write it, as long as it's backwards
> > compatible.  Unless you hit the limit of the pre-determined array
> > size, that is.  And if that gets made higher for future expansion,
> > that's even more data that has to get transferred, before it's really
> > needed.
> 
> Yes, it is. And I don't see how we could easily avoid it. Maybe just pass
> in a random __user pointer that we directly write to from kernel space and
> tell qemu how big and what type a tlb entry is?
> 
> struct request_ppc_tlb {
>     int tlb_type;
>     int tlb_entries;
>     uint64_t __user *tlb_data;
> };
> 
> in qemu:
> 
> struct request_ppc_tlb req;
> 
> req.tlb_data = qemu_malloc(PPC_TLB_SIZE_MAX);
> r = do_ioctl(REQUEST_PPC_TLB, &req);
> if (r == -ENOMEM) {
>     cpu_abort(env, "TLB too big");
> }
> 
> switch (req.tlb_type) {
> case PPC_TLB_xxx:
>     copy_reg_to_tlb_for_xxx(env, req.tlb_data);
> }
> 
> something like this. Then we should be flexible enough for the foreseeable
> future and make it possible for kernel space to switch MMU modes in case we
> need that.

Suggested change to this would be to have Qemu set tlb_type as 
an _input_ argument.   If KVM supports it, that type gets used,
else an error is returned.    This would allow Qemu to tell
the kernel what type of MMU it is prepared to support.   Without
this Qemu would just have to error out if the type returned is
unknown.
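
Concretely, the negotiation might look like this on the qemu side
(illustrative only; PPC_TLB_BOOKE_NOHV and REQUEST_PPC_TLB_SET are
placeholder names):

    struct request_ppc_tlb req = {
        .tlb_type = PPC_TLB_BOOKE_NOHV,  /* the format qemu understands */
    };

    /* KVM either adopts the requested type or fails the call. */
    if (do_ioctl(REQUEST_PPC_TLB_SET, &req) < 0) {
        cpu_abort(env, "KVM does not support the requested MMU type");
    }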

Stuart

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-07 16:40             ` [Qemu-devel] " Yoder Stuart-B08248
@ 2011-02-07 16:49               ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-07 16:49 UTC (permalink / raw)
  To: Yoder Stuart-B08248; +Cc: Wood Scott-B07421, kvm-ppc, kvm, qemu-devel


On 07.02.2011, at 17:40, Yoder Stuart-B08248 wrote:

> 
>>> A fixed array does mean you wouldn't have to worry about whether qemu
>>> supports the more advanced struct format if fields are added -- you
>>> can just unconditionally write it, as long as it's backwards
>>> compatible.  Unless you hit the limit of the pre-determined array
>>> size, that is.  And if that gets made higher for future expansion,
>>> that's even more data that has to get transferred, before it's really
>>> needed.
>> 
>> Yes, it is. And I don't see how we could easily avoid it. Maybe just pass
>> in a random __user pointer that we directly write to from kernel space and
>> tell qemu how big and what type a tlb entry is?
>> 
>> struct request_ppc_tlb {
>>    int tlb_type;
>>    int tlb_entries;
>>    uint64_t __user *tlb_data;
>> };
>> 
>> in qemu:
>> 
>> struct request_ppc_tlb req;
>> 
>> req.tlb_data = qemu_malloc(PPC_TLB_SIZE_MAX);
>> r = do_ioctl(REQUEST_PPC_TLB, &req);
>> if (r == -ENOMEM) {
>>    cpu_abort(env, "TLB too big");
>> }
>> 
>> switch (req.tlb_type) {
>> case PPC_TLB_xxx:
>>    copy_reg_to_tlb_for_xxx(env, req.tlb_data);
>> }
>> 
>> something like this. Then we should be flexible enough for the foreseeable
>> future and make it possible for kernel space to switch MMU modes in case we
>> need that.
> 
> Suggested change to this would be to have Qemu set tlb_type as 
> an _input_ argument.   If KVM supports it, that type gets used,
> else an error is returned.    This would allow Qemu to tell
> the kernel what type of MMU it is prepared to support.   Without
> this Qemu would just have to error out if the type returned is
> unknown.

Yes, we could use the same struct for get and set. On set, it could transfer the mmu type, on get it could tell userspace the mmu type.


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-03  9:19       ` [Qemu-devel] " Alexander Graf
@ 2011-02-07 17:13         ` Avi Kivity
  -1 siblings, 0 replies; 112+ messages in thread
From: Avi Kivity @ 2011-02-07 17:13 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

On 02/03/2011 11:19 AM, Alexander Graf wrote:
> >
> >  I have no idea what things will look like 10 years down the road, but
> >  currently e500mc has 576 entries (512 TLB0, 64 TLB1).
>
> That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA.

You could have an ioctl to translate addresses (x86 had KVM_TRANSLATE or 
similar), or have the TLB stored in user memory, so there is no need to 
transfer it (on the other hand, you have to re-validate it every time 
you peek at it).
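
For comparison, the existing x86 ioctl uses this uapi struct (a BookE
analogue would need address-space/PID inputs and MAS-style outputs
instead):

    /* From the existing KVM API: ioctl(vcpu_fd, KVM_TRANSLATE, &tr) */
    struct kvm_translation {
        /* in */
        __u64 linear_address;

        /* out */
        __u64 physical_address;
        __u8  valid;
        __u8  writeable;
        __u8  usermode;
        __u8  pad[5];
    };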

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 112+ messages in thread

* RE: RFC: New API for PPC for vcpu mmu access
  2011-02-07 17:13         ` [Qemu-devel] " Avi Kivity
@ 2011-02-07 17:30           ` Yoder Stuart-B08248
  -1 siblings, 0 replies; 112+ messages in thread
From: Yoder Stuart-B08248 @ 2011-02-07 17:30 UTC (permalink / raw)
  To: Avi Kivity, Alexander Graf; +Cc: Wood Scott-B07421, kvm-ppc, kvm, qemu-devel



> -----Original Message-----
> From: kvm-ppc-owner@vger.kernel.org [mailto:kvm-ppc-owner@vger.kernel.org]
> On Behalf Of Avi Kivity
> Sent: Monday, February 07, 2011 11:14 AM
> To: Alexander Graf
> Cc: Wood Scott-B07421; Yoder Stuart-B08248; kvm-ppc@vger.kernel.org;
> kvm@vger.kernel.org; qemu-devel@nongnu.org
> Subject: Re: RFC: New API for PPC for vcpu mmu access
> 
> On 02/03/2011 11:19 AM, Alexander Graf wrote:
> > >
> > >  I have no idea what things will look like 10 years down the road,
> > > but  currently e500mc has 576 entries (512 TLB0, 64 TLB1).
> >
> > That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we
> want to transfer every time qemu feels like resolving an EA.
> 
> You could have an ioctl to translate addresses (x86 had KVM_TRANSLATE or
> similar), or have the TLB stored in user memory, so there is no need to
> transfer it (on the other hand, you have to re-validate it every time you
> peek at it).

The most convenient and flexible thing for Power ISA Book III-E, I think,
will be something that operates like a TLB search instruction.  Inputs
are 'address space' and 'process id' and outputs are in which TLB the
entry was found and all the components of a TLB entry:
   address space
   pid
   entry number
   ea
   rpn
   guest state
   permissions flags
   attributes (WIMGE)

Since all of those fields are architected in MAS registers, in the previous
proposal we just proposed to return several 32-bit fields (one per MAS)
that use the architected layout instead of inventing a brand new
structure defining these fields.
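
In rough uapi terms, such a search might look like the following (field
names are placeholders that mirror the architected MAS layout, not a
settled proposal):

   /* Hypothetical tlbsx-style lookup: AS + PID + EA in, MAS images out. */
   struct kvm_ppc_tlb_search {
         /* in */
         __u32 as;      /* address space */
         __u32 pid;     /* process id */
         __u64 ea;      /* effective address to search for */

         /* out: architected MAS register images for the matching entry */
         __u32 mas0;    /* which TLB, entry number */
         __u32 mas1;    /* valid bit, TID, TS, page size */
         __u64 mas2;    /* EPN and WIMGE attributes */
         __u64 mas7_3;  /* RPN and permission flags */
   };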

Stuart

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-07 16:49               ` [Qemu-devel] " Alexander Graf
@ 2011-02-07 18:52                 ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-07 18:52 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Yoder Stuart-B08248, Wood Scott-B07421, kvm-ppc, kvm, qemu-devel

On Mon, 7 Feb 2011 17:49:51 +0100
Alexander Graf <agraf@suse.de> wrote:

> 
> On 07.02.2011, at 17:40, Yoder Stuart-B08248 wrote:
> 
> > Suggested change to this would be to have Qemu set tlb_type as 
> > an _input_ argument.   If KVM supports it, that type gets used,
> > else an error is returned.    This would allow Qemu to tell
> > the kernel what type of MMU it is prepared to support.   Without
> > this Qemu would just have to error out if the type returned is
> > unknown.
> 
> Yes, we could use the same struct for get and set. On set, it could transfer the mmu type, on get it could tell userspace the mmu type.

What happens if a get is done before the first set, and there are
multiple MMU type options for this hardware, with differing entry sizes?

Qemu would have to know beforehand how large to make the buffer.

We could say that going forward, it's expected that qemu will do a
TLB set (either a full one, or a lightweight alternative) after
creating a vcpu.  For compatibility, if this doesn't happen before the
vcpu is run, the TLB is created and initialized as it is today, but
no new Qemu-visible features will be enabled that way.

If Qemu does a get without ever doing some set operation, it should
get an error, since the requirement to do a set is added at the same
time as the get API.
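
The kernel-side rule could be as simple as (illustrative;
tlb_configured is an assumed flag, not existing code):

    /* Reject a TLB get until userspace has done some set operation. */
    if (!vcpu->arch.tlb_configured)
        return -EINVAL;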

-Scott


^ permalink raw reply	[flat|nested] 112+ messages in thread

* RE: RFC: New API for PPC for vcpu mmu access
  2011-02-07 18:52                 ` [Qemu-devel] " Scott Wood
  (?)
@ 2011-02-07 19:56                   ` Yoder Stuart-B08248
  -1 siblings, 0 replies; 112+ messages in thread
From: Yoder Stuart-B08248 @ 2011-02-07 19:56 UTC (permalink / raw)
  To: Wood Scott-B07421, Alexander Graf; +Cc: kvm-ppc, kvm, qemu-devel



> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Monday, February 07, 2011 12:52 PM
> To: Alexander Graf
> Cc: Yoder Stuart-B08248; Wood Scott-B07421; kvm-ppc@vger.kernel.org;
> kvm@vger.kernel.org; qemu-devel@nongnu.org
> Subject: Re: RFC: New API for PPC for vcpu mmu access
> 
> On Mon, 7 Feb 2011 17:49:51 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
> >
> > On 07.02.2011, at 17:40, Yoder Stuart-B08248 wrote:
> >
> > > Suggested change to this would be to have Qemu set tlb_type as
> > > an _input_ argument.   If KVM supports it, that type gets used,
> > > else an error is returned.    This would allow Qemu to tell
> > > the kernel what type of MMU it is prepared to support.   Without
> > > this Qemu would just have to error out if the type returned is
> > > unknown.
> >
> > Yes, we could use the same struct for get and set. On set, it could
> transfer the mmu type, on get it could tell userspace the mmu type.
> 
> What happens if a get is done before the first set, and there are multiple
> MMU type options for this hardware, with differing entry sizes?
> 
> Qemu would have to know beforehand how large to make the buffer.
> 
> We could say that going forward, it's expected that qemu will do a TLB set
> (either a full one, or a lightweight alternative) after creating a vcpu.
> For compatibility, if this doesn't happen before the vcpu is run, the TLB
> is created and initialized as it is today, but no new Qemu-visible features
> will be enabled that way.

Since I think the normal thing Qemu would want to do is determine
the type/size before allocating space for the TLB, we could just
pass in NULL for tlb_data on the first set.   If tlb_data is
NULL we just set the MMU type and return the size (and type).
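
A sketch of that flow from Qemu's side, reusing Alex's
request_ppc_tlb struct from earlier in the thread (the ioctl name
KVM_PPC_SET_TLB and the entry_size_for() helper are placeholders,
not a final ABI):

	struct request_ppc_tlb req = {
		.tlb_type = BOOKE_HV,	/* the type Qemu is prepared to support */
		.tlb_data = NULL,	/* NULL: negotiate type/size only */
	};

	if (ioctl(vcpu_fd, KVM_PPC_SET_TLB, &req) < 0)
		return -1;		/* type not supported by this kernel */

	/* KVM has set the MMU type and filled in tlb_entries */
	void *tlb = calloc(req.tlb_entries, entry_size_for(req.tlb_type));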

> If Qemu does a get without ever doing some set operation, it should get an
> error, since the requirement to do a set is added at the same time as the
> get API.

Right.

Stuart

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-07 15:43           ` [Qemu-devel] " Alexander Graf
  (?)
@ 2011-02-07 20:15             ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-07 20:15 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

On Mon, 7 Feb 2011 16:43:02 +0100
Alexander Graf <agraf@suse.de> wrote:

> On 04.02.2011, at 23:33, Scott Wood wrote:
> 
> > On Thu, 3 Feb 2011 10:19:06 +0100
> > Alexander Graf <agraf@suse.de> wrote:
> > 
> >> Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :).
> > 
> > Maybe, though keeping it in KVM means we can change it whenever we want
> > without having to sync up Qemu and worry about backward compatibility.
> 
> Quite the contrary - you have to worry more about backward compatibility. If we implement a new feature that doesn't work on old kernels, we can just tell qemu to not work on those old versions. For the kernel interfaces, we have to keep supporting old userspace.

If you're talking about actual interface changes, yes.  But a change in
how KVM implements things behind the scenes shouldn't break an old
Qemu, unless it's buggy and makes assumptions not permitted by the API.

> > How's that different from backing the void pointer up with a different
> > struct depending on the MMU type?  We weren't proposing unions.
> > 
> > A fixed array does mean you wouldn't have to worry about whether qemu
> > supports the more advanced struct format if fields are added --
> > you can just unconditionally write it, as long as it's backwards
> > compatible.  Unless you hit the limit of the pre-determined array size,
> > that is.  And if that gets made higher for future expansion, that's
> > even more data that has to get transferred, before it's really needed.
> 
> Yes, it is. And I don't see how we could easily avoid it. Maybe just pass in a random __user pointer that we directly write to from kernel space and tell qemu how big and what type a tlb entry is?
> 
> struct request_ppc_tlb {
>     int tlb_type;
>     int tlb_entries;
>     uint64_t __user *tlb_data
> };

That's pretty much what the proposed API does -- except it uses a void
pointer instead of uint64_t *.

> Would you really want to loop through 16k entries, doing an ioctl for each? 

Not really.  The API was modeled after something we did on Topaz where
it's just a function call.  But something array-based would have been
awkward without constraining the geometry.

Now that we're going to constrain the geometry, providing an array-based
get/set would be easy and should definitely be a part of the API.

> Then performance really would always be an issue.

For cases where you really need to do a full get/set, yes.

> I would really prefer we tackle it with a full-on tlb get/set first and then put the very flexible one on top, because to me the full-on approach feels like the more generic one. I'm very open to adding an individual tlb get/set and maybe even a "kvm, please translate EA x to RA y" ioctl. But those should come after we cover the big hammer that just copies everything.

If we add everything at once, it minimizes the possibilities that Qemu
has to deal with -- either the full MMU API is there, or it's not.

BTW, I wonder if we should leave PPC out of the name.  It seems like
any arch with a software-visible TLB could use this, since the hw
details are hidden behind the MMU type.

How about:

struct kvmppc_booke_tlb_entry {
	union {
		__u64 mas0_1;
		struct {
			__u32 mas0;
			__u32 mas1;
		};
	};
	__u64 mas2;
	union {
		__u64 mas7_3;
		struct {
			__u32 mas7;
			__u32 mas3;
		};
	};
	__u32 mas8;
	__u32 pad;
};

struct kvmppc_booke_tlb_params {
	/*
	 * book3e defines 4 TLBs.  Individual implementations may have
	 * fewer.  TLBs that do not exist on the target must be configured
	 * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
	 * configured here, though arrays greater than 2048 entries will
	 * have TLBnCFG[NENTRY] set to zero.
	 */
	__u32 tlb_sizes[4];
};

struct kvmppc_booke_tlb_search {
	struct kvmppc_booke_tlb_entry entry;
	union {
		__u64 mas5_6;
		struct {
			__u32 mas5;
			__u32 mas6;
		};
	};
};

For an MMU type of PPC_BOOKE_NOHV, the mas5 field in
kvmppc_booke_tlb_search and the mas8 field in kvmppc_booke_tlb_entry
are present but not supported.

For an MMU type of PPC_BOOKE_NOHV or PPC_BOOKE_HV:
 - TLB entries in get/set arrays may be provided in any order, and all 
   TLBs are get/set at once.
 - An entry with MAS1[V] = 0 terminates the list early (but there will
   be no terminating entry if the full array is valid).  On a call to
   KVM_GET_TLB, the contents of elements after the terminator are undefined.
   On a call to KVM_SET_TLB, excess elements beyond the terminating
   entry may not be accessed by KVM.  (See the sketch below.)
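
To make the termination rule concrete, here is a sketch of a set
array that populates a single entry and terminates early.  MAS1_VALID
is the architected V bit (0x80000000, as in Linux's mmu-book3e.h);
ntlb0/ntlb1, guest_ea and guest_pa are assumed inputs, and the
TID/TS/TSIZE and permission encodings are elided -- fill those per
the Power ISA:

	struct kvmppc_booke_tlb_entry *array =
		calloc(ntlb0 + ntlb1, sizeof(*array));

	array[0].mas1 = MAS1_VALID;	/* V = 1, plus TID/TS/TSIZE fields */
	array[0].mas2 = guest_ea;	/* EPN plus WIMGE attributes */
	array[0].mas7_3 = guest_pa;	/* RPN plus permission bits */

	array[1].mas1 = 0;		/* V = 0: terminator -- KVM stops
					 * reading the array here */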

[Note: Once we implement sregs, Qemu can determine which TLBs are
implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
unsupported by KVM if its existence is implied by the target CPU]

KVM_SET_TLB
-----------

Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: struct kvm_set_tlb (in)
Returns: 0 on success
         -1 on error

struct kvm_set_tlb {
	__u64 params;
	__u64 array;
	__u32 mmu_type;
};

[Note: I used __u64 rather than void * to avoid the need for special
compat handling with 32-bit userspace on a 64-bit kernel -- if the other
way is preferred, that's fine with me]

Configures and sets the virtual CPU's TLB array.  The "params" and
"array" fields are userspace addresses of mmu-type-specific data
structures.

For mmu types PPC_BOOKE_NOHV and PPC_BOOKE_HV, the "params" field is of
type "struct kvmppc_booke_tlb_params", and the "array" field points to
an array of type "struct kvmppc_booke_tlb_entry".

[Note: KVM_SET_TLB with early array termination makes a separate
invalidate call unnecessary.]
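
Usage might then look like this (a sketch; 512/64 are the e500mc
TLB0/TLB1 sizes mentioned earlier in the thread, and "array" is the
entry array built as above):

	struct kvmppc_booke_tlb_params params = {
		.tlb_sizes = { 512, 64, 0, 0 },	/* TLB2/TLB3 absent */
	};
	struct kvm_set_tlb cfg = {
		.params   = (__u64)(unsigned long)&params,
		.array    = (__u64)(unsigned long)array,
		.mmu_type = PPC_BOOKE_HV,
	};

	if (ioctl(vcpu_fd, KVM_SET_TLB, &cfg) < 0)
		return -1;	/* type or geometry rejected */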

KVM_GET_TLB
-----------

Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: void pointer (out)
Returns: 0 on success
         -1 on error

Reads the TLB array from a virtual CPU.  A successful call to
KVM_SET_TLB must have been previously made on this vcpu.  The argument
must point to space for an array of the size and type of TLB entry
structs configured by the most recent successful call to KVM_SET_TLB.

For mmu types BOOKE_NOHV and BOOKE_HV, the array is of type "struct
kvmppc_booke_tlb_entry", and must hold a number of entries equal to
the sum of the elements of tlb_sizes in the most recent successful
TLB configuration call.
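
A sketch of the corresponding read-back, sized for the geometry set
above:

	/* Must hold the sum of tlb_sizes from the last KVM_SET_TLB */
	struct kvmppc_booke_tlb_entry *buf =
		calloc(512 + 64, sizeof(*buf));

	if (ioctl(vcpu_fd, KVM_GET_TLB, buf) < 0)
		return -1;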


KVM_SEARCH_TLB
--------------

Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: void pointer (in/out)
Returns: 0 on success
         -1 on error

Searches the TLB array of a virtual CPU for an entry matching
mmu-type-specific parameters.  A successful call to KVM_SET_TLB must
have been previously made on this vcpu.

For mmu types BOOKE_NOHV and BOOKE_HV, the argument must point to a
struct of type "kvmppc_booke_tlb_search".  The search operates as a tlbsx
instruction.
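
A sketch of a lookup, assuming the searched EA goes in the entry's
mas2 (EPN) field and the PID/AS qualifiers in mas6, as they would for
tlbsx (guest_ea and search_pid_as are assumed inputs):

	struct kvmppc_booke_tlb_search s = { 0 };

	s.entry.mas2 = guest_ea;	/* EA to search for (EPN bits) */
	s.mas6 = search_pid_as;		/* SPID/SAS, per the core manual */

	int hit = ioctl(vcpu_fd, KVM_SEARCH_TLB, &s) == 0 &&
		  (s.entry.mas1 & MAS1_VALID);	/* V = 1 on a hit */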

[Note: there currently exists a BookE implementation of KVM_TRANSLATE,
but the way it interprets the address is broken on 64-bit, and seems to
be confusing PPC's notion of a virtual address with what was most likely
intended by the x86ish API.  If nothing uses it yet that would be broken
by a change, we may want to reimplement it using the address input as just
an effective address, as it would be interpreted by the current state
of the vcpu.  This would be a way to provide a non-hw-specific
mechanism for simple virtual->physical translation for debuggers and
such, as long as the caller doesn't care about the x86ish attributes.]


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-07 17:30           ` [Qemu-devel] " Yoder Stuart-B08248
  (?)
@ 2011-02-08  9:10             ` Avi Kivity
  -1 siblings, 0 replies; 112+ messages in thread
From: Avi Kivity @ 2011-02-08  9:10 UTC (permalink / raw)
  To: Yoder Stuart-B08248
  Cc: Alexander Graf, Wood Scott-B07421, kvm-ppc, kvm, qemu-devel

On 02/07/2011 07:30 PM, Yoder Stuart-B08248 wrote:
>
> >  -----Original Message-----
> >  From: kvm-ppc-owner@vger.kernel.org [mailto:kvm-ppc-owner@vger.kernel.org]
> >  On Behalf Of Avi Kivity
> >  Sent: Monday, February 07, 2011 11:14 AM
> >  To: Alexander Graf
> >  Cc: Wood Scott-B07421; Yoder Stuart-B08248; kvm-ppc@vger.kernel.org;
> >  kvm@vger.kernel.org; qemu-devel@nongnu.org
> >  Subject: Re: RFC: New API for PPC for vcpu mmu access
> >
> >  On 02/03/2011 11:19 AM, Alexander Graf wrote:
> >  >  >
> >  >  >   I have no idea what things will look like 10 years down the road,
> >  >  >  but  currently e500mc has 576 entries (512 TLB0, 64 TLB1).
> >  >
> >  >  That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we
> >  want to transfer every time qemu feels like resolving an EA.
> >
> >  You could have an ioctl to translate addresses (x86 had KVM_TRANSLATE or
> >  similar), or have the TLB stored in user memory, so there is no need to
> >  transfer it (on the other hand, you have to re-validate it every time you
> >  peek at it).
>
> The most convenient and flexible thing for Power Book III-E, I think,
> will be something that operates like a TLB search instruction.  Inputs
> are 'address space' and 'process id', and outputs are which TLB the
> entry was found in and all the components of a TLB entry:
>     address space
>     pid
>     entry number
>     ea
>     rpn
>     guest state
>     permissions flags
>     attributes (WIMGE)
>
> Since all of those fields are architected in MAS registers, in the previous
> proposal we just proposed to return several 32-bit fields (one per MAS)
> that use the architected layout instead of inventing a brand new
> structure defining these fields.
>

This looks reasonable assuming you can take the hit of a system call per 
translation.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-07 19:56                   ` [Qemu-devel] " Yoder Stuart-B08248
  (?)
@ 2011-02-09 17:03                     ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-09 17:03 UTC (permalink / raw)
  To: Yoder Stuart-B08248; +Cc: Wood Scott-B07421, kvm-ppc, kvm, qemu-devel


On 07.02.2011, at 20:56, Yoder Stuart-B08248 wrote:

> 
> 
>> -----Original Message-----
>> From: Wood Scott-B07421
>> Sent: Monday, February 07, 2011 12:52 PM
>> To: Alexander Graf
>> Cc: Yoder Stuart-B08248; Wood Scott-B07421; kvm-ppc@vger.kernel.org;
>> kvm@vger.kernel.org; qemu-devel@nongnu.org
>> Subject: Re: RFC: New API for PPC for vcpu mmu access
>> 
>> On Mon, 7 Feb 2011 17:49:51 +0100
>> Alexander Graf <agraf@suse.de> wrote:
>> 
>>> 
>>> On 07.02.2011, at 17:40, Yoder Stuart-B08248 wrote:
>>> 
>>>> Suggested change to this would be to have Qemu set tlb_type as
>>>> an _input_ argument.   If KVM supports it, that type gets used,
>>>> else an error is returned.    This would allow Qemu to tell
>>>> the kernel what type of MMU it is prepared to support.   Without
>>>> this Qemu would just have to error out if the type returned is
>>>> unknown.
>>> 
>>> Yes, we could use the same struct for get and set. On set, it could
>> transfer the mmu type, on get it could tell userspace the mmu type.
>> 
>> What happens if a get is done before the first set, and there are multiple
>> MMU type options for this hardware, with differing entry sizes?
>> 
>> Qemu would have to know beforehand how large to make the buffer.
>> 
>> We could say that going forward, it's expected that qemu will do a TLB set
>> (either a full one, or a lightweight alternative) after creating a vcpu.
>> For compatibility, if this doesn't happen before the vcpu is run, the TLB
>> is created and initialized as it is today, but no new Qemu-visible features
>> will be enabled that way.
> 
> Since I think the normal thing Qemu would want to do is determine
> the type/size before allocating space for the TLB, we could just
> pass in NULL for tlb_data on the first set.   If tlb_data is
> NULL we just set the MMU type and return the size (and type).

It could also pass in some sort of max size - as long as nobody touches that memory it's almost for free.


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-07 20:15             ` [Qemu-devel] " Scott Wood
  (?)
@ 2011-02-09 17:21               ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-09 17:21 UTC (permalink / raw)
  To: Scott Wood; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel


On 07.02.2011, at 21:15, Scott Wood wrote:

> On Mon, 7 Feb 2011 16:43:02 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>> On 04.02.2011, at 23:33, Scott Wood wrote:
>> 
>>> On Thu, 3 Feb 2011 10:19:06 +0100
>>> Alexander Graf <agraf@suse.de> wrote:
>>> 
>>>> Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :).
>>> 
>>> Maybe, though keeping it in KVM means we can change it whenever we want
>>> without having to sync up Qemu and worry about backward compatibility.
>> 
>> Quite the contrary - you have to worry more about backward compatibility. If we implement a new feature that doesn't work on old kernels, we can just tell qemu to not work on those old versions. For the kernel interfaces, we have to keep supporting old userspace.
> 
> If you're talking about actual interface changes, yes.  But a change in
> how KVM implements things behind the scenes shouldn't break an old
> Qemu, unless it's buggy and makes assumptions not permitted by the API.

Right :).

> 
>>> How's that different from backing the void pointer up with a different
>>> struct depending on the MMU type?  We weren't proposing unions.
>>> 
>>> A fixed array does mean you wouldn't have to worry about whether qemu
>>> supports the more advanced struct format if fields are added --
>>> you can just unconditionally write it, as long as it's backwards
>>> compatible.  Unless you hit the limit of the pre-determined array size,
>>> that is.  And if that gets made higher for future expansion, that's
>>> even more data that has to get transferred, before it's really needed.
>> 
>> Yes, it is. And I don't see how we could easily avoid it. Maybe just pass in a random __user pointer that we directly write to from kernel space and tell qemu how big and what type a tlb entry is?
>> 
>> struct request_ppc_tlb {
>>    int tlb_type;
>>    int tlb_entries;
>>    uint64_t __user *tlb_data
>> };
> 
> That's pretty much what the proposed API does -- except it uses a void
> pointer instead of uint64_t *.

Oh? Did I miss something there? The proposal looked as if it only transfers a single TLB entry at a time.

> 
>> Would you really want to loop through 16k entries, doing an ioctl for each? 
> 
> Not really.  The API was modeled after something we did on Topaz where
> it's just a function call.  But something array-based would have been
> awkward without constraining the geometry.
> 
> Now that we're going to constrain the geometry, providing an array-based
> get/set would be easy and should definitely be a part of the API.
> 
>> Then performance really would always be an issue.
> 
> For cases where you really need to do a full get/set, yes.
> 
>> I would really prefer we tackle it with a full-on tlb get/set first and then put the very flexible one on top, because to me the full-on approach feels like the more generic one. I'm very open to adding an individual tlb get/set and maybe even a "kvm, please translate EA x to RA y" ioctl. But those should come after we cover the big hammer that just copies everything.
> 
> If we add everything at once, it minimizes the possibilities that Qemu
> has to deal with -- either the full MMU API is there, or it's not.

Don't try to take load off qemu before it has started to handle it and has shown that it performs badly. One of the big strengths of the kvm+qemu architecture we have today is the emulation part. Sure, it duplicates some of the code, but it helps in a lot of other places too - sometimes even because of the duplication.

> BTW, I wonder if we should leave PPC out of the name.  It seems like
> any arch with a software-visible TLB could use this, since the hw
> details are hidden behind the MMU type.

Very true.

> 
> How about:
> 
> struct kvmppc_booke_tlb_entry {
> 	union {
> 		__u64 mas0_1;
> 		struct {
> 			__u32 mas0;
> 			__u32 mas1;
> 		};
> 	};
> 	__u64 mas2;
> 	union {
> 		__u64 mas7_3;
> 		struct {
> 			__u32 mas7;
> 			__u32 mas3;
> 		};
> 	};
> 	__u32 mas8;
> 	__u32 pad;

Would it make sense to add some reserved fields or would we just bump up the mmu id?

> };
> 
> struct kvmppc_booke_tlb_params {
> 	/*
> 	 * book3e defines 4 TLBs.  Individual implementations may have
> 	 * fewer.  TLBs that do not exist on the target must be configured
> 	 * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
> 	 * configured here, though arrays greater than 2048 entries will
> 	 * have TLBnCFG[NENTRY] set to zero.
> 	 */
> 	__u32 tlb_sizes[4];

Add some reserved fields?

> };
> 
> struct kvmppc_booke_tlb_search {

Search? I thought we agreed on having a search later, after the full get/set is settled?

> 	struct kvmppc_booke_tlb_entry entry;
> 	union {
> 		__u64 mas5_6;
> 		struct {
> 			__u32 mas5;
> 			__u32 mas6;
> 		};
> 	};
> };
> 
> For a mmu type of PPC_BOOKE_NOHV, the mas5 field in
> kvmppc_booke_tlb_search and the mas8 field in kvmppc_booke_tlb_entry
> are present but not supported.
> 
> For an MMU type of PPC_BOOKE_NOHV or PPC_BOOKE_HV:
> - TLB entries in get/set arrays may be provided in any order, and all 
>   TLBs are get/set at once.

Makes sense

> - An entry with MAS1[V] = 0 terminates the list early (but there will
>   be no terminating entry if the full array is valid).  On a call to
>   KVM_GET_TLB, the contents of elements after the terminator are undefined.
>   On a call to KVM_SET_TLB, excess elements beyond the terminating
>   entry may not be accessed by KVM.

Very implementation specific, but ok with me. It's constrained to the BOOKE implementation of that GET/SET anyway. Is this how the hardware works too?

> 
> [Note: Once we implement sregs, Qemu can determine which TLBs are
> implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
> unsupported by KVM if its existence is implied by the target CPU]
> 
> KVM_SET_TLB
> -----------
> 
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: struct kvm_set_tlb (in)
> Returns: 0 on success
>         -1 on error
> 
> struct kvm_set_tlb {
> 	__u64 params;
> 	__u64 array;
> 	__u32 mmu_type;
> };
> 
> [Note: I used __u64 rather than void * to avoid the need for special
> compat handling with 32-bit userspace on a 64-bit kernel -- if the other
> way is preferred, that's fine with me]

Oh, now I understand what you were proposing :). Sorry. No, this way is sane.

> Configures and sets the virtual CPU's TLB array.  The "params" and
> "array" fields are userspace addresses of mmu-type-specific data
> structures.
> 
> For mmu types PPC_BOOKE_NOHV and PPC_BOOKE_HV, the "params" field is of
> type "struct kvmppc_booke_tlb_params", and the "array" field points to
> an array of type "struct kvmppc_booke_tlb_entry".
> 
> [Note: KVM_SET_TLB with early array termination makes a separate
> invalidate call unnecessary.]
> 
> KVM_GET_TLB
> -----------
> 
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: void pointer (out)
> Returns: 0 on success
>         -1 on error
> 
> Reads the TLB array from a virtual CPU.  A successful call to
> KVM_SET_TLB must have been previously made on this vcpu.  The argument
> must point to space for an array of the size and type of TLB entry
> structs configured by the most recent successful call to KVM_SET_TLB.
> 
> For mmu types BOOKE_NOHV and BOOKE_HV, the array is of type "struct
> kvmppc_booke_tlb_entry", and must hold a number of entries equal to
> the sum of the elements of tlb_sizes in the most recent successful
> TLB configuration call.

We should add some sort of safety net here. The caller really should pass in how big that buffer is.
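
Something like this, maybe -- purely hypothetical, just to show the
shape of the check:

	struct kvm_get_tlb {
		__u64 array;		/* userspace buffer */
		__u64 array_len;	/* bytes available at 'array'; the
					 * kernel fails with E2BIG rather
					 * than writing past the end */
	};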

> 
> 
> KVM_SEARCH_TLB
> --------------
> 
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: void pointer (in/out)
> Returns: 0 on success
>         -1 on error
> 
> Searches the TLB array of a virtual CPU for an entry matching
> mmu-type-specific parameters.  A successful call to KVM_SET_TLB must
> have been previously made on this vcpu.
> 
> For mmu types BOOKE_NOHV and BOOKE_HV, the argument must point to a
> struct of type "kvmppc_booke_tlb_search".  The search operates as a tlbsx
> instruction.
> 
> [Note: there currently exists a BookE implementation of KVM_TRANSLATE,
> but the way it interprets the address is broken on 64-bit, and seems to
> be confusing PPC's notion of a virtual address with what was most likely
> intended by the x86ish API.  If nothing uses it yet that would be broken
> by a change, we may want to reimplement it using the address input as just
> an effective address, as it would be interpreted by the current state
> of the vcpu.  This would be a way to provide a non-hw-specific
> mechanism for simple virtual->physical translation for debuggers and
> such, as long as the caller doesn't care about the x86ish attributes.]

I'm not aware of any callers of KVM_TRANSLATE on PPC, so we're free to change the semantics. For gdb, a simple ea->pa translation with a valid bit should be enough. The original semantics of KVM_TRANSLATE apply to the current vcpu context anyway.
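
For reference, the existing KVM_TRANSLATE argument from
include/linux/kvm.h already carries the valid bit a debugger would
need; only the PPC interpretation of linear_address has to change:

	struct kvm_translation {
		/* in */
		__u64 linear_address;

		/* out */
		__u64 physical_address;
		__u8  valid;
		__u8  writeable;
		__u8  usermode;
		__u8  pad[5];
	};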


Sorry for the late reply :)

Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-09 17:21               ` Alexander Graf
@ 2011-02-09 23:09                 ` Scott Wood
  0 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-09 23:09 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

On Wed, 9 Feb 2011 18:21:40 +0100
Alexander Graf <agraf@suse.de> wrote:

> 
> On 07.02.2011, at 21:15, Scott Wood wrote:
> 
> > That's pretty much what the proposed API does -- except it uses a void
> > pointer instead of uint64_t *.
> 
> Oh? Did I miss something there? The proposal looked as if it only transfers a single TLB entry at a time.

Right, I just meant in terms of avoiding a fixed reference to a hw-specific
type.

> > How about:
> > 
> > struct kvmppc_booke_tlb_entry {
> > 	union {
> > 		__u64 mas0_1;
> > 		struct {
> > 			__u32 mas0;
> > 			__u32 mas1;
> > 		};
> > 	};
> > 	__u64 mas2;
> > 	union {
> > 	__u64 mas7_3;
> > 		struct {
> > 			__u32 mas7;
> > 			__u32 mas3;
> > 		};
> > 	};
> > 	__u32 mas8;
> > 	__u32 pad;
> 
> Would it make sense to add some reserved fields or would we just bump up the mmu id?

I was thinking we'd just bump the ID.  I only stuck "pad" in there for
alignment.  And we're making a large array of it, so padding could hurt.

> > struct kvmppc_booke_tlb_params {
> > 	/*
> > 	 * book3e defines 4 TLBs.  Individual implementations may have
> > 	 * fewer.  TLBs that do not exist on the target must be configured
> > 	 * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
> > 	 * configured here, though arrays greater than 2048 entries will
> > 	 * have TLBnCFG[NENTRY] set to zero.
> > 	 */
> > 	__u32 tlb_sizes[4];
> 
> Add some reserved fields?

MMU type ID also controls this, but we could add some padding to make
extensions simpler (esp. since we're not making an array of it).  How much
would you recommend?

> > struct kvmppc_booke_tlb_search {
> 
> Search? I thought we agreed on having a search later, after the full get/set is settled?

We agreed on having a full array-like get/set... my preference was to keep
it all under one capability, which implies adding it at the same time.
But if we do KVM_TRANSLATE, we can probably drop KVM_SEARCH_TLB.  I'm
skeptical that an array-only interface will avoid performance problems
under every usage pattern, but we can implement it and try it out before
finalizing any of this.

> > 	struct kvmppc_booke_tlb_entry entry;
> > 	union {
> > 		__u64 mas5_6;
> > 		struct {
> > 			__u64 mas5;
> > 			__u64 mas6;
> > 		};
> > 	};
> > };

The fields inside the struct should be __u32, of course. :-P

> > - An entry with MAS1[V] = 0 terminates the list early (but there will
> >   be no terminating entry if the full array is valid).  On a call to
> >   KVM_GET_TLB, the contents of elements after the terminator are undefined.
> >   On a call to KVM_SET_TLB, excess elements beyond the terminating
> >   entry may not be accessed by KVM.
> 
> Very implementation specific, but ok with me. 

I assumed most MMU types would have some straightforward way of marking an
entry invalid (if not, it can add a software field in the struct), and that
it would be MMU-specific code that is processing the list.

> It's constrained to the BOOKE implementation of that GET/SET anyway. Is
> this how the hardware works too?

Hardware doesn't process lists of entries.  But MAS1[V] is the valid
bit in hardware.
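
So the MMU-specific list walk would be something like this sketch
(MAS1_VALID is the architected 0x80000000 valid bit;
kvmppc_invalidate_all_tlbs() and kvmppc_set_one_tlbe() are
hypothetical helpers):

#define MAS1_VALID	0x80000000

static void set_tlb_array(struct kvmppc_booke_tlb_entry *array,
			  unsigned int nentries)
{
	unsigned int i;

	/* all-at-once semantics: start from an empty TLB */
	kvmppc_invalidate_all_tlbs();

	for (i = 0; i < nentries; i++) {
		if (!(array[i].mas1 & MAS1_VALID))
			break;	/* terminator; later elements are ignored */
		kvmppc_set_one_tlbe(&array[i]);
	}
}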

> > [Note: Once we implement sregs, Qemu can determine which TLBs are
> > implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
> > unsupported by KVM if its existence is implied by the target CPU]
> > 
> > KVM_SET_TLB
> > -----------
> > 
> > Capability: KVM_CAP_SW_TLB
> > Type: vcpu ioctl
> > Parameters: struct kvm_set_tlb (in)
> > Returns: 0 on success
> >         -1 on error
> > 
> > struct kvm_set_tlb {
> > 	__u64 params;
> > 	__u64 array;
> > 	__u32 mmu_type;
> > };
> > 
> > [Note: I used __u64 rather than void * to avoid the need for special
> > compat handling with 32-bit userspace on a 64-bit kernel -- if the other
> > way is preferred, that's fine with me]
> 
> Oh, now I understand what you were proposing :). Sorry. No, this way is sane.

What about the ioctls that take only a pointer?  The actual calling
mechanism should work without compat, but in order for _IOR and such to not
assign a different IOCTL number based on the size of void *, we'd need to
lie and use plain _IO().  It looks like some ioctls such as
KVM_SET_TSS_ADDR already do this.

If we drop KVM_SEARCH_TLB and struct-ize KVM_GET_TLB to fit in a buffer
size parameter, it's moot though.
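
To illustrate the number-encoding issue (KVMIO is the real 0xAE ioctl
type from <linux/kvm.h>; the 0xe0 command number is made up):

#include <linux/ioctl.h>

#define KVMIO	0xAE

/* _IOR() folds sizeof() of its argument into the command number, so
 * this would differ between 32-bit and 64-bit userspace: */
#define EXAMPLE_GET_TLB_BAD	_IOR(KVMIO, 0xe0, void *)

/* _IO() leaves the size field zero -- one number everywhere, the
 * trick KVM_SET_TSS_ADDR (_IO(KVMIO, 0x47)) already uses: */
#define EXAMPLE_GET_TLB_OK	_IO(KVMIO, 0xe0)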

> > KVM_GET_TLB
> > -----------
> > 
> > Capability: KVM_CAP_SW_TLB
> > Type: vcpu ioctl
> > Parameters: void pointer (out)
> > Returns: 0 on success
> >         -1 on error
> > 
> > Reads the TLB array from a virtual CPU.  A successful call to
> > KVM_SET_TLB must have been previously made on this vcpu.  The argument
> > must point to space for an array of the size and type of TLB entry
> > structs configured by the most recent successful call to KVM_SET_TLB.
> > 
> > For mmu types BOOKE_NOHV and BOOKE_HV, the array is of type "struct
> > kvmppc_booke_tlb_entry", and must hold a number of entries equal to
> > the sum of the elements of tlb_sizes in the most recent successful
> > TLB configuration call.
> 
> We should add some sort of safety net here. The caller really should pass in how big that pointer is.

The caller must have previously called KVM_SET_TLB (or KVM_GET_TLB will
return an error), so it implicitly told KVM how much data to send, based
on MMU type and params.

I'm OK with an explicit buffer size for added safety, though.

-Scott

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-03  9:19       ` Alexander Graf
@ 2011-02-10  0:04         ` Scott Wood
  0 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-10  0:04 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

On Thu, 3 Feb 2011 10:19:06 +0100
Alexander Graf <agraf@suse.de> wrote:

> Yeah, that one's tricky. Usually the way the memory resolver in qemu works is as follows:
> 
>  * kvm goes to qemu
>  * qemu fetches all mmu and register data from kvm
>  * qemu runs its mmu resolution function as if the target was emulated
> 
> So the "normal" way would be to fetch _all_ TLB entries from KVM, shove them into env and implement the MMU in qemu (at least enough of it to enable debugging). No other target modifies this code path. But no other target needs to copy > 30kb of data only to get the mmu data either :).

I guess you mean that cpu_synchronize_state() is supposed to pull in the
MMU state, though I don't see where it gets called for 'm'/'M' commands in
the gdb stub.

The MMU code seems to be pretty target-specific.  It's not clear to what
extent there is a "normal" way, versus what book3s happens to rely on in
its get_physical_address() code.  I don't think there are any platforms
supported yet (with both KVM and a non-empty cpu_get_phys_page_debug()
implementation) that have a pure software-managed TLB.  x86 has page
tables, and book3s has the hash table (603/e300 doesn't, or more accurately
Linux doesn't use it, but I guess that's not supported by KVM yet?).

We could probably do some sort of lazy state transfer only when MMU code
that needs it is run.  This could initially include debug translations, for
testing a non-KVM-dependent get_physical_address() implementation, but
eventually that would use KVM_TRANSLATE (when KVM is used) and thus not
trigger the state transfer.  I'd also like to add an "info tlb" command,
which would require the state transfer.
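
Sketch of the lazy version (all names hypothetical -- a synced flag
plus a wrapper that pulls the TLB out of KVM on first use):

static void lazy_sync_tlb(CPUState *env)
{
	if (env->tlb_synced)
		return;
	kvm_ppc_get_tlb(env);	/* hypothetical KVM_GET_TLB wrapper */
	env->tlb_synced = 1;
}

target_phys_addr_t cpu_get_phys_page_debug(CPUState *env,
					   target_ulong addr)
{
	lazy_sync_tlb(env);
	return booke_translate_debug(env, addr);	/* sw TLB walk */
}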

BTW, how much besides the MMU is missing to be able to run an e500
target in qemu without kvm?

-Scott

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-09 23:09                 ` Scott Wood
@ 2011-02-10 11:45                   ` Alexander Graf
  0 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-10 11:45 UTC (permalink / raw)
  To: Scott Wood; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

Scott Wood wrote:
> On Wed, 9 Feb 2011 18:21:40 +0100
> Alexander Graf <agraf@suse.de> wrote:
>
>   
>> On 07.02.2011, at 21:15, Scott Wood wrote:
>>
>>     
>>> That's pretty much what the proposed API does -- except it uses a void
>>> pointer instead of uint64_t *.
>>>       
>> Oh? Did I miss something there? The proposal looked as if it only transfers a single TLB entry at a time.
>>     
>
> Right, I just meant in terms of avoiding a fixed reference to a hw-specific
> type.
>
>   
>>> How about:
>>>
>>> struct kvmppc_booke_tlb_entry {
>>> 	union {
>>> 		__u64 mas0_1;
>>> 		struct {
>>> 			__u32 mas0;
>>> 			__u32 mas1;
>>> 		};
>>> 	};
>>> 	__u64 mas2;
>>> 	union {
>>> 	__u64 mas7_3;
>>> 		struct {
>>> 			__u32 mas7;
>>> 			__u32 mas3;
>>> 		};
>>> 	};
>>> 	__u32 mas8;
>>> 	__u32 pad;
>>>       
>> Would it make sense to add some reserved fields or would we just bump up the mmu id?
>>     
>
> I was thinking we'd just bump the ID.  I only stuck "pad" in there for
> alignment.  And we're making a large array of it, so padding could hurt.
>   

Ok, thinking about this a bit more. You're basically proposing a list of
tlb set calls, with each array element describing one tlb set call. What
I was thinking of was a full TLB sync, so we could keep qemu's internal
TLB representation identical to the ioctl layout and then just call that
one ioctl to completely overwrite all of qemu's internal data (and vice
versa).
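
I.e. something like this sketch, where qemu keeps the guest TLB in
exactly the ioctl layout (the env fields are made-up names):

static int kvm_ppc_put_tlb(CPUState *env, int vcpu_fd)
{
	struct kvm_set_tlb req = {
		.params   = (uintptr_t)&env->tlb_params,
		.array    = (uintptr_t)env->tlb,  /* kvmppc_booke_tlb_entry[] */
		.mmu_type = env->mmu_type,
	};

	/* one call overwrites the whole guest TLB from qemu's copy;
	 * KVM_GET_TLB would go the other way */
	return ioctl(vcpu_fd, KVM_SET_TLB, &req);
}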

>>> struct kvmppc_booke_tlb_params {
>>> 	/*
>>> 	 * book3e defines 4 TLBs.  Individual implementations may have
>>> 	 * fewer.  TLBs that do not exist on the target must be configured
>>> 	 * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
>>> 	 * configured here, though arrays greater than 2048 entries will
>>> 	 * have TLBnCFG[NENTRY] set to zero.
>>> 	 */
>>> 	__u32 tlb_sizes[4];
>>>       
>> Add some reserved fields?
>>     
>
> MMU type ID also controls this, but could add some padding to make
> extensions simpler (esp. since we're not making an array of it).  How much
> would you recommend?
>   

How about making it 64 bytes? That should leave us plenty of room.
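
E.g. (just a sketch -- the reserved-field name is made up):

struct kvmppc_booke_tlb_params {
	__u32 tlb_sizes[4];	/* 16 bytes, as above */
	__u32 reserved[12];	/* pad to 64 bytes for future extensions */
};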

>   
>>> struct kvmppc_booke_tlb_search {
>>>       
>> Search? I thought we agreed on having a search later, after the full get/set is settled?
>>     
>
> We agreed on having a full array-like get/set... my preference was to keep
> it all under one capability, which implies adding it at the same time.
> But if we do KVM_TRANSLATE, we can probably drop KVM_SEARCH_TLB.  I'm
> skeptical that array-only will not be a performance issue under any usage
> pattern, but we can implement it and try it out before finalizing any of
> this.
>   

Yup. We can even implement it, measure what exactly is slow and then
decide on how to implement it. I'd bet that only the emulation stub is
slow - and for that KVM_TRANSLATE seems like a good fit.

>   
>>> 	struct kvmppc_booke_tlb_entry entry;
>>> 	union {
>>> 		__u64 mas5_6;
>>> 		struct {
>>> 			__u64 mas5;
>>> 			__u64 mas6;
>>> 		};
>>> 	};
>>> };
>>>       
>
> The fields inside the struct should be __u32, of course. :-P
>   

Ugh, yes :). But since we're dropping this anyways, it doesn't matter,
right? :)

>   
>>> - An entry with MAS1[V] = 0 terminates the list early (but there will
>>>   be no terminating entry if the full array is valid).  On a call to
>>>   KVM_GET_TLB, the contents of elements after the terminator are undefined.
>>>   On a call to KVM_SET_TLB, excess elements beyond the terminating
>>>   entry may not be accessed by KVM.
>>>       
>> Very implementation specific, but ok with me. 
>>     
>
> I assumed most MMU types would have some straightforward way of marking an
> entry invalid (if not, it can add a software field in the struct), and that
> it would be MMU-specific code that is processing the list.
>   

See above :).

>   
>> It's constrained to the BOOKE implementation of that GET/SET anyway. Is
>> this how the hardware works too?
>>     
>
> Hardware doesn't process lists of entries.  But MAS1[V] is the valid
> bit in hardware.
>
>   
>>> [Note: Once we implement sregs, Qemu can determine which TLBs are
>>> implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
>>> unsupported by KVM if its existence is implied by the target CPU]
>>>
>>> KVM_SET_TLB
>>> -----------
>>>
>>> Capability: KVM_CAP_SW_TLB
>>> Type: vcpu ioctl
>>> Parameters: struct kvm_set_tlb (in)
>>> Returns: 0 on success
>>>         -1 on error
>>>
>>> struct kvm_set_tlb {
>>> 	__u64 params;
>>> 	__u64 array;
>>> 	__u32 mmu_type;
>>> };
>>>
>>> [Note: I used __u64 rather than void * to avoid the need for special
>>> compat handling with 32-bit userspace on a 64-bit kernel -- if the other
>>> way is preferred, that's fine with me]
>>>       
>> Oh, now I understand what you were proposing :). Sorry. No, this way is sane.
>>     
>
> What about the ioctls that take only a pointer?  The actual calling
> mechanism should work without compat, but in order for _IOR and such to not
> assign a different IOCTL number based on the size of void *, we'd need to
> lie and use plain _IO().  It looks like some ioctls such as
> KVM_SET_TSS_ADDR already do this.
>
> If we drop KVM_SEARCH_TLB and struct-ize KVM_GET_TLB to fit in a buffer
> size parameter, it's moot though.
>   

Yup :). Just always pass in a struct - makes it easier for later extensions.

>   
>>> KVM_GET_TLB
>>> -----------
>>>
>>> Capability: KVM_CAP_SW_TLB
>>> Type: vcpu ioctl
>>> Parameters: void pointer (out)
>>> Returns: 0 on success
>>>         -1 on error
>>>
>>> Reads the TLB array from a virtual CPU.  A successful call to
>>> KVM_SET_TLB must have been previously made on this vcpu.  The argument
>>> must point to space for an array of the size and type of TLB entry
>>> structs configured by the most recent successful call to KVM_SET_TLB.
>>>
>>> For mmu types BOOKE_NOHV and BOOKE_HV, the array is of type "struct
>>> kvmppc_booke_tlb_entry", and must hold a number of entries equal to
>>> the sum of the elements of tlb_sizes in the most recent successful
>>> TLB configuration call.
>>>       
>> We should add some sort of safety net here. The caller really should pass in how big that pointer is.
>>     
>
> The caller must have previously called KVM_SET_TLB (or KVM_GET_TLB will
> return an error), so it implicitly told KVM how much data to send, based
> on MMU type and params.
>
> I'm OK with an explicit buffer size for added safety, though.
>   

Yup, this way we're even safer :). It's the same as snprintf vs
sprintf on a buffer of known length with contents of mostly known length.
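
Something like this struct-ized KVM_GET_TLB would give KVM the
explicit bound (a sketch; field names are made up):

struct kvm_get_tlb {
	__u64 array;		/* userspace address of the entry array */
	__u32 array_len;	/* size of that buffer in bytes, so the
				 * kernel can fail instead of overrun */
	__u32 mmu_type;
};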


Alex


^ permalink raw reply	[flat|nested] 112+ messages in thread

* [Qemu-devel] Re: RFC: New API for PPC for vcpu mmu access
@ 2011-02-10 11:45                   ` Alexander Graf
  0 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-10 11:45 UTC (permalink / raw)
  To: Scott Wood; +Cc: Yoder Stuart-B08248, kvm, kvm-ppc, qemu-devel

Scott Wood wrote:
> On Wed, 9 Feb 2011 18:21:40 +0100
> Alexander Graf <agraf@suse.de> wrote:
>
>   
>> On 07.02.2011, at 21:15, Scott Wood wrote:
>>
>>     
>>> That's pretty much what the proposed API does -- except it uses a void
>>> pointer instead of uint64_t *.
>>>       
>> Oh? Did I miss something there? The proposal looked as if it only transfers a single TLB entry at a time.
>>     
>
> Right, I just meant in terms of avoiding a fixed reference to a hw-specific
> type.
>
>   
>>> How about:
>>>
>>> struct kvmppc_booke_tlb_entry {
>>> 	union {
>>> 		__u64 mas0_1;
>>> 		struct {
>>> 			__u32 mas0;
>>> 			__u32 mas1;
>>> 		};
>>> 	};
>>> 	__u64 mas2;
>>> 	union {
>>> 		__u64 mas7_3	
>>> 		struct {
>>> 			__u32 mas7;
>>> 			__u32 mas3;
>>> 		};
>>> 	};
>>> 	__u32 mas8;
>>> 	__u32 pad;
>>>       
>> Would it make sense to add some reserved fields or would we just bump up the mmu id?
>>     
>
> I was thinking we'd just bump the ID.  I only stuck "pad" in there for
> alignment.  And we're making a large array of it, so padding could hurt.
>   

Ok, thinking about this a bit more. You're basically proposing a list of
tlb set calls, with each array field identifying one tlb set call. What
I was thinking of was a full TLB sync, so we could keep qemu's internal
TLB representation identical to the ioctl layout and then just call that
one ioctl to completely overwrite all of qemu's internal data (and vice
versa).

>>> struct kvmppc_booke_tlb_params {
>>> 	/*
>>> 	 * book3e defines 4 TLBs.  Individual implementations may have
>>> 	 * fewer.  TLBs that do not exist on the target must be configured
>>> 	 * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
>>> 	 * configured here, though arrays greater than 2048 entries will
>>> 	 * have TLBnCFG[NENTRY] set to zero.
>>> 	 */
>>> 	__u32 tlb_sizes[4];
>>>       
>> Add some reserved fields?
>>     
>
> MMU type ID also controls this, but could add some padding to make
> extensions simpler (esp. since we're not making an array of it).  How much
> would you recommend?
>   

How about making it 64 bytes? That should leave us plenty of room.

>   
>>> struct kvmppc_booke_tlb_search {
>>>       
>> Search? I thought we agreed on having a search later, after the full get/set is settled?
>>     
>
> We agreed on having a full array-like get/set... my preference was to keep
> it all under one capability, which implies adding it at the same time.
> But if we do KVM_TRANSLATE, we can probably drop KVM_SEARCH_TLB.  I'm
> skeptical that array-only will not be a performance issue under any usage
> pattern, but we can implement it and try it out before finalizing any of
> this.
>   

Yup. We can even implement it, measure what exactly is slow and then
decide on how to implement it. I'd bet that only the emulation stub is
slow - and for that KVM_TRANSLATE seems like a good fit.

>   
>>> 	struct kvmppc_booke_tlb_entry entry;
>>> 	union {
>>> 		__u64 mas5_6;
>>> 		struct {
>>> 			__u64 mas5;
>>> 			__u64 mas6;
>>> 		};
>>> 	};
>>> };
>>>       
>
> The fields inside the struct should be __u32, of course. :-P
>   

Ugh, yes :). But since we're dopping this anyways, it doesn't matter,
right? :)

>   
>>> - An entry with MAS1[V] = 0 terminates the list early (but there will
>>>   be no terminating entry if the full array is valid).  On a call to
>>>   KVM_GET_TLB, the contents of elemnts after the terminator are undefined.
>>>   On a call to KVM_SET_TLB, excess elements beyond the terminating
>>>   entry may not be accessed by KVM.
>>>       
>> Very implementation specific, but ok with me. 
>>     
>
> I assumed most MMU types would have some straightforward way of marking an
> entry invalid (if not, it can add a software field in the struct), and that
> it would be MMU-specific code that is processing the list.
>   

See above :).

>   
>> It's constrained to the BOOKE implementation of that GET/SET anyway. Is
>> this how the hardware works too?
>>     
>
> Hardware doesn't process lists of entries.  But MAS1[V] is the valid
> bit in hardware.
>
>   
>>> [Note: Once we implement sregs, Qemu can determine which TLBs are
>>> implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
>>> unsupported by KVM if its existence is implied by the target CPU]
>>>
>>> KVM_SET_TLB
>>> -----------
>>>
>>> Capability: KVM_CAP_SW_TLB
>>> Type: vcpu ioctl
>>> Parameters: struct kvm_set_tlb (in)
>>> Returns: 0 on success
>>>         -1 on error
>>>
>>> struct kvm_set_tlb {
>>> 	__u64 params;
>>> 	__u64 array;
>>> 	__u32 mmu_type;
>>> };
>>>
>>> [Note: I used __u64 rather than void * to avoid the need for special
>>> compat handling with 32-bit userspace on a 64-bit kernel -- if the other
>>> way is preferred, that's fine with me]
>>>       
>> Oh, now I understand what you were proposing :). Sorry. No, this way is sane.
>>     
>
> What about the ioctls that take only a pointer?  The actual calling
> mechanism should work without compat, but in order for _IOR and such to not
> assign a different IOCTL number based on the size of void *, we'd need to
> lie and use plain _IO().  It looks like some ioctls such as
> KVM_SET_TSS_ADDR already do this.
>
> If we drop KVM_SEARCH_TLB and struct-ize KVM_GET_TLB to fit in a buffer
> size parameter, it's moot though.
>   

Yup :). Just always pass in a struct - makes it easier for later extensions.

>   
>>> KVM_GET_TLB
>>> -----------
>>>
>>> Capability: KVM_CAP_SW_TLB
>>> Type: vcpu ioctl
>>> Parameters: void pointer (out)
>>> Returns: 0 on success
>>>         -1 on error
>>>
>>> Reads the TLB array from a virtual CPU.  A successful call to
>>> KVM_SET_TLB must have been previously made on this vcpu.  The argument
>>> must point to space for an array of the size and type of TLB entry
>>> structs configured by the most recent successful call to KVM_SET_TLB.
>>>
>>> For mmu types BOOKE_NOHV and BOOKE_HV, the array is of type "struct
>>> kvmppc_booke_tlb_entry", and must hold a number of entries equal to
>>> the sum of the elements of tlb_sizes in the most recent successful
>>> TLB configuration call.
>>>       
>> We should add some sort of safety net here. The caller really should pass in how big that pointer is.
>>     
>
> The caller must have previously called KVM_SET_TLB (or KVM_GET_TLB will
> return an error), so it implicitly told KVM how much data to send, based
> on MMU type and params.
>
> I'm OK with an explicit buffer size for added safety, though.
>   

Yup, this way we're even more safe :). It's the same as snprintf vs
sprintf on a buffer of known length with contents of mostly known length.


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-10  0:04         ` [Qemu-devel] " Scott Wood
@ 2011-02-10 11:55           ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-10 11:55 UTC (permalink / raw)
  To: Scott Wood
  Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel, Edgar E. Iglesias

Scott Wood wrote:
> On Thu, 3 Feb 2011 10:19:06 +0100
> Alexander Graf <agraf@suse.de> wrote:
>
>   
>> Yeah, that one's tricky. Usually the way the memory resolver in qemu works is as follows:
>>
>>  * kvm goes to qemu
>>  * qemu fetches all mmu and register data from kvm
>>  * qemu runs its mmu resolution function as if the target was emulated
>>
>> So the "normal" way would be to fetch _all_ TLB entries from KVM, shove them into env and implement the MMU in qemu (at least enough of it to enable debugging). No other target modifies this code path. But no other target needs to copy > 30kb of data only to get the mmu data either :).
>>     
>
> I guess you mean that cpu_synchronize_state() is supposed to pull in the
> MMU state, though I don't see where it gets called for 'm'/'M' commands in
> the gdb stub.
>   

Well, we could also call it in get_phys_page_debug in target-ppc, but
yes. I guess the reason it works for now is that SDR1 is pretty constant
and was fetched earlier on. For BookE, not syncing is obviously even more
broken.
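
To make that concrete, here is a minimal sketch of what such a call
site could look like. The function shape is paraphrased from
target-ppc rather than copied, so treat the signature and the
get_physical_address() call as assumptions:

    /* Hypothetical sketch: pull current register/MMU state from KVM
     * before walking the MMU for a debug translation.  The resolver
     * call stands in for target-ppc's internals and may differ. */
    target_phys_addr_t cpu_get_phys_page_debug(CPUState *env, target_ulong addr)
    {
        mmu_ctx_t ctx;

        cpu_synchronize_state(env);   /* fetch SDR1/TLB contents from KVM */
        if (get_physical_address(env, &ctx, addr, 0, ACCESS_INT) != 0) {
            return -1;
        }
        return ctx.raddr & TARGET_PAGE_MASK;
    }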

> The MMU code seems to be pretty target-specific.  It's not clear to what
> extent there is a "normal" way, versus what book3s happens to rely on in
> its get_physical_address() code.  I don't think there are any platforms
> supported yet (with both KVM and a non-empty cpu_get_phys_page_debug()
> implementation) that have a pure software-managed TLB.  x86 has page
> tables, and book3s has the hash table (603/e300 doesn't, or more accurately
> Linux doesn't use it, but I guess that's not supported by KVM yet?).
>   

As for PPC, only 440, e500 and G3-5 are basically supported. It happens
to work on POWER4 and above too, and I've even got reports that it's good
on e600 :).

> We could probably do some sort of lazy state transfer only when MMU code
> that needs it is run.  This could initially include debug translations, for
> testing a non-KVM-dependent get_physical_address() implementation, but
> eventually that would use KVM_TRANSLATE (when KVM is used) and thus not
>   

Yup :).

> trigger the state transfer.  I'd also like to add an "info tlb" command,
> which would require the state transfer.
>   

Very nice.

> BTW, how much other than the MMU is missing to be able to run an e500
> target in qemu, without kvm?
>   

The last person working on BookE emulation was Edgar. Edgar, how far did
you get?


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-10 11:55           ` [Qemu-devel] " Alexander Graf
@ 2011-02-10 12:31             ` Edgar E. Iglesias
  -1 siblings, 0 replies; 112+ messages in thread
From: Edgar E. Iglesias @ 2011-02-10 12:31 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

On Thu, Feb 10, 2011 at 12:55:22PM +0100, Alexander Graf wrote:
> Scott Wood wrote:
> > On Thu, 3 Feb 2011 10:19:06 +0100
> > Alexander Graf <agraf@suse.de> wrote:
> >
> >   
> >> Yeah, that one's tricky. Usually the way the memory resolver in qemu works is as follows:
> >>
> >>  * kvm goes to qemu
> >>  * qemu fetches all mmu and register data from kvm
> >>  * qemu runs its mmu resolution function as if the target was emulated
> >>
> >> So the "normal" way would be to fetch _all_ TLB entries from KVM, shove them into env and implement the MMU in qemu (at least enough of it to enable debugging). No other target modifies this code path. But no other target needs to copy > 30kb of data only to get the mmu data either :).
> >>     
> >
> > I guess you mean that cpu_synchronize_state() is supposed to pull in the
> > MMU state, though I don't see where it gets called for 'm'/'M' commands in
> > the gdb stub.
> >   
> 
> Well, we could also call it in get_phys_page_debug in target-ppc, but
> yes. I guess the reason it works for now is that SDR1 is pretty constant
> and was fetched earlier on. For BookE not syncing is obviously even more
> broken.
> 
> > The MMU code seems to be pretty target-specific.  It's not clear to what
> > extent there is a "normal" way, versus what book3s happens to rely on in
> > its get_physical_address() code.  I don't think there are any platforms
> > supported yet (with both KVM and a non-empty cpu_get_phys_page_debug()
> > implementation) that have a pure software-managed TLB.  x86 has page
> > tables, and book3s has the hash table (603/e300 doesn't, or more accurately
> > Linux doesn't use it, but I guess that's not supported by KVM yet?).
> >   
> 
> As for PPC, only 440, e500 and G3-5 are basically supported. It happens
> to work on POWER4 and above too and I've even got reports that it's good
> on e600 :).
> 
> > We could probably do some sort of lazy state transfer only when MMU code
> > that needs it is run.  This could initially include debug translations, for
> > testing a non-KVM-dependent get_physical_address() implementation, but
> > eventually that would use KVM_TRANSLATE (when KVM is used) and thus not
> >   
> 
> Yup :).
> 
> > trigger the state transfer.  I'd also like to add an "info tlb" command,
> > which would require the state transfer.
> >   
> 
> Very nice.
> 
> > BTW, how much other than the MMU is missing to be able to run an e500
> > target in qemu, without kvm?
> >   
> 
> The last person working on BookE emulation was Edgar. Edgar, how far did
> you get?

Hi,

TBH, I don't really know. My goal was to get Linux running on a PPC-440
embedded in Xilinx FPGAs. I managed to fix enough BookE emulation
to get that far.

After that, we've done a few more hacks to run fsboot and uboot. Also,
we've added support for some of the BookE debug registers to be able
to run gdbserver from within Linux guests. Some of these patches haven't
made it upstream yet.

I haven't taken the time to compare the specs to the qemu code, so I don't
really know how much is missing. My guess is that if you want to run
Linux guests, the MMU won't be the limiting factor.

Cheers

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-10 11:45                   ` [Qemu-devel] " Alexander Graf
@ 2011-02-10 18:51                     ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-10 18:51 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel

On Thu, 10 Feb 2011 12:45:38 +0100
Alexander Graf <agraf@suse.de> wrote:

> Ok, thinking about this a bit more. You're basically proposing a list of
> tlb set calls, with each array field identifying one tlb set call. What
> I was thinking of was a full TLB sync, so we could keep qemu's internal
> TLB representation identical to the ioctl layout and then just call that
> one ioctl to completely overwrite all of qemu's internal data (and vice
> versa).

No, this is a full sync -- the list replaces any existing TLB entries (need
to make that explicit in the doc).  Basically it's an invalidate plus a
list of tlb set operations.

Qemu's internal representation will want to be ordered with no missing
entries.  If we require that of the transfer representation we can't do
early termination.  It would also limit Qemu's flexibility in choosing its
internal representation, and make it more awkward to support multiple MMU
types.

Let's see if the format conversion imposes significant overhead before
imposing a less flexible/larger transfer format. :-)
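
For concreteness, a minimal userspace sketch of that conversion under
the proposed rules. The entry struct mirrors kvmppc_booke_tlb_entry
from earlier in the thread; MAS1_VALID and the source array are
placeholders, not actual qemu code:

    #include <stdint.h>
    #include <string.h>

    #define MAS1_VALID 0x80000000u        /* MAS1[V], per the Power ISA */

    struct booke_tlb_entry {           /* mirrors kvmppc_booke_tlb_entry */
        uint32_t mas0, mas1;
        uint64_t mas2;
        uint32_t mas7, mas3;
        uint32_t mas8;
        uint32_t pad;
    };

    /* Copy only the valid entries into the transfer array; if fewer
     * than 'total' are valid, a MAS1[V]=0 entry terminates the list
     * early, as proposed above. */
    static size_t assemble_tlb_list(struct booke_tlb_entry *xfer,
                                    const struct booke_tlb_entry *src,
                                    size_t total)
    {
        size_t n = 0;

        for (size_t i = 0; i < total; i++) {
            if (src[i].mas1 & MAS1_VALID)
                xfer[n++] = src[i];
        }
        if (n < total)
            memset(&xfer[n], 0, sizeof(xfer[n]));   /* terminator */
        return n;
    }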

> > MMU type ID also controls this, but could add some padding to make
> > extensions simpler (esp. since we're not making an array of it).  How much
> > would you recommend?
> >   
> 
> How about making it 64 bytes? That should leave us plenty of room.

OK.

> > The fields inside the struct should be __u32, of course. :-P
> >   
> 
> Ugh, yes :). But since we're dropping this anyways, it doesn't matter,
> right? :)

Right.

> > I assumed most MMU types would have some straightforward way of marking an
> > entry invalid (if not, it can add a software field in the struct), and that
> > it would be MMU-specific code that is processing the list.
> >   
> 
> See above :).

Which part?

-Scott


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-10 18:51                     ` [Qemu-devel] " Scott Wood
@ 2011-02-11  0:20                       ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-11  0:20 UTC (permalink / raw)
  To: Scott Wood; +Cc: Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel


On 10.02.2011, at 19:51, Scott Wood wrote:

> On Thu, 10 Feb 2011 12:45:38 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>> Ok, thinking about this a bit more. You're basically proposing a list of
>> tlb set calls, with each array field identifying one tlb set call. What
>> I was thinking of was a full TLB sync, so we could keep qemu's internal
>> TLB representation identical to the ioctl layout and then just call that
>> one ioctl to completely overwrite all of qemu's internal data (and vice
>> versa).
> 
> No, this is a full sync -- the list replaces any existing TLB entries (need
> to make that explicit in the doc).  Basically it's an invalidate plus a
> list of tlb set operations.
> 
> Qemu's internal representation will want to be ordered with no missing
> entries.  If we require that of the transfer representation we can't do
> early termination.  It would also limit Qemu's flexibility in choosing its
> internal representation, and make it more awkward to support multiple MMU
> types.

Well, but this way it means we'll have to assemble/disassemble a list of entries multiple times:

SET:
 * qemu assembles the list from its internal representation
 * kvm disassembles the list into its internal structure

GET:
 * kvm assembles the list from its internal representation
 * qemu disassembles the list into its internal structure

Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-11  0:20                       ` [Qemu-devel] " Alexander Graf
@ 2011-02-11  0:22                         ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-11  0:22 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, Yoder Stuart-B08248, kvm-ppc, kvm, qemu-devel


On 11.02.2011, at 01:20, Alexander Graf wrote:

> 
> On 10.02.2011, at 19:51, Scott Wood wrote:
> 
>> On Thu, 10 Feb 2011 12:45:38 +0100
>> Alexander Graf <agraf@suse.de> wrote:
>> 
>>> Ok, thinking about this a bit more. You're basically proposing a list of
>>> tlb set calls, with each array field identifying one tlb set call. What
>>> I was thinking of was a full TLB sync, so we could keep qemu's internal
>>> TLB representation identical to the ioctl layout and then just call that
>>> one ioctl to completely overwrite all of qemu's internal data (and vice
>>> versa).
>> 
>> No, this is a full sync -- the list replaces any existing TLB entries (need
>> to make that explicit in the doc).  Basically it's an invalidate plus a
>> list of tlb set operations.
>> 
>> Qemu's internal representation will want to be ordered with no missing
>> entries.  If we require that of the transfer representation we can't do
>> early termination.  It would also limit Qemu's flexibility in choosing its
>> internal representation, and make it more awkward to support multiple MMU
>> types.
> 
> Well, but this way it means we'll have to assemble/disassemble a list of entries multiple times:
> 
> SET:
> * qemu assembles the list from its internal representation
> * kvm disassembles the list into its internal structure
> 
> GET:
> * kvm assembles the list from its internal representation
> * qemu disassembles the list into its internal structure
> 
> Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?

If we could then keep qemu's internal representation == shared data with kvm == kvm's internal data for guest visible stuff, we get this done with almost no additional overhead. And I don't see any problem with this. Should be easily doable.
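
As a sketch of what that buys us, reusing the entry struct from the
sketch further up (env->shared_tlb and tlb_base[] are invented names,
not existing qemu fields):

    /* With a single guest-visible array shared with KVM, there is
     * nothing to assemble or disassemble on either side: qemu's TLB
     * accessors become direct loads/stores into the shared memory. */
    static inline struct booke_tlb_entry *
    tlb_entry(CPUState *env, int tlbsel, int idx)
    {
        return &env->shared_tlb[env->tlb_base[tlbsel] + idx];
    }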


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-11  0:22                         ` [Qemu-devel] " Alexander Graf
@ 2011-02-11  1:41                           ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-11  1:41 UTC (permalink / raw)
  To: Scott Wood
  Cc: Yoder Stuart-B08248, kvm-ppc, kvm@vger.kernel.org list,
	qemu-devel@nongnu.org List


On 11.02.2011, at 01:22, Alexander Graf wrote:

> 
> On 11.02.2011, at 01:20, Alexander Graf wrote:
> 
>> 
>> On 10.02.2011, at 19:51, Scott Wood wrote:
>> 
>>> On Thu, 10 Feb 2011 12:45:38 +0100
>>> Alexander Graf <agraf@suse.de> wrote:
>>> 
>>>> Ok, thinking about this a bit more. You're basically proposing a list of
>>>> tlb set calls, with each array field identifying one tlb set call. What
>>>> I was thinking of was a full TLB sync, so we could keep qemu's internal
>>>> TLB representation identical to the ioctl layout and then just call that
>>>> one ioctl to completely overwrite all of qemu's internal data (and vice
>>>> versa).
>>> 
>>> No, this is a full sync -- the list replaces any existing TLB entries (need
>>> to make that explicit in the doc).  Basically it's an invalidate plus a
>>> list of tlb set operations.
>>> 
>>> Qemu's internal representation will want to be ordered with no missing
>>> entries.  If we require that of the transfer representation we can't do
>>> early termination.  It would also limit Qemu's flexibility in choosing its
>>> internal representation, and make it more awkward to support multiple MMU
>>> types.
>> 
>> Well, but this way it means we'll have to assemble/disassemble a list of entries multiple times:
>> 
>> SET:
>> * qemu assembles the list from its internal representation
>> * kvm disassembles the list into its internal structure
>> 
>> GET:
>> * kvm assembles the list from its internal representation
>> * qemu disassembles the list into its internal structure
>> 
>> Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?
> 
> If we could then keep qemu's internal representation == shared data with kvm == kvm's internal data for guest visible stuff, we get this done with almost no additional overhead. And I don't see any problem with this. Should be easily doable.

So then all we need to get the full functionality is a hint from kernel to user space that something changed, and vice versa.

From kernel to user space is simple. We can just document that after every RUN, all fields can be modified.
From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes in a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.
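
A sketch of what the hint could look like from qemu's side. The ioctl
name, the ioctl number and the struct are invented for illustration,
not a merged interface:

    #include <stdint.h>
    #include <sys/ioctl.h>

    struct ppc_dirty_tlb {
        uint64_t bitmap;      /* userspace address, one bit per TLB entry */
        uint32_t num_dirty;   /* hint so the kernel can pick a threshold  */
    };

    /* Placeholder ioctl number -- purely illustrative. */
    #define KVM_PPC_DIRTY_TLB _IOW('k', 0xff, struct ppc_dirty_tlb)

    /* Mark-and-nudge: qemu edits the shared array in place, sets the
     * corresponding bits, then tells the kernel.  A first cut may
     * ignore the bitmap and simply flush the whole shadow TLB. */
    static int tlb_sync_to_kvm(int vcpu_fd, uint64_t *bitmap, uint32_t num_dirty)
    {
        struct ppc_dirty_tlb d = {
            .bitmap    = (uint64_t)(uintptr_t)bitmap,
            .num_dirty = num_dirty,
        };
        return ioctl(vcpu_fd, KVM_PPC_DIRTY_TLB, &d);
    }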

Also, please tell me you didn't implement the previous revisions already. It'd be a real bummer to see that work wasted only because we're still iterating through the spec O_o.


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-11  1:41                           ` [Qemu-devel] " Alexander Graf
@ 2011-02-11 20:53                             ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-11 20:53 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Yoder Stuart-B08248, kvm-ppc, kvm@vger.kernel.org list,
	qemu-devel@nongnu.org List

On Fri, 11 Feb 2011 02:41:35 +0100
Alexander Graf <agraf@suse.de> wrote:

> >> Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?

Hmm, the idea is growing on me.

> So then all we need to get the full functionality is a hint from kernel to user space that something changed, and vice versa.
> 
> From kernel to user space is simple. We can just document that after every RUN, all fields can be modified.
> From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes in a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.
> 
> That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.

OK.  We'll also need a config ioctl to specify MMU type/size and the address
of the arrays.
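
For illustration, such a config call could carry exactly those pieces
plus an explicit length as the safety net discussed earlier. The names
are invented and just mirror the shapes already used in this thread:

    struct ppc_tlb_config {              /* illustrative only */
        uint64_t params;     /* userspace address of the MMU-type-specific
                                params struct (e.g. the booke tlb_sizes) */
        uint64_t array;      /* userspace address of the shared TLB array */
        uint32_t mmu_type;   /* e.g. BOOKE_NOHV or BOOKE_HV               */
        uint32_t array_len;  /* in bytes, checked against type and params */
    };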

> Also, please tell me you didn't implement the previous revisions already.

I didn't. :-)

-Scott

^ permalink raw reply	[flat|nested] 112+ messages in thread

> That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.

OK.  We'll also need a config ioctl to specify MMU type/size and the address
of the arrays.

> Also, please tell me you didn't implement the previous revisions already.

I didn't. :-)

-Scott


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-11 20:53                             ` [Qemu-devel] " Scott Wood
@ 2011-02-11 21:07                               ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-11 21:07 UTC (permalink / raw)
  To: Scott Wood
  Cc: Yoder Stuart-B08248, kvm-ppc, kvm@vger.kernel.org list,
	qemu-devel@nongnu.org List


On 11.02.2011, at 21:53, Scott Wood wrote:

> On Fri, 11 Feb 2011 02:41:35 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>>>> Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?
> 
> Hmm, the idea is growing on me.
> 
>> So then all we need to get the full functionality is a hint from kernel to user space that something changed, and vice versa.
>> 
>> From kernel to user space is simple. We can just document that after every RUN, all fields can be modified.
>> From user space to kernel, we could modify the entries directly and then issue an ioctl that passes a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.
>> 
>> That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and issue the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint about the number of flushes, so KVM can implement some threshold.
> 
> OK.  We'll also need a config ioctl to specify MMU type/size and the address
> of the arrays.

Right, a setup call basically :).


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-11 21:07                               ` [Qemu-devel] " Alexander Graf
@ 2011-02-12  0:57                                 ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-12  0:57 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Yoder Stuart-B08248, kvm-ppc, kvm@vger.kernel.org list,
	qemu-devel@nongnu.org List

On Fri, 11 Feb 2011 22:07:11 +0100
Alexander Graf <agraf@suse.de> wrote:

> 
> On 11.02.2011, at 21:53, Scott Wood wrote:
> 
> > On Fri, 11 Feb 2011 02:41:35 +0100
> > Alexander Graf <agraf@suse.de> wrote:
> > 
> >>>> Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?
> > 
> > Hmm, the idea is growing on me.
> > 
> >> So then all we need to get the full functionality is a hint from kernel to user space that something changed, and vice versa.
> >> 
> >> From kernel to user space is simple. We can just document that after every RUN, all fields can be modified.
> >> From user space to kernel, we could modify the entries directly and then issue an ioctl that passes a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.
> >> 
> >> That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and issue the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint about the number of flushes, so KVM can implement some threshold.
> > 
> > OK.  We'll also need a config ioctl to specify MMU type/size and the address
> > of the arrays.
> 
> Right, a setup call basically :).

OK, here goes v3:

[Note: one final thought after writing this up -- this isn't going to work
too well in cases where the guest can directly manipulate its TLB, such as
with the LRAT feature of Power Arch 2.06.  We'll still need a
copy-in/copy-out mechanism for that.]

struct kvmppc_book3e_tlb_entry {
	union {
		__u64 mas8_1;
		struct {
			__u32 mas8;
			__u32 mas1;
		};
	};
	__u64 mas2;
	union {
		__u64 mas7_3;
		struct {
			__u32 mas7;
			__u32 mas3;
		};
	};
};

For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is
present but not supported.

struct kvmppc_book3e_tlb_params {
	/*
	 * book3e defines 4 TLBs.  Individual implementations may have
	 * fewer.  TLBs that do not already exist on the target must be
	 * configured with a size of zero.  A tlb_ways value of zero means
	 * the array is fully associative.  Only TLBs that are already
	 * set-associative on the target may be configured with a different
	 * associativity.  A set-associative TLB may not exceed 255 ways.
	 *
	 * KVM will adjust TLBnCFG based on the sizes configured here,
	 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
	 * set to zero.
	 *
	 * The size of any TLB that is set-associative must be a multiple of
	 * the number of ways, and the number of sets must be a power of two.
	 *
	 * The page sizes supported by a TLB shall be determined by reading
	 * the TLB configuration registers.  This is not adjustable by userspace.
	 * [Note: need sregs]
	 */
	__u32 tlb_sizes[4];
	__u8 tlb_ways[4];
	__u32 reserved[11];
};
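
As an illustration, an e500v2-style configuration (TLB0: 512 entries, 4-way
set associative; TLB1: 16 entries, fully associative -- the exact numbers are
implementation specific) might look like:

	struct kvmppc_book3e_tlb_params params = {
		.tlb_sizes = { 512, 16, 0, 0 },	/* TLB2/TLB3 don't exist here */
		.tlb_ways  = { 4, 0, 0, 0 },	/* zero means fully associative */
	};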

KVM_CONFIG_TLB
--------------

Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: struct kvm_config_tlb (in)
Returns: 0 on success
         -1 on error

struct kvm_config_tlb {
	__u64 params;
	__u64 array;
	__u32 mmu_type;
	__u32 array_len;
};

Configures the virtual CPU's TLB array, establishing a shared memory area
between userspace and KVM.  The "params" and "array" fields are userspace
addresses of mmu-type-specific data structures.  The "array_len" field is a
safety mechanism, and should be set to the size in bytes of the memory that
userspace has reserved for the array.  It must be at least the size dictated by
"mmu_type" and "params".

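For example, a minimal sketch of the call ("params" as configured above;
"vcpu_fd" stands for the vcpu file descriptor, and error handling is
omitted):

	struct kvmppc_book3e_tlb_entry entries[512 + 16];	/* one slot per TLB entry */
	struct kvm_config_tlb cfg = {
		.params    = (__u64)(unsigned long)&params,
		.array     = (__u64)(unsigned long)entries,
		.mmu_type  = KVM_MMU_PPC_BOOK3E_NOHV,
		.array_len = sizeof(entries),
	};

	ioctl(vcpu_fd, KVM_CONFIG_TLB, &cfg);
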
On return from this call, KVM assumes the state of the TLBs to be empty. 
Prior to calling KVM_RUN, userspace must call KVM_DIRTY_TLB to tell KVM about
any valid entries.

While KVM_RUN is active, the shared region is under control of KVM.  Its
contents are undefined, and any modification by userspace results in boundedly
undefined behavior.

On return from KVM_RUN, the shared region will reflect the current state of
the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
to tell KVM which entries have been changed, prior to calling KVM_RUN again
on this vcpu.

For mmu types KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV:
 - The "params" field is of type "struct kvmppc_book3e_tlb_params".
 - The "array" field points to an array of type "struct kvmppc_book3e_tlb_entry".
 - The array consists of all entries in the first TLB, followed by all
   entries in the second TLB, etc.
 - Within a TLB, if the array is not set-associative, entries are ordered by
   increasing ESEL.  If the array is set-associative, entries are ordered first
   by set.  Within a set, entries are ordered by way (ESEL).
 - The hash for determining set number is:
     (MAS2[EPN] / page_size) & (num_sets - 1)
   where "page_size" is the smallest page size supported by the TLB, and
   "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
   If a book3e chip is made for which a different hash is needed, a new
   MMU type must be used, to ensure that userspace and KVM agree on layout.
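
As a sketch, the array index of an entry in a set-associative TLB would then
be computed along these lines ("tlb_base" being the total number of entries
in all preceding TLBs):

	num_sets = tlb_sizes[n] / tlb_ways[n];
	set      = (mas2_epn / page_size) & (num_sets - 1);	/* mas2_epn: EPN field of MAS2, as an address */
	idx      = tlb_base + set * tlb_ways[n] + way;		/* way == ESEL within the set */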

KVM_DIRTY_TLB
-------------

Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: struct kvm_dirty_tlb (in)
Returns: 0 on success
         -1 on error

struct kvm_dirty_tlb {
	__u64 bitmap;
	__u32 num_dirty;
};

This must be called whenever userspace has changed an entry in the shared
TLB, prior to calling KVM_RUN on the associated vcpu.

The "bitmap" field is the userspace address of an array.  This array consists
of a number of bits, equal to the total number of TLB entries as determined by the
last successful call to KVM_CONFIG_TLB, rounded up to the nearest multiple of 64.

Each bit corresponds to one TLB entry, ordered the same as in the shared TLB array.

The array is little-endian: bit 0 is the least significant bit of the
first byte, bit 8 is the least significant bit of the second byte, etc.
This avoids any complications with differing word sizes.

The "num_dirty" field is a performance hint for KVM to determine whether it
should skip processing the bitmap and just invalidate everything.  It must
be set to the number of set bits in the bitmap.
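
A minimal userspace sketch, assuming entry "idx" in the shared array was just
modified ("num_entries" and "vcpu_fd" carried over from the KVM_CONFIG_TLB
setup, not part of this API):

	__u8 bitmap[((num_entries + 63) / 64) * 8];	/* bits, rounded up to a multiple of 64 */
	struct kvm_dirty_tlb dirty = {
		.bitmap    = (__u64)(unsigned long)bitmap,
		.num_dirty = 1,
	};

	memset(bitmap, 0, sizeof(bitmap));
	bitmap[idx / 8] |= 1 << (idx % 8);	/* bit 0 = LSB of byte 0 */
	ioctl(vcpu_fd, KVM_DIRTY_TLB, &dirty);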

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-12  0:57                                 ` [Qemu-devel] " Scott Wood
@ 2011-02-13 22:43                                   ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-13 22:43 UTC (permalink / raw)
  To: Scott Wood
  Cc: Yoder Stuart-B08248, kvm-ppc, kvm@vger.kernel.org list,
	qemu-devel@nongnu.org List


On 12.02.2011, at 01:57, Scott Wood wrote:

> On Fri, 11 Feb 2011 22:07:11 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>> 
>> On 11.02.2011, at 21:53, Scott Wood wrote:
>> 
>>> On Fri, 11 Feb 2011 02:41:35 +0100
>>> Alexander Graf <agraf@suse.de> wrote:
>>> 
>>>>>> Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?
>>> 
>>> Hmm, the idea is growing on me.
>>> 
>>>> So then all we need to get the full functionality is a hint from kernel to user space that something changed, and vice versa.
>>>> 
>>>> From kernel to user space is simple. We can just document that after every RUN, all fields can be modified.
>>>> From user space to kernel, we could modify the entries directly and then issue an ioctl that passes a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.
>>>> 
>>>> That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and issue the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint about the number of flushes, so KVM can implement some threshold.
>>> 
>>> OK.  We'll also need a config ioctl to specify MMU type/size and the address
>>> of the arrays.
>> 
>> Right, a setup call basically :).
> 
> OK, here goes v3:
> 
> [Note: one final thought after writing this up -- this isn't going to work
> too well in cases where the guest can directly manipulate its TLB, such as
> with the LRAT feature of Power Arch 2.06.  We'll still need a
> copy-in/copy-out mechanism for that.]

In that case qemu sets the mmu mode accordingly and is aware that it needs get/set methods. We'll get to that when we have LRAT implemented :).

> struct kvmppc_book3e_tlb_entry {
> 	union {
> 		__u64 mas8_1;
> 		struct {
> 			__u32 mas8;
> 			__u32 mas1;
> 		};
> 	};
> 	__u64 mas2;
> 	union {
> 		__u64 mas7_3;
> 		struct {
> 			__u32 mas7;
> 			__u32 mas3;
> 		};
> 	};
> };

Looks good to me, except for the anonymous unions and structs of course. Avi dislikes those :). Is there any obvious reason we need to have unions in the first place? The compiler should be clever enough to pick the right size accessors when writing/reading masked __u64 entries, no? The struct name should also have a version indicator - it's the data descriptor for only a single specific mmu_type, right?
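
(I.e., presumably just plain __u64 fields, with the halves extracted by
shifting and masking -- on a big-endian host mas8 would be the upper word of
mas8_1:)

	struct kvmppc_book3e_tlb_entry {
		__u64 mas8_1;
		__u64 mas2;
		__u64 mas7_3;
	};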

> 
> For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is
> present but not supported.
> 
> struct kvmppc_book3e_tlb_params {
> 	/*
> 	 * book3e defines 4 TLBs.  Individual implementations may have
> 	 * fewer.  TLBs that do not already exist on the target must be
> 	 * configured with a size of zero.  A tlb_ways value of zero means
> 	 * the array is fully associative.  Only TLBs that are already
> 	 * set-associative on the target may be configured with a different
> 	 * associativity.  A set-associative TLB may not exceed 255 ways.
> 	 *
> 	 * KVM will adjust TLBnCFG based on the sizes configured here,
> 	 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
> 	 * set to zero.
> 	 *
> 	 * The size of any TLB that is set-associative must be a multiple of
> 	 * the number of ways, and the number of sets must be a power of two.
> 	 *
> 	 * The page sizes supported by a TLB shall be determined by reading
> 	 * the TLB configuration registers.  This is not adjustable by userspace.
> 	 * [Note: need sregs]
> 	 */
> 	__u32 tlb_sizes[4];
> 	__u8 tlb_ways[4];
> 	__u32 reserved[11];
> };
> 
> KVM_CONFIG_TLB
> --------------
> 
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: struct kvm_config_tlb (in)
> Returns: 0 on success
>         -1 on error
> 
> struct kvm_config_tlb {
> 	__u64 params;
> 	__u64 array;
> 	__u32 mmu_type;
> 	__u32 array_len;

Some reserved bits please. IIRC Avi also really likes it when in addition to reserved fields there's also a "features" int that indicates which reserved fields are actually usable. Shouldn't hurt here either, right?
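
(Along these lines, presumably:)

	struct kvm_config_tlb {
		__u64 params;
		__u64 array;
		__u32 mmu_type;
		__u32 array_len;
		__u32 features;		/* says which reserved fields are valid */
		__u32 reserved[7];
	};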

> };
> 
> Configures the virtual CPU's TLB array, establishing a shared memory area
> between userspace and KVM.  The "params" and "array" fields are userspace
> addresses of mmu-type-specific data structures.  The "array_len" field is a
> safety mechanism, and should be set to the size in bytes of the memory that
> userspace has reserved for the array.  It must be at least the size dictated by
> "mmu_type" and "params".
> 
> On return from this call, KVM assumes the state of the TLBs to be empty. 
> Prior to calling KVM_RUN, userspace must call KVM_DIRTY_TLB to tell KVM about
> any valid entries.
> 
> While KVM_RUN is active, the shared region is under control of KVM.  Its
> contents are undefined, and any modification by userspace results in boundedly
> undefined behavior.
> 
> On return from KVM_RUN, the shared region will reflect the current state of
> the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
> to tell KVM which entries have been changed, prior to calling KVM_RUN again
> on this vcpu.
> 
> For mmu types KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV:
> - The "params" field is of type "struct kvmppc_book3e_tlb_params".
> - The "array" field points to an array of type "struct kvmppc_book3e_tlb_entry".
> - The array consists of all entries in the first TLB, followed by all
>   entries in the second TLB, etc.
> - Within a TLB, if the array is not set-associative, entries are ordered by
>   increasing ESEL.  If the array is set-associative, entries are ordered first
>   by set.  Within a set, entries are ordered by way (ESEL).
> - The hash for determining set number is:
>     (MAS2[EPN] / page_size) & (num_sets - 1)
>   where "page_size" is the smallest page size supported by the TLB, and
>   "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
>   If a book3e chip is made for which a different hash is needed, a new
>   MMU type must be used, to ensure that userspace and KVM agree on layout.

Please state the size explicitly then. It's 1k, right?

> 
> KVM_DIRTY_TLB
> -------------
> 
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: struct kvm_dirty_tlb (in)
> Returns: 0 on success
>         -1 on error
> 
> struct kvm_dirty_tlb {
> 	__u64 bitmap;
> 	__u32 num_dirty;
> };
> 
> This must be called whenever userspace has changed an entry in the shared
> TLB, prior to calling KVM_RUN on the associated vcpu.
> 
> The "bitmap" field is the userspace address of an array.  This array consists
> of a number of bits, equal to the total number of TLB entries as determined by the
> last successful call to KVM_CONFIG_TLB, rounded up to the nearest multiple of 64.
> 
> Each bit corresponds to one TLB entry, ordered the same as in the shared TLB array.
> 
> The array is little-endian: bit 0 is the least significant bit of the
> first byte, bit 8 is the least significant bit of the second byte, etc.
> This avoids any complications with differing word sizes.
> 
> The "num_dirty" field is a performance hint for KVM to determine whether it
> should skip processing the bitmap and just invalidate everything.  It must
> be set to the number of set bits in the bitmap.

Sounds very nice :). I'm not sure if we need extensibility for the dirty mechanism. It doesn't really hurt to add it, but why do so when it's not required? Hrm. I'll leave that up to you :).


Alex

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-13 22:43                                   ` [Qemu-devel] " Alexander Graf
@ 2011-02-14 17:11                                     ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-14 17:11 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Yoder Stuart-B08248, kvm-ppc, kvm@vger.kernel.org list,
	qemu-devel@nongnu.org List

On Sun, 13 Feb 2011 23:43:40 +0100
Alexander Graf <agraf@suse.de> wrote:

> > struct kvmppc_book3e_tlb_entry {
> > 	union {
> > 		__u64 mas8_1;
> > 		struct {
> > 			__u32 mas8;
> > 			__u32 mas1;
> > 		};
> > 	};
> > 	__u64 mas2;
> > 	union {
> > 		__u64 mas7_3;
> > 		struct {
> > 			__u32 mas7;
> > 			__u32 mas3;
> > 		};
> > 	};
> > };
> 
> Looks good to me, except for the anonymous unions and structs of course. Avi dislikes those :). 

:-(

> Is there any obvious reason we need to have unions in the first place? The
> compiler should be clever enough to pick the right size accessors when
> writing/reading masked  __u64 entries, no?

Yes, the intent was just to make it easier to access individual mas
registers, and reuse existing bit declarations that are defined relative
to individual registers.

Why clutter up the source code when the compiler can deal with it?  Same
applies to the anonymous unions/structs -- it's done that way to present
the fields in the most straightforward manner (equivalent to how the SPRs
themselves are named and aliased).

If it's a firm NACK on the existing structure, I think I'd rather just drop
the paired versions altogether (but still order them this way so it can be
changed back if there's any desire to in the future, without breaking
compatibility).

> The struct name should also have
> a version indicator - it's the data descriptor for only a single specific
> mmu_type, right?

It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

> > For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is
> > present but not supported.
> > 
> > struct kvmppc_book3e_tlb_params {
> > 	/*
> > 	 * book3e defines 4 TLBs.  Individual implementations may have
> > 	 * fewer.  TLBs that do not already exist on the target must be
> > 	 * configured with a size of zero.  A tlb_ways value of zero means
> > 	 * the array is fully associative.  Only TLBs that are already
> > 	 * set-associative on the target may be configured with a different
> > 	 * associativity.  A set-associative TLB may not exceed 255 ways.
> > 	 *
> > 	 * KVM will adjust TLBnCFG based on the sizes configured here,
> > 	 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
> > 	 * set to zero.
> > 	 *
> > 	 * The size of any TLB that is set-associative must be a multiple of
> > 	 * the number of ways, and the number of sets must be a power of two.
> > 	 *
> > 	 * The page sizes supported by a TLB shall be determined by reading
> > 	 * the TLB configuration registers.  This is not adjustable by userspace.
> > 	 * [Note: need sregs]
> > 	 */
> > 	__u32 tlb_sizes[4];
> > 	__u8 tlb_ways[4];
> > 	__u32 reserved[11];
> > };
> > 
> > KVM_CONFIG_TLB
> > --------------
> > 
> > Capability: KVM_CAP_SW_TLB
> > Type: vcpu ioctl
> > Parameters: struct kvm_config_tlb (in)
> > Returns: 0 on success
> >         -1 on error
> > 
> > struct kvm_config_tlb {
> > 	__u64 params;
> > 	__u64 array;
> > 	__u32 mmu_type;
> > 	__u32 array_len;
> 
> Some reserved bits please. IIRC Avi also really likes it when in addition to reserved fields there's also a "features" int that indicates which reserved fields are actually usable. Shouldn't hurt here either, right?

params itself is a versioned struct, with reserved bits.  Do we really need
it here as well?
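
For illustration, a sketch of setting both up (struct definitions as quoted
in this thread; KVM_CONFIG_TLB stands in for whatever request code gets
assigned, and the e500-ish sizes are assumptions, not normative):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/* Illustrative values: a 512-entry, 4-way TLB0 (128 sets, a power of
 * two) and a 16-entry fully associative TLB1; TLB2/TLB3 absent. */
static void init_tlb_params(struct kvmppc_book3e_tlb_params *p)
{
        memset(p, 0, sizeof(*p));
        p->tlb_sizes[0] = 512;
        p->tlb_ways[0]  = 4;
        p->tlb_sizes[1] = 16;
        p->tlb_ways[1]  = 0;            /* zero ways = fully associative */
}

/* Reserve the shared array and hand everything to KVM before the first
 * KVM_RUN.  Error handling (including a failed calloc) is omitted. */
static int config_tlb(int vcpu_fd, struct kvmppc_book3e_tlb_params *p)
{
        uint32_t total = 512 + 16;      /* must match tlb_sizes[] above */
        struct kvm_config_tlb cfg;

        init_tlb_params(p);

        cfg.params    = (uintptr_t)p;
        cfg.array     = (uintptr_t)calloc(total,
                                sizeof(struct kvmppc_book3e_tlb_entry));
        cfg.mmu_type  = KVM_MMU_PPC_BOOK3E_NOHV;
        cfg.array_len = total * sizeof(struct kvmppc_book3e_tlb_entry);

        return ioctl(vcpu_fd, KVM_CONFIG_TLB, &cfg);
}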

> > - The hash for determining set number is:
> >     (MAS2[EPN] / page_size) & (num_sets - 1)
> >   where "page_size" is the smallest page size supported by the TLB, and
> >   "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
> >   If a book3e chip is made for which a different hash is needed, a new
> >   MMU type must be used, to ensure that userspace and KVM agree on layout.
> 
> Please state the size explicitly then. It's 1k, right?

It's 4K on Freescale chips.  We should probably implement sregs first, in
which case qemu can read the MMU config registers to find out the minimum
supported page size.

If we specify 4K here, we should probably just go ahead and stick FSL in
the MMU type name.  Specifying the hash itself already makes me nervous
about claiming the more generic name.

-Scott


* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-14 17:11                                     ` [Qemu-devel] " Scott Wood
@ 2011-02-14 20:19                                       ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-14 20:19 UTC (permalink / raw)
  To: Scott Wood
  Cc: Yoder Stuart-B08248, <kvm-ppc@vger.kernel.org>,
	kvm@vger.kernel.org list, qemu-devel@nongnu.org List


On 14.02.2011, at 18:11, Scott Wood <scottwood@freescale.com> wrote:

> On Sun, 13 Feb 2011 23:43:40 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>>> struct kvmppc_book3e_tlb_entry {
>>>    union {
>>>        __u64 mas8_1;
>>>        struct {
>>>            __u32 mas8;
>>>            __u32 mas1;
>>>        };
>>>    };
>>>    __u64 mas2;
>>>    union {
>>>        __u64 mas7_3;
>>>        struct {
>>>            __u32 mas7;
>>>            __u32 mas3;
>>>        };
>>>    };
>>> };
>> 
>> Looks good to me, except for the anonymous unions and structs of course. Avi dislikes those :). 
> 
> :-(
> 
>> Is there any obvious reason we need to have unions in the first place? The
>> compiler should be clever enough to pick the right size accessors when
>> writing/reading masked  __u64 entries, no?
> 
> Yes, the intent was just to make it easier to access individual mas
> registers, and reuse existing bit declarations that are defined relative
> to individual registers.
> 
> Why clutter up the source code when the compiler can deal with it?  Same
> applies to the anonymous unions/structs -- it's done that way to present
> the fields in the most straightforward manner (equivalent to how the SPRs
> themselves are named and aliased).
> 
> If it's a firm NACK on the existing structure, I think I'd rather just drop
> the paired versions altogether (but still order them this way so it can be
> changed back if there's any desire to in the future, without breaking
> compatibility).

There's no nack here :). The only thing that needs to change is the anonymous part, as that's a gnu extension. Just name the structs and unions and all is well.

The reason I was asking is that I assumed the code would end up being easier, not more complex, without the u32s. In fact, it probably would. I'll leave the final decision to you: access things either by entry->u81.split.mas8 or by entry->mas8_1 & MAS8_1_MAS8_MASK.
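
Spelled out, the named variant could look like this (u81/u73/split and the
mask are invented for illustration; the mask assumes the big-endian layout,
where mas8 is the high word of mas8_1):

#include <linux/types.h>

/* Hypothetical naming of the anonymous members, for comparison. */
struct kvmppc_book3e_tlb_entry {
        union {
                __u64 mas8_1;
                struct {
                        __u32 mas8;     /* high word, big-endian */
                        __u32 mas1;
                } split;
        } u81;
        __u64 mas2;
        union {
                __u64 mas7_3;
                struct {
                        __u32 mas7;
                        __u32 mas3;
                } split;
        } u73;
};

#define MAS8_1_MAS8_MASK 0xffffffff00000000ULL  /* illustrative */

/* style 1: entry->u81.split.mas8                         */
/* style 2: (entry->u81.mas8_1 & MAS8_1_MAS8_MASK) >> 32  */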

> 
>> The struct name should also have
>> a version indicator - it's the data descriptor for only a single specific
>> mmu_type, right?
> 
> It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

Even fictional future changes to the tlb layout? Either way, we can name that differently when it comes.

> 
>>> For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is
>>> present but not supported.
>>> 
>>> struct kvmppc_book3e_tlb_params {
>>>    /*
>>>     * book3e defines 4 TLBs.  Individual implementations may have
>>>     * fewer.  TLBs that do not already exist on the target must be
>>>     * configured with a size of zero.  A tlb_ways value of zero means
>>>     * the array is fully associative.  Only TLBs that are already
>>>     * set-associative on the target may be configured with a different
>>>     * associativity.  A set-associative TLB may not exceed 255 ways.
>>>     *
>>>     * KVM will adjust TLBnCFG based on the sizes configured here,
>>>     * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
>>>     * set to zero.
>>>     *
>>>     * The size of any TLB that is set-associative must be a multiple of
>>>     * the number of ways, and the number of sets must be a power of two.
>>>     *
>>>     * The page sizes supported by a TLB shall be determined by reading
>>>     * the TLB configuration registers.  This is not adjustable by userspace.
>>>     * [Note: need sregs]
>>>     */
>>>    __u32 tlb_sizes[4];
>>>    __u8 tlb_ways[4];
>>>    __u32 reserved[11];
>>> };
>>> 
>>> KVM_CONFIG_TLB
>>> --------------
>>> 
>>> Capability: KVM_CAP_SW_TLB
>>> Type: vcpu ioctl
>>> Parameters: struct kvm_config_tlb (in)
>>> Returns: 0 on success
>>>        -1 on error
>>> 
>>> struct kvm_config_tlb {
>>>    __u64 params;
>>>    __u64 array;
>>>    __u32 mmu_type;
>>>    __u32 array_len;
>> 
>> Some reserved bits please. IIRC Avi also really likes it when in addition to reserved fields there's also a "features" int that indicates which reserved fields are actually usable. Shouldn't hurt here either, right?
> 
> params itself is a versioned struct, with reserved bits.  Do we really need
> it here as well?

Right. Probably not :).

> 
>>> - The hash for determining set number is:
>>>    (MAS2[EPN] / page_size) & (num_sets - 1)
>>>  where "page_size" is the smallest page size supported by the TLB, and
>>>  "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
>>>  If a book3e chip is made for which a different hash is needed, a new
>>>  MMU type must be used, to ensure that userspace and KVM agree on layout.
>> 
>> Please state the size explicitly then. It's 1k, right?
> 
> It's 4K on Freescale chips.  We should probably implement sregs first, in
> which case qemu can read the MMU config registers to find out the minimum
> supported page size.
> 
> If we specify 4K here, we should probably just go ahead and stick FSL in
> the MMU type name.  Specifying the hash itself already makes me nervous
> about claiming the more generic name.

Yup, sounds good :).


Alex

> 

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-14 20:19                                       ` [Qemu-devel] " Alexander Graf
@ 2011-02-14 21:16                                         ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-14 21:16 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Yoder Stuart-B08248, <kvm-ppc@vger.kernel.org>,
	kvm@vger.kernel.org list, qemu-devel@nongnu.org List

On Mon, 14 Feb 2011 21:19:19 +0100
Alexander Graf <agraf@suse.de> wrote:

> There's no nack here :). The only thing that needs to change is the anonymous part, as that's a gnu extension. Just name the structs and unions and all is well.

Ah, I thought it was an aesthetic objection -- didn't realize it was a
GNUism.  Oh well.

> The reason I was asking is that I assumed the code would end up being easier, not more complex without the u32s. In fact, it probably would. I'll leave the final decision if you want to access things by entry->u81.split.mas8 or entry->mas8_1 & MAS8_1_MAS8_MASK.

After sending that, I was thinking that mas7_3 is more naturally used
as a pair, so I'd stick with the u64 there.

I think mas8_1 benefits less from the pairing, though -- it's only really
useful if you're going to put the value directly in hardware, which we
won't.
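
The pairing pays off for mas7_3 because MAS7 carries the upper physical-
address bits and MAS3 the lower RPN bits plus attribute/permission bits,
so the combined field reads as one address; a sketch (the helper name and
the 4K mask are illustrative):

#include <linux/types.h>

/* Illustrative: with the big-endian layout, mas7_3 has MAS7 in the
 * high word and MAS3 in the low word, i.e. the real address OR'd with
 * MAS3's low attribute/permission bits. */
static inline __u64 make_mas7_3(__u64 real_addr, __u32 mas3_flags)
{
        return (real_addr & ~0xfffULL) | mas3_flags;    /* 4K page */
}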

> >> The struct name should also have
> >> a version indicator - it's the data descriptor for only a single specific
> >> mmu_type, right?
> > 
> > It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.
> 
> Even fictional future changes to the tlb layout?

No, those need a new MMU type ID.

> >> Please state the size explicitly then. It's 1k, right?
> > 
> > It's 4K on Freescale chips.  We should probably implement sregs first, in
> > which case qemu can read the MMU config registers to find out the minimum
> > supported page size.
> > 
> > If we specify 4K here, we should probably just go ahead and stick FSL in
> > the MMU type name.  Specifying the hash itself already makes me nervous
> > about claiming the more generic name.
> 
> Yup, sounds good :).

Which one, "read the MMU config registers" or "specify 4K and stick FSL in
the name"?

-Scott


* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-14 21:16                                         ` [Qemu-devel] " Scott Wood
@ 2011-02-14 23:39                                           ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-14 23:39 UTC (permalink / raw)
  To: Scott Wood
  Cc: Yoder Stuart-B08248, <kvm-ppc@vger.kernel.org>,
	kvm@vger.kernel.org list, qemu-devel@nongnu.org List


On 14.02.2011, at 22:16, Scott Wood wrote:

> On Mon, 14 Feb 2011 21:19:19 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>> There's no nack here :). The only thing that needs to change is the anonymous part, as that's a gnu extension. Just name the structs and unions and all is well.
> 
> Ah, I thought it was an aesthetic objection -- didn't realize it was a
> GNUism.  Oh well.

Maybe it was some other random extension, but it's certainly less compatible :).

> 
>> The reason I was asking is that I assumed the code would end up being easier, not more complex without the u32s. In fact, it probably would. I'll leave the final decision if you want to access things by entry->u81.split.mas8 or entry->mas8_1 & MAS8_1_MAS8_MASK.
> 
> After sending that, I was thinking that mas7_3 is more naturally used
> as a pair, so I'd stick with the u64 there.
> 
> I think mas8_1 benefits less from the pairing, though -- it's only really
> useful if you're going to put the value directly in hardware, which we
> won't.

Sounds good. Make it 2 u32s then :).
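
So the entry would land on something like this (the struct name is still
provisional, see below; keeping mas8/mas1 adjacent preserves the option of
re-pairing them later without breaking compatibility):

#include <linux/types.h>

/* Sketch of the agreed shape -- the posted v2 is authoritative. */
struct kvmppc_booke_tlb_entry {
        __u32 mas8;
        __u32 mas1;
        __u64 mas2;
        __u64 mas7_3;
};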

> 
>>>> The struct name should also have
>>>> a version indicator - it's the data descriptor for only a single specific
>>>> mmu_type, right?
>>> 
>>> It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.
>> 
>> Even fictional future changes to the tlb layout?
> 
> No, those need a new MMU type ID.

... which means they need a new name, but then booke_tlb_entry is taken.

> 
>>>> Please state the size explicitly then. It's 1k, right?
>>> 
>>> It's 4K on Freescale chips.  We should probably implement sregs first, in
>>> which case qemu can read the MMU config registers to find out the minimum
>>> supported page size.
>>> 
>>> If we specify 4K here, we should probably just go ahead and stick FSL in
>>> the MMU type name.  Specifying the hash itself already makes me nervous
>>> about claiming the more generic name.
>> 
>> Yup, sounds good :).
> 
> Which one, "read the MMU config registers" or "specify 4K and stick FSL in
> the name"?

The "specify 4k and stick fsl in the name" part. If we simply always define it to 4k for all currently supported clients of the interface (e500) we should be good, no? No need for evaluations then.


Alex


* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-14 23:39                                           ` [Qemu-devel] " Alexander Graf
@ 2011-02-14 23:49                                             ` Scott Wood
  -1 siblings, 0 replies; 112+ messages in thread
From: Scott Wood @ 2011-02-14 23:49 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Yoder Stuart-B08248, <kvm-ppc@vger.kernel.org>,
	kvm@vger.kernel.org list, qemu-devel@nongnu.org List

On Tue, 15 Feb 2011 00:39:51 +0100
Alexander Graf <agraf@suse.de> wrote:

> On 14.02.2011, at 22:16, Scott Wood wrote:
> 
> > On Mon, 14 Feb 2011 21:19:19 +0100
> > Alexander Graf <agraf@suse.de> wrote:
> >>>> The struct name should also have
> >>>> a version indicator - it's the data descriptor for only a single specific
> >>>> mmu_type, right?
> >>> 
> >>> It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.
> >> 
> >> Even fictional future changes to the tlb layout?
> > 
> > No, those need a new MMU type ID.
> 
> ... which means they need a new name, but then booke_tlb_entry is taken.

The MMU ID name and struct name are about equally generic.  I'll add FSL to
both.  If there's a v2 down the road, then that goes in both.
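
For instance (purely illustrative until the reposted RFC):

#define KVM_MMU_PPC_FSL_BOOKE_NOHV      0x1
#define KVM_MMU_PPC_FSL_BOOKE_HV        0x2

struct kvmppc_fsl_booke_tlb_entry;      /* was kvmppc_book3e_tlb_entry  */
struct kvmppc_fsl_booke_tlb_params;     /* was kvmppc_book3e_tlb_params */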

> > Which one, "read the MMU config registers" or "specify 4K and stick FSL in
> > the name"?
> 
> The "specify 4k and stick fsl in the name" part.

OK.

-Scott

* Re: RFC: New API for PPC for vcpu mmu access
  2011-02-14 23:49                                             ` [Qemu-devel] " Scott Wood
@ 2011-02-15  0:00                                               ` Alexander Graf
  -1 siblings, 0 replies; 112+ messages in thread
From: Alexander Graf @ 2011-02-15  0:00 UTC (permalink / raw)
  To: Scott Wood
  Cc: Yoder Stuart-B08248, <kvm-ppc@vger.kernel.org>,
	kvm@vger.kernel.org list, qemu-devel@nongnu.org List


On 15.02.2011, at 00:49, Scott Wood wrote:

> On Tue, 15 Feb 2011 00:39:51 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>> On 14.02.2011, at 22:16, Scott Wood wrote:
>> 
>>> On Mon, 14 Feb 2011 21:19:19 +0100
>>> Alexander Graf <agraf@suse.de> wrote:
>>>>>> The struct name should also have
>>>>>> a version indicator - it's the data descriptor for only a single specific
>>>>>> mmu_type, right?
>>>>> 
>>>>> It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.
>>>> 
>>>> Even fictional future changes to the tlb layout?
>>> 
>>> No, those need a new MMU type ID.
>> 
>> ... which means they need a new name, but then booke_tlb_entry is taken.
> 
> The MMU ID name and struct name are about equally generic.  I'll add FSL to
> both.  If there's a v2 down the road, then that goes in both.

Very good - thank you!


Alex

* RFC: New API for PPC for vcpu mmu access
@ 2011-02-02 20:30 Yoder Stuart-B08248
  0 siblings, 0 replies; 112+ messages in thread
From: Yoder Stuart-B08248 @ 2011-02-02 20:30 UTC (permalink / raw)
  To: kvm-ppc, kvm, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 6996 bytes --]

Below is a proposal for a new API for PPC to allow KVM clients

to set MMU state in a vcpu.



BookE processors have one or more software managed TLBs and

currently there is no mechanism for Qemu to initialize

or access them.  This is needed for normal initialization

as well as debug.



There are 4 APIs:



-KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type

of MMU with KVM-- the type determines the size and format

of the data in the other APIs



-KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all

TLBs in the vcpu



-KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture

specifies the format of the MMU data passed in



-KVM_PPC_GET_TLB allows searching, reading a specific TLB entry,

or iterating over an entire TLB.  Some TLBs may have an unspecified

geometry and thus the need to be able to iterate in order

to dump the TLB.  The Power architecture specifies the format

of the MMU data



Feedback welcome.



Thanks,

Stuart Yoder



------------------------------------------------------------------



KVM PPC MMU API

---------------



User space can query whether the APIs to access the vcpu mmu

is available with the KVM_CHECK_EXTENSION API using

the KVM_CAP_PPC_MMU argument.



If the KVM_CAP_PPC_MMU return value is non-zero it specifies that

the following APIs are available:



   KVM_PPC_SET_MMU_TYPE

   KVM_PPC_INVALIDATE_TLB

   KVM_PPC_SET_TLBE

   KVM_PPC_GET_MMU





KVM_PPC_SET_MMU_TYPE

--------------------



Capability: KVM_CAP_PPC_SET_MMU_TYPE

Architectures: powerpc

Type: vcpu ioctl

Parameters: __u32 mmu_type (in)

Returns: 0 if specified MMU type is supported, else -1



Sets the MMU type.  Valid input values are:

   BOOKE_NOHV   0x1

   BOOKE_HV     0x2



A return value of 0x0 indicates that KVM supports

the specified MMU type.



KVM_PPC_INVALIDATE_TLB

----------------------



Capability: KVM_CAP_PPC_MMU

Architectures: powerpc

Type: vcpu ioctl

Parameters: none

Returns: 0 on success, -1 on error



Invalidates all TLB entries in all TLBs of the vcpu.



KVM_PPC_SET_TLBE

----------------



Capability: KVM_CAP_PPC_MMU

Architectures: powerpc

Type: vcpu ioctl

Parameters:

        For mmu types BOOKE_NOHV and BOOKE_HV : struct kvm_ppc_booke_mmu (in)

Returns: 0 on success, -1 on error



Sets an MMU entry in a virtual CPU.



For mmu types BOOKE_NOHV and BOOKE_HV:



      To write a TLB entry, set the mas fields of kvm_ppc_booke_mmu

      as per the Power architecture.



      struct kvm_ppc_booke_mmu {

            union {

                  __u64 mas0_1;

                  struct {

                        __u32 mas0;

                        __u32 mas1;

                  };

            };

            __u64 mas2;

            union {

                  __u64 mas7_3

                  struct {

                        __u32 mas7;

                        __u32 mas3;

                  };

            };

            union {

                  __u64 mas5_6

                  struct {

                        __u64 mas5;

                        __u64 mas6;

                  };

            }

            __u32 mas8;

      };



      For a mmu type of BOOKE_NOHV, the mas5 and mas8 fields

      in kvm_ppc_booke_mmu are present but not supported.





KVM_PPC_GET_TLB

---------------



Capability: KVM_CAP_PPC_MMU

Architectures: powerpc

Type: vcpu ioctl

Parameters: struct kvm_ppc_get_mmu (in/out)

Returns: 0 on success

         -1 on error

         errno = ENOENT when iterating and there are no more entries to read



Reads an MMU entry from a virtual CPU.



      struct kvm_ppc_get_mmu {

            /* in */

                void *mmu;

            __u32 flags;

                  /* a bitmask of flags to the API */

                    /*     TLB_READ_FIRST   0x1      */

                    /*     TLB_SEARCH       0x2      */

            /* out */

            __u32 max_entries;

      };



For mmu types BOOKE_NOHV and BOOKE_HV :



      The "void *mmu" field of kvm_ppc_get_mmu points to

        a struct of type "struct kvm_ppc_booke_mmu".



      If TLBnCFG[NENTRY] > 0 and TLBnCFG[ASSOC] > 0, the TLB has
      a known number of entries and associativity.  The mas0[ESEL]
      and mas2[EPN] fields specify which entry to read.

      If TLBnCFG[NENTRY] == 0, the number of TLB entries is
      undefined, and this API can be used to iterate over
      the entire TLB selected with TLBSEL in mas0.

      -To read a TLB entry:

         Set the following fields in the mmu struct (struct kvm_ppc_booke_mmu):
            flags=0
            mas0[TLBSEL] // select which TLB is being read
            mas0[ESEL]   // select which entry is being read
            mas2[EPN]    // effective address

         On return the following fields are updated as per the Power architecture:
            mas0
            mas1
            mas2
            mas3
            mas7
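
         For example, reading back entry 3 of TLB1 might look like this
         (same hypothetical helpers and assumptions as above):

      /* Read one entry: TLB1, entry 3.  The result is written back
       * into `tlbe` by KVM.
       */
      struct kvm_ppc_booke_mmu tlbe = { 0 };
      struct kvm_ppc_get_mmu get = {
            .mmu   = &tlbe,
            .flags = 0,
      };

      tlbe.mas0 = MAS0_TLBSEL(1) | MAS0_ESEL(3);
      tlbe.mas2 = ea & ~0xfffULL;          /* EPN of the entry */

      if (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &get) == 0) {
            /* tlbe.mas0..mas7 now hold the entry contents */
      }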
      -To iterate over a TLB (read all entries):

        To start an iteration sequence, set the following fields in
        the mmu struct (struct kvm_ppc_booke_mmu):
            flags=TLB_READ_FIRST
            mas0[TLBSEL]  // select which TLB is being read

        On return the following fields are updated:
            mas0           // set as per Power arch
            mas1           // set as per Power arch
            mas2           // set as per Power arch
            mas3           // set as per Power arch
            mas7           // set as per Power arch
            max_entries    // Contains an upper limit on the number of
                           // entries that may be returned.  A value of
                           // 0xffffffff means there is no meaningful
                           // upper bound.

        For subsequent calls to the API the following output fields must
        be passed back into the API unmodified:
            flags
            mas0
            mas2

        A return value of -1 with errno set to ENOENT indicates that
        there are no more entries to be read.
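
        A sketch of the full iteration loop (assumptions as above, plus
        <errno.h> for the ENOENT check; note the struct is passed back
        between calls without touching flags, mas0 or mas2):

      /* Dump every entry of TLB0 until KVM reports ENOENT. */
      struct kvm_ppc_booke_mmu tlbe = { 0 };
      struct kvm_ppc_get_mmu get = {
            .mmu   = &tlbe,
            .flags = TLB_READ_FIRST,
      };

      tlbe.mas0 = MAS0_TLBSEL(0);

      while (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &get) == 0) {
            /* consume tlbe.mas0..mas7 for this entry here */
      }
      if (errno != ENOENT)
            perror("KVM_PPC_GET_TLB");   /* real error, not end of TLB */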
      -To search for a TLB entry:

         Set the following fields in the mmu struct (struct kvm_ppc_booke_mmu):
            flags=TLB_SEARCH
            mas2[EPN]    // effective address to search for
            mas6         // set as per the Power arch
            mas5         // set as per the Power arch

         On return, the following fields are updated as per the Power architecture:
            mas0
            mas1
            mas2
            mas3
            mas7
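
         A search might be sketched as follows (MAS6_SPID/MAS6_SAS are
         hypothetical helpers for the BookE MAS6 fields; ea, pid and as
         are assumed to exist):

      /* Search for the entry translating effective address `ea`
       * for a given PID and address space, as with a tlbsx.
       */
      struct kvm_ppc_booke_mmu tlbe = { 0 };
      struct kvm_ppc_get_mmu get = {
            .mmu   = &tlbe,
            .flags = TLB_SEARCH,
      };

      tlbe.mas2 = ea & ~0xfffULL;                 /* EPN to look up */
      tlbe.mas6 = MAS6_SPID(pid) | MAS6_SAS(as);  /* search criteria */

      if (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &get) == 0) {
            /* per the architecture, mas1[V] indicates a hit */
      }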

