From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: linuxppc-dev@lists.ozlabs.org,
	David Gibson <david@gibson.dropbear.id.au>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>,
	Gavin Shan <gwshan@linux.vnet.ibm.com>,
	Wei Yang <weiyang@linux.vnet.ibm.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH kernel v10 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2
Date: Thu, 14 May 2015 16:08:58 +1000	[thread overview]
Message-ID: <55543BFA.9030606@ozlabs.ru> (raw)
In-Reply-To: <1431552637.3625.78.camel@redhat.com>

On 05/14/2015 07:30 AM, Alex Williamson wrote:
> On Tue, 2015-05-12 at 01:39 +1000, Alexey Kardashevskiy wrote:
>> The existing implementation accounts the whole DMA window in
>> the locked_vm counter. This is going to get worse with multiple
>> containers and huge DMA windows. Also, real-time accounting would require
>> additional tracking of accounted pages due to the page size difference -
>> the IOMMU uses 4K pages while the system uses 4K or 64K pages.
>>
>> Another issue is that actual page pinning/unpinning happens on every
>> DMA map/unmap request. This does not affect performance much now as
>> we spend far too much time switching context between
>> guest/userspace/host, but this will start to matter when we add in-kernel
>> DMA map/unmap acceleration.
>>
>> This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
>> The new IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
>> 2 new ioctls to register/unregister DMA memory -
>> VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
>> which receive the user space address and size of a memory region which
>> needs to be pinned/unpinned and counted in locked_vm.
>> The new IOMMU splits physical page pinning and TCE table updates
>> into 2 different operations. It requires:
>> 1) guest pages to be registered first;
>> 2) subsequent map/unmap requests to work only with pre-registered memory.
>> For the default single window case this means that the entire guest RAM
>> (instead of 2GB) needs to be pinned before using VFIO.
>> When a huge DMA window is added, no additional pinning will be
>> required; otherwise it would be guest RAM + 2GB.
>>
>> The new memory registration ioctls are not supported by
>> VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
>> will require memory to be preregistered in order to work.
>>
>> The accounting is done per user process.
>>
>> This advertises the v2 SPAPR TCE IOMMU and restricts what userspace
>> can do with v1 or v2 IOMMUs.
>>
>> In order to support memory pre-registration, we need a way to track
>> the use of every registered memory region and only allow unregistration
>> if a region is not in use anymore. So we need a way to tell which
>> region a just-cleared TCE came from.
>>
>> This adds a userspace view of the TCE table into the iommu_table struct.
>> It contains userspace addresses, one per TCE entry. The table is only
>> allocated when ownership of an IOMMU group is taken, which means
>> it is only used from outside the powernv code (such as by VFIO).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> [aw: for the vfio related changes]
>> Acked-by: Alex Williamson <alex.williamson@redhat.com>
>> ---
>>
>> Alex, should I remove your "acked-by" in cases like this and
>> get another one?
>
>
> Generally if it's more than a trivial change, you'll want fresh acks.
>
>> ---
>> Changes:
>> v10:
>> * moved it_userspace allocation to vfio_iommu_spapr_tce as it is
>> a VFIO-specific thing
>> * squashed "powerpc/iommu: Add userspace view of TCE table" into this
>> as it is a part of IOMMU v2
>> * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
>> * fixed some function names to start with "tce_iommu_" rather
>> than just "tce_"
>> * as mm_iommu_mapped_inc() can now fail, check for the return code
>>
>> v9:
>> * s/tce_get_hva_cached/tce_iommu_use_page_v2/
>>
>> v7:
>> * now memory is registered per mm (i.e. process)
>> * moved memory registration code to powerpc/mmu
>> * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
>> * limited new ioctls to v2 IOMMU
>> * updated doc
>> * unsupported ioctls return -ENOTTY instead of -EPERM
>>
>> v6:
>> * tce_get_hva_cached() returns hva via a pointer
>>
>> v4:
>> * updated docs
>> * s/kzmalloc/vzalloc/
>> * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
>> replaced offset with index
>> * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
>> and removed duplicating vfio_iommu_spapr_register_memory
>> ---
>>   Documentation/vfio.txt              |  31 ++-
>>   arch/powerpc/include/asm/iommu.h    |   6 +
>>   drivers/vfio/vfio_iommu_spapr_tce.c | 516 ++++++++++++++++++++++++++++++------
>>   include/uapi/linux/vfio.h           |  27 ++
>>   4 files changed, 494 insertions(+), 86 deletions(-)
>>
>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>> index 96978ec..7dcf2b5 100644
>> --- a/Documentation/vfio.txt
>> +++ b/Documentation/vfio.txt
>> @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
>>
>>   This implementation has some specifics:
>>
>> -1) Only one IOMMU group per container is supported as an IOMMU group
>> -represents the minimal entity which isolation can be guaranteed for and
>> -groups are allocated statically, one per a Partitionable Endpoint (PE)
>> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
>> +container is supported as an IOMMU table is allocated at boot time,
>> +one table per IOMMU group which is a Partitionable Endpoint (PE)
>>   (PE is often a PCI domain but not always).
>> +Newer systems (POWER8 with IODA2) have an improved hardware design which
>> +removes this limitation and allows multiple IOMMU groups per VFIO container.
>>
>>   2) The hardware supports so called DMA windows - the PCI address range
>>   within which DMA transfer is allowed, any attempt to access address space
>> @@ -427,6 +429,29 @@ The code flow from the example above should be slightly changed:
>>
>>   	....
>>
>> +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
>> +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
>> +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
>> +(which are unsupported in v1 IOMMU).
>> +
>> +PPC64 paravirtualized guests generate a lot of map/unmap requests,
>> +and the handling of those includes pinning/unpinning pages and updating
>> +mm::locked_vm counter to make sure we do not exceed the rlimit.
>> +The v2 IOMMU splits accounting and pinning into separate operations:
>> +
>> +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
>> +receive a user space address and size of the block to be pinned.
>> +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
>> +be called with the exact address and size used for registering
>> +the memory block. The userspace is not expected to call these often.
>> +The ranges are stored in a linked list in a VFIO container.
>> +
>> +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
>> +IOMMU table and do not do pinning; instead they check that the userspace
>> +address is from a pre-registered range.
>> +
>> +This separation helps in optimizing DMA for guests.
>> +
>>   -------------------------------------------------------------------------------
>>
>>   [1] VFIO was originally an acronym for "Virtual Function I/O" in its
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index c8bad21..763c041 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -113,10 +113,16 @@ struct iommu_table {
>>   	unsigned long  it_page_shift;/* table iommu page size */
>>   #ifdef CONFIG_IOMMU_API
>>   	struct list_head it_group_list;/* List of iommu_table_group_link */
>> +	unsigned long *it_userspace; /* userspace view of the table */
>>   #endif
>>   	struct iommu_table_ops *it_ops;
>>   };
>>
>> +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
>> +		((tbl)->it_userspace ? \
>> +			&((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
>> +			NULL)
>> +
>>   /* Pure 2^n version of get_order */
>>   static inline __attribute_const__
>>   int get_iommu_order(unsigned long size, struct iommu_table *tbl)
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 8943b29..e7e8db3 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -19,8 +19,10 @@
>>   #include <linux/uaccess.h>
>>   #include <linux/err.h>
>>   #include <linux/vfio.h>
>> +#include <linux/vmalloc.h>
>>   #include <asm/iommu.h>
>>   #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>>
>>   #define DRIVER_VERSION  "0.1"
>>   #define DRIVER_AUTHOR   "aik@ozlabs.ru"
>> @@ -81,6 +83,11 @@ static void decrement_locked_vm(long npages)
>>    * into DMA'ble space using the IOMMU
>>    */
>>
>> +struct tce_iommu_group {
>> +	struct list_head next;
>> +	struct iommu_group *grp;
>> +};
>> +
>>   /*
>>    * The container descriptor supports only a single group per container.
>>    * Required by the API as the container is not supplied with the IOMMU group
>> @@ -88,11 +95,98 @@ static void decrement_locked_vm(long npages)
>>    */
>>   struct tce_container {
>>   	struct mutex lock;
>> -	struct iommu_group *grp;
>>   	bool enabled;
>>   	unsigned long locked_pages;
>> +	bool v2;
>> +	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>> +	struct list_head group_list;
>
> You're wasting space by not packing your bools next to each other.

I'll fix it :)
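To illustrate the point about packing, here is a minimal userspace sketch (stand-in types only - struct mutex and struct list_head are modelled as plain words, so the exact sizes are illustrative assumptions, not the kernel layout) showing why moving the two bools next to each other removes a padding hole:

```c
#include <assert.h>
#include <stddef.h>

/* Field names mirror tce_container from the patch; types are simplified. */
struct tce_container_unpacked {
	long lock;		/* stand-in for struct mutex */
	_Bool enabled;		/* followed by padding up to 'locked_pages' */
	unsigned long locked_pages;
	_Bool v2;		/* separated from 'enabled': a second hole */
	void *tables[8];	/* IOMMU_TABLE_GROUP_MAX_TABLES stand-in */
	long group_list[2];	/* stand-in for struct list_head */
};

struct tce_container_packed {
	long lock;
	_Bool enabled;
	_Bool v2;		/* adjacent bools share one padded slot */
	unsigned long locked_pages;
	void *tables[8];
	long group_list[2];
};
```

On a typical LP64 ABI the unpacked variant pays for two separate 7-byte holes, while the packed one pays once.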


>>   };
>>
>> +static long tce_iommu_unregister_pages(struct tce_container *container,
>> +		__u64 vaddr, __u64 size)
>> +{
>> +	long ret;
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>> +		return -EINVAL;
>> +
>> +	mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
>> +	if (!mem)
>> +		return -EINVAL;
>> +
>> +	ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
>> +	if (!ret)
>> +		ret = mm_iommu_put(mem);
>
> Should "put" really be able to fail?

tce_iommu_unregister_pages() is called from an ioctl, so yes, userspace 
deserves to know that the memory will remain pinned.

> I think you really need to examine
> your reference model, mm_iommu_put() looks pretty suspicious.  If
> there's an implicit reference by being mapped, it should be handled that
> way, not via an atomic that gets decremented then corrected.

One implicit reference (*) in @mapped (from atomic_set(&mem->mapped, 1)) is 
only there to protect against the race between checking for active mappings 
and putting the reference of a registered memory descriptor.

If tce_iommu_unregister_pages() is called when @mapped > 1, then -EBUSY is 
returned.

If tce_iommu_unregister_pages() is called when @mapped == 1 or 0, then 
there is no active mapping, @mapped becomes zero (if it is not already) and 
we can safely put the descriptor. All subsequent mm_iommu_mapped_inc() 
calls will fail to increment @mapped and return an error.
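The protocol above can be modelled in a self-contained userspace sketch (an assumption-based model using C11 atomics, not the kernel code - mapped_inc/mapped_dec/try_unregister are hypothetical stand-ins for mm_iommu_mapped_inc/dec and the mm_iommu_put() check):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long mapped;

static void mem_register(void)
{
	atomic_store(&mapped, 1);	/* the implicit reference (*) */
}

static bool mapped_inc(void)	/* models mm_iommu_mapped_inc() */
{
	long old = atomic_load(&mapped);

	/* refuse to resurrect: once 0, the region is being torn down */
	while (old > 0)
		if (atomic_compare_exchange_weak(&mapped, &old, old + 1))
			return true;
	return false;
}

static void mapped_dec(void)	/* models mm_iommu_mapped_dec() */
{
	atomic_fetch_sub(&mapped, 1);
}

static bool try_unregister(void)	/* models the mm_iommu_put() check */
{
	long one = 1;

	/* drop the implicit reference only if it is the last one */
	return atomic_compare_exchange_strong(&mapped, &one, 0);
}
```

The single compare-and-swap in try_unregister() is what closes the race window: checking for active mappings and consuming the implicit reference happen as one atomic step.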

After looking at this more, there are 2 bugs though:

--- a/arch/powerpc/mm/mmu_context_hash64_iommu.c
+++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c
@@ -178,9 +178,9 @@ EXPORT_SYMBOL_GPL(mm_iommu_get);

  long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
  {
-       if (1 != atomic_dec_if_positive(&mem->mapped)) {
+       if (atomic_dec_if_positive(&mem->mapped) > 1) {
                 /* There are mappings, exit */
-               atomic_inc(&mem->mapped);
+               atomic_inc_not_zero(&mem->mapped);
                 return -EBUSY;
         }

s/1!=/1</ is to allow putting the second/third/... reference of mem->kref, and 
atomic_inc_not_zero() is to not elevate the counter if another thread 
managed to release the very last mapping and decrement my implicit 
reference (*).

Am I still missing something here?

> That's not only not atomic, but causes lots of fallout with references that don't
> get released.
> Notice how you don't even check the return value at the
> call location of this function?

Ouch. This is a bug. @ret needs to be returned to userspace.
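A self-contained sketch of that fix (the stub names are hypothetical; only the control flow mirrors the bug): the ioctl path must propagate the helper's return value instead of discarding it:

```c
#include <assert.h>

#define EBUSY_ERR (-16)	/* stand-in for -EBUSY */

/* Stand-in for tce_iommu_unregister_pages(): fails while mappings exist. */
static long stub_unregister_pages(int mappings_active)
{
	return mappings_active ? EBUSY_ERR : 0;
}

static long ioctl_buggy(int mappings_active)
{
	stub_unregister_pages(mappings_active);	/* result silently dropped */
	return 0;	/* userspace always sees success */
}

static long ioctl_fixed(int mappings_active)
{
	/* propagate @ret so userspace learns the memory stays pinned */
	return stub_unregister_pages(mappings_active);
}
```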

> How many references does that
> potentially leave and where do they get resolved?

Every successful "register" should be coupled with a successful "unregister" 
(if it failed, just repeat). If this does not happen, memory remains pinned 
until the process exits, and is then unpinned unconditionally.
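As a usage sketch of that pairing (assuming the structure layout matches the include/uapi/linux/vfio.h additions in this series; the ioctl numbers and container fd handling are omitted), userspace passes the identical vaddr/size pair to both register and unregister, since splitting a registered block is not supported:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the uapi struct added by this patch. */
struct vfio_iommu_spapr_register_memory {
	uint32_t argsz;
	uint32_t flags;
	uint64_t vaddr;	/* userspace address of the block to pin */
	uint64_t size;	/* must match exactly on unregister */
};

static struct vfio_iommu_spapr_register_memory
make_reg(uint64_t vaddr, uint64_t size)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = vaddr,	/* page-aligned, or the kernel rejects it */
		.size  = size,
	};

	return reg;
}
```

The same filled structure would be handed to ioctl() with VFIO_IOMMU_SPAPR_REGISTER_MEMORY and later VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY from the real header.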


-- 
Alexey

2015-05-11 15:39   ` Alexey Kardashevskiy
2015-05-11 15:39 ` [PATCH kernel v10 32/34] powerpc/mmu: Add userspace-to-physical addresses translation cache Alexey Kardashevskiy
2015-05-11 15:39   ` Alexey Kardashevskiy
2015-05-11 15:39 ` [PATCH kernel v10 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2 Alexey Kardashevskiy
2015-05-11 15:39   ` Alexey Kardashevskiy
2015-05-13 21:30   ` Alex Williamson
2015-05-13 21:30     ` Alex Williamson
2015-05-14  6:08     ` Alexey Kardashevskiy [this message]
2015-05-14  6:08       ` Alexey Kardashevskiy
2015-05-11 15:39 ` [PATCH kernel v10 34/34] vfio: powerpc/spapr: Support Dynamic DMA windows Alexey Kardashevskiy
2015-05-11 15:39   ` Alexey Kardashevskiy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55543BFA.9030606@ozlabs.ru \
    --to=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=benh@kernel.crashing.org \
    --cc=david@gibson.dropbear.id.au \
    --cc=gwshan@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=paulus@samba.org \
    --cc=weiyang@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
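The requirements above (a Subject: header, a blank line before the body, and threading via an In-Reply-To header carrying this message's Message-ID) can be checked on a hand-written reply before sending. A minimal sketch: the Message-ID and To: address come from this thread, while the From: address, filename, and body text are illustrative placeholders only.

```shell
# Compose a minimal plain-text reply by hand. Only the Message-ID
# <55543BFA.9030606@ozlabs.ru> and the To: address are taken from the
# thread; everything else is a placeholder.
cat > reply.eml <<'EOF'
From: you@example.com
To: aik@ozlabs.ru
Subject: Re: [PATCH kernel v10 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2
In-Reply-To: <55543BFA.9030606@ozlabs.ru>

Interleaved comments go here, below the quoted text they address.
EOF

# Sanity-check the two requirements the footer calls out: a Subject:
# header at the top, and correct threading via In-Reply-To.
grep -q '^Subject: Re:' reply.eml && echo subject-ok
grep -q '^In-Reply-To: <55543BFA.9030606@ozlabs.ru>$' reply.eml && echo threading-ok
```

A message written this way threads correctly under the original in archives and mail clients, whichever MUA or sendmail wrapper actually delivers it.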