* [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
@ 2017-02-22  4:43 Alexey Kardashevskiy
  2017-02-27  0:53 ` Gavin Shan
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-22  4:43 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Russell Currey, David Gibson

The IODA2 specification says that a 64-bit DMA address cannot use the top
4 bits (3 are reserved and one is a "TVE select"); the bottom page_shift
bits cannot be used for multilevel table addressing either.

The existing IODA2 table allocation code aligns the minimum TCE table
size to PAGE_SIZE, so in the case of 64K system pages and 4K IOMMU pages
we have 64-4-12=48 bits. Since a 64K page stores 8192 TCEs, i.e. needs
13 bits, the maximum number of levels is 48/13 = 3 (rounded down), so we
physically cannot address more and EEH happens on DMA accesses.

This adds a check that fails with -EINVAL if too many levels are requested.

It is still possible to have 5 levels in the case of 4K system page size.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

The alternative would be allocating TCE tables as big as PAGE_SIZE but
only using part of each, but this would somewhat complicate the code
that accounts for the overall amount of memory used for the TCE table.

Or kmem_cache_create() could be used to allocate TCE table levels only
as big as we really need, but that API does not seem to support NUMA nodes.

In reality, even 3 levels give us way too much addressable memory.
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 24fa2de2a0af..1e92ec954321 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2631,6 +2631,9 @@ static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 	level_shift = entries_shift + 3;
 	level_shift = max_t(unsigned, level_shift, PAGE_SHIFT);
 
+	if ((level_shift - 3) * levels + page_shift >= 60)
+		return -EINVAL;
+
 	/* Allocate TCE table */
 	addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
 			levels, tce_table_size, &offset, &total_allocated);
-- 
2.11.0
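
The check added by the patch can be sketched as a standalone helper to
make the bit budget easier to follow. This is only an illustration under
stated assumptions: tce_levels_fit() and IODA2_USABLE_ADDR_BITS are
hypothetical names that do not exist in the kernel, and level_shift is
taken as an input (already clamped to at least PAGE_SHIFT, as in the
hunk context) rather than derived from the window size.

```c
#include <stdbool.h>

/* Address-bit budget from the commit message: 64 bits minus the top 4.
 * Illustrative name, not a kernel macro. */
#define IODA2_USABLE_ADDR_BITS 60

/*
 * Hypothetical helper mirroring the added check. Each table level is
 * 2^level_shift bytes and each TCE is 8 bytes (2^3), so one level
 * resolves (level_shift - 3) address bits. The IOMMU page offset
 * takes another page_shift bits.
 */
static bool tce_levels_fit(unsigned level_shift, unsigned levels,
			   unsigned page_shift)
{
	return (level_shift - 3) * levels + page_shift < IODA2_USABLE_ADDR_BITS;
}
```

With 64K system pages (level_shift = 16) and 4K IOMMU pages
(page_shift = 12), 3 levels need 13*3+12 = 51 bits and pass, while 4
levels would need 64 bits and are rejected. With 4K system pages
(level_shift = 12, assuming entries_shift + 3 does not exceed it),
5 levels need 9*5+12 = 57 bits and still pass.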


* Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-02-22  4:43 [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested Alexey Kardashevskiy
@ 2017-02-27  0:53 ` Gavin Shan
  2017-02-27 11:00 ` Michael Ellerman
  2017-03-14 11:45 ` [kernel] " Michael Ellerman
  2 siblings, 0 replies; 9+ messages in thread
From: Gavin Shan @ 2017-02-27  0:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Gavin Shan, Russell Currey, David Gibson

On Wed, Feb 22, 2017 at 03:43:59PM +1100, Alexey Kardashevskiy wrote:
>The IODA2 specification says that a 64 DMA address cannot use top 4 bits
>(3 are reserved and one is a "TVE select"); bottom page_shift bits
>cannot be used for multilevel table addressing either.
>
>The existing IODA2 table allocation code aligns the minimum TCE table
>size to PAGE_SIZE so in the case of 64K system pages and 4K IOMMU pages,
>we have 64-4-12=48 bits. Since 64K page stores 8192 TCEs, i.e. needs
>13 bits, the maximum number of levels is 48/13 = 3 so we physically
>cannot address more and EEH happens on DMA accesses.
>
>This adds a check that too many levels were requested.
>
>It is still possible to have 5 levels in the case of 4K system page size.
>
>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>---

Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>


* Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-02-22  4:43 [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested Alexey Kardashevskiy
  2017-02-27  0:53 ` Gavin Shan
@ 2017-02-27 11:00 ` Michael Ellerman
  2017-02-28  0:54   ` Alexey Kardashevskiy
  2017-03-05 23:03   ` Benjamin Herrenschmidt
  2017-03-14 11:45 ` [kernel] " Michael Ellerman
  2 siblings, 2 replies; 9+ messages in thread
From: Michael Ellerman @ 2017-02-27 11:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Gavin Shan

Alexey Kardashevskiy <aik@ozlabs.ru> writes:

> The IODA2 specification says that a 64 DMA address cannot use top 4 bits
> (3 are reserved and one is a "TVE select"); bottom page_shift bits
> cannot be used for multilevel table addressing either.
>
> The existing IODA2 table allocation code aligns the minimum TCE table
> size to PAGE_SIZE so in the case of 64K system pages and 4K IOMMU pages,
> we have 64-4-12=48 bits. Since 64K page stores 8192 TCEs, i.e. needs
> 13 bits, the maximum number of levels is 48/13 = 3 so we physically
> cannot address more and EEH happens on DMA accesses.
>
> This adds a check that too many levels were requested.
>
> It is still possible to have 5 levels in the case of 4K system page size.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>
> The alternative would be allocating TCE tables as big as PAGE_SIZE but
> only using parts of it but this would complicate a bit bits of code
> responsible for overall amount of memory used for TCE table.
>
> Or kmem_cache_create() could be used to allocate as big TCE table levels
> as we really need but that API does not seem to support NUMA nodes.

kmem_cache_alloc_node() ?

cheers


* Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-02-27 11:00 ` Michael Ellerman
@ 2017-02-28  0:54   ` Alexey Kardashevskiy
  2017-03-05 23:03   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 9+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-28  0:54 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: David Gibson, Gavin Shan

On 27/02/17 22:00, Michael Ellerman wrote:
> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> 
>> The IODA2 specification says that a 64 DMA address cannot use top 4 bits
>> (3 are reserved and one is a "TVE select"); bottom page_shift bits
>> cannot be used for multilevel table addressing either.
>>
>> The existing IODA2 table allocation code aligns the minimum TCE table
>> size to PAGE_SIZE so in the case of 64K system pages and 4K IOMMU pages,
>> we have 64-4-12=48 bits. Since 64K page stores 8192 TCEs, i.e. needs
>> 13 bits, the maximum number of levels is 48/13 = 3 so we physically
>> cannot address more and EEH happens on DMA accesses.
>>
>> This adds a check that too many levels were requested.
>>
>> It is still possible to have 5 levels in the case of 4K system page size.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> The alternative would be allocating TCE tables as big as PAGE_SIZE but
>> only using parts of it but this would complicate a bit bits of code
>> responsible for overall amount of memory used for TCE table.
>>
>> Or kmem_cache_create() could be used to allocate as big TCE table levels
>> as we really need but that API does not seem to support NUMA nodes.
> 
> kmem_cache_alloc_node() ?


Yeah, discovered this later. Still, if a single level is used, then the
table is 4MB and kmem_cache_alloc_node() does not seem the right tool here
(although I cannot find any enforced upper limit).

So to keep things simpler, I decided to stick to alloc_pages_node() and
avoid mixing memory allocation APIs.


-- 
Alexey


* Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-02-27 11:00 ` Michael Ellerman
  2017-02-28  0:54   ` Alexey Kardashevskiy
@ 2017-03-05 23:03   ` Benjamin Herrenschmidt
  2017-03-06  1:28     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 9+ messages in thread
From: Benjamin Herrenschmidt @ 2017-03-05 23:03 UTC (permalink / raw)
  To: Michael Ellerman, Alexey Kardashevskiy, linuxppc-dev
  Cc: Gavin Shan, David Gibson

On Mon, 2017-02-27 at 22:00 +1100, Michael Ellerman wrote:
> > The alternative would be allocating TCE tables as big as PAGE_SIZE
> > but
> > only using parts of it but this would complicate a bit bits of code
> > responsible for overall amount of memory used for TCE table.
> > 
> > Or kmem_cache_create() could be used to allocate as big TCE table
> > levels
> > as we really need but that API does not seem to support NUMA nodes.
> 
> kmem_cache_alloc_node() ?

Is that 55 bits of address space (ie, 3 indirect levels + 64k pages) ?
Or only 39 (2 indirect level + 64k pages) ?

In the former case, I'm happy to limit the levels to 3 for 64K pages,
55 bits of TCE space is more than enough. 39 isn't however.

Cheers,
Ben.


* Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-03-05 23:03   ` Benjamin Herrenschmidt
@ 2017-03-06  1:28     ` Alexey Kardashevskiy
  2017-03-06  3:36       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 9+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-06  1:28 UTC (permalink / raw)
  To: benh, Michael Ellerman, linuxppc-dev; +Cc: Gavin Shan, David Gibson

On 06/03/17 10:03, Benjamin Herrenschmidt wrote:
> On Mon, 2017-02-27 at 22:00 +1100, Michael Ellerman wrote:
>>> The alternative would be allocating TCE tables as big as PAGE_SIZE
>>> but
>>> only using parts of it but this would complicate a bit bits of code
>>> responsible for overall amount of memory used for TCE table.
>>>
>>> Or kmem_cache_create() could be used to allocate as big TCE table
>>> levels
>>> as we really need but that API does not seem to support NUMA nodes.
>>
>> kmem_cache_alloc_node() ?
> 
> Is that 55 bits of address space (ie, 3 indirect levels + 64k pages) ?
> Or only 39 (2 indirect level + 64k pages) ?

39, yes.

> In the former case, I'm happy to limit the levels to 3 for 64K pages,
> 55 bits of TCE space is more than enough. 39 isn't however.

8192*8192*8192*65536>>40 = 32768TB of addressable memory (but there is no
good reason not to use huge pages);
8192*8192*8192*4096>>40 = 2048TB of addressable memory (even with 2
indirect levels, but we can have all 5 levels with 4K IOMMU pages).

Looks enough to me...

And in this particular patch I am not limiting anything, I just replace
already existing EEH condition with -EINVAL. If it is this important to
have all 5 levels, then we can switch from alloc_pages_node() to
kmem_cache_alloc_node(), in a separate patch.


-- 
Alexey


* Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-03-06  1:28     ` Alexey Kardashevskiy
@ 2017-03-06  3:36       ` Benjamin Herrenschmidt
  2017-03-06  4:02         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Herrenschmidt @ 2017-03-06  3:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Ellerman, linuxppc-dev
  Cc: Gavin Shan, David Gibson

On Mon, 2017-03-06 at 12:28 +1100, Alexey Kardashevskiy wrote:
> 8192*8192*8192*65536>>40 = 32768TB of addressable memory (but there is no
> good reason not to use huge pages);

No, 39 bits is half a TB. That's not enough.

> 8192*8192*8192*4096>>40 = 2048TB or addressable memory (even with 2
> indirect levels but we can have all 5 levels with 4K IOMMU pages).
> 
> Looks enough to me...
> 
> And in this particular patch I am not limiting anything, I just replace
> already existing EEH condition with -EINVAL. If it is this important to
> have all 5 levels, then we can switch from alloc_pages_node() to
> kmem_cache_alloc_node(), in a separate patch.
> 
> 
> -- 


* Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-03-06  3:36       ` Benjamin Herrenschmidt
@ 2017-03-06  4:02         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 9+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-06  4:02 UTC (permalink / raw)
  To: benh, Michael Ellerman, linuxppc-dev; +Cc: Gavin Shan, David Gibson

On 06/03/17 14:36, Benjamin Herrenschmidt wrote:
> On Mon, 2017-03-06 at 12:28 +1100, Alexey Kardashevskiy wrote:
>> 8192*8192*8192*65536>>40 = 32768TB of addressable memory (but there is no
>> good reason not to use huge pages);
> 
> No, 39 bits is half a TB. That's not enough.

Ah. My bad. 55 bits it is. It is 2 "indirect" levels + 1 "direct" level,
each 8192 entries. So 13*3+16=55.
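
The arithmetic discussed in this thread can be reproduced with a short
sketch. The function names below are hypothetical and exist only for
illustration; the inputs are the figures quoted in the messages
(13 bits per 8192-entry level, 16-bit offset for 64K IOMMU pages,
12-bit offset for 4K IOMMU pages).

```c
#include <stdint.h>

/* Bits of DMA address space covered by `levels` table levels, each
 * resolving `bits_per_level` bits, plus the IOMMU page offset. */
static unsigned tce_addr_bits(unsigned bits_per_level, unsigned levels,
			      unsigned page_shift)
{
	return bits_per_level * levels + page_shift;
}

/* Whole terabytes (2^40 bytes) addressable with `bits` address bits;
 * anything below 1TB truncates to 0. */
static uint64_t addr_bits_to_tb(unsigned bits)
{
	return bits >= 40 ? (uint64_t)1 << (bits - 40) : 0;
}
```

Three 8192-entry levels with 64K IOMMU pages give 55 bits, i.e. 32768TB;
the same three levels with 4K IOMMU pages give 51 bits, i.e. 2048TB.
39 bits is 512GB, which is why addr_bits_to_tb(39) truncates to 0.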


> 
>> 8192*8192*8192*4096>>40 = 2048TB or addressable memory (even with 2
>> indirect levels but we can have all 5 levels with 4K IOMMU pages).
>>
>> Looks enough to me...
>>
>> And in this particular patch I am not limiting anything, I just replace
>> already existing EEH condition with -EINVAL. If it is this important to
>> have all 5 levels, then we can switch from alloc_pages_node() to
>> kmem_cache_alloc_node(), in a separate patch.
>>
>>
>> -- 
> 


-- 
Alexey


* Re: [kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  2017-02-22  4:43 [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested Alexey Kardashevskiy
  2017-02-27  0:53 ` Gavin Shan
  2017-02-27 11:00 ` Michael Ellerman
@ 2017-03-14 11:45 ` Michael Ellerman
  2 siblings, 0 replies; 9+ messages in thread
From: Michael Ellerman @ 2017-03-14 11:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Gavin Shan

On Wed, 2017-02-22 at 04:43:59 UTC, Alexey Kardashevskiy wrote:
> The IODA2 specification says that a 64 DMA address cannot use top 4 bits
> (3 are reserved and one is a "TVE select"); bottom page_shift bits
> cannot be used for multilevel table addressing either.
> 
> The existing IODA2 table allocation code aligns the minimum TCE table
> size to PAGE_SIZE so in the case of 64K system pages and 4K IOMMU pages,
> we have 64-4-12=48 bits. Since 64K page stores 8192 TCEs, i.e. needs
> 13 bits, the maximum number of levels is 48/13 = 3 so we physically
> cannot address more and EEH happens on DMA accesses.
> 
> This adds a check that too many levels were requested.
> 
> It is still possible to have 5 levels in the case of 4K system page size.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/7aafac11e308d37ed3c509829bb43d

cheers

