* [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
@ 2015-04-29 21:53 ` Suravee Suthikulpanit
0 siblings, 0 replies; 102+ messages in thread
From: Suravee Suthikulpanit @ 2015-04-29 21:53 UTC (permalink / raw)
To: linux-arm-kernel
On 4/29/15 11:25, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 08:44:09 Suravee Suthikulpanit wrote:
>> diff --git a/drivers/acpi/acpi_platform.c b/drivers/acpi/acpi_platform.c
>> index 4bf7559..a4db208 100644
>> --- a/drivers/acpi/acpi_platform.c
>> +++ b/drivers/acpi/acpi_platform.c
>> @@ -108,9 +108,12 @@ struct platform_device *acpi_create_platform_device(struct acpi_device *adev)
>> if (IS_ERR(pdev))
>> dev_err(&adev->dev, "platform device creation failed: %ld\n",
>> PTR_ERR(pdev));
>> - else
>> + else {
>> + arch_setup_dma_ops(&pdev->dev, 0, 0, NULL,
>> + adev->flags.is_coherent);
>> dev_dbg(&adev->dev, "created platform device %s\n",
>> dev_name(&pdev->dev));
>> + }
>>
>> kfree(resources);
>>
>
> Looking at this code in more detail, it seems that it unconditionally
> sets pdevinfo.dma_mask = DMA_BIT_MASK(32), before calling
> arch_setup_dma_ops().
I think that's just the default legacy value assigned when the
platform_device is first created from the acpi_device.
> This assignment should really be done inside of arch_setup_dma_ops()
> instead, which means we should implement that
> function on all architectures that support ACPI.
> For the case where _CCA is missing (or coherency disabled, if you ask
> me), we would not call that function.
Actually, I agree that when _CCA is missing where it is needed, the ACPI
driver probably should not make any assumption and should leave the
decision to the underlying arch-specific default. Basically, it should
not be calling arch_setup_dma_ops().
As for the case where _CCA=0, I think the ACPI driver should simply
communicate that the HW is non-coherent, as described in the
spec, and should be calling arch_setup_dma_ops(dev, false). It is true
that this is probably less likely on ARM64 server platforms.
However, I would think that the ACPI driver should not be making such
an assumption.
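The three _CCA cases above can be written down as a small decision
function. This is only a sketch: enum cca_state, struct dma_setup and
cca_to_dma_setup() are hypothetical names made up for illustration and are
not part of the patch or of the kernel. Only the _CCA values and the
decision whether to call arch_setup_dma_ops() come from this discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what firmware reported for _CCA. */
enum cca_state {
	CCA_MISSING,		/* no _CCA object found */
	CCA_NON_COHERENT,	/* _CCA evaluated to 0 */
	CCA_COHERENT,		/* _CCA evaluated to 1 */
};

/* Hypothetical result: what the ACPI core would do for the device. */
struct dma_setup {
	bool call_arch_setup;	/* would arch_setup_dma_ops() be called? */
	bool coherent;		/* coherent flag, valid only if call_arch_setup */
};

/*
 * Missing _CCA: make no assumption and leave the arch default alone.
 * _CCA=0 or _CCA=1: pass the firmware's answer through unchanged.
 */
struct dma_setup cca_to_dma_setup(enum cca_state cca)
{
	struct dma_setup s = { .call_arch_setup = false, .coherent = false };

	assert(cca == CCA_MISSING || cca == CCA_NON_COHERENT ||
	       cca == CCA_COHERENT);

	if (cca == CCA_MISSING)
		return s;

	s.call_arch_setup = true;
	s.coherent = (cca == CCA_COHERENT);
	return s;
}
```

Treating _CCA=0 as "no DMA" instead would only change the
CCA_NON_COHERENT branch to return early like CCA_MISSING does.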
> On a related note, I'm not sure how to handle different DMA masks here.
> arch_setup_dma_ops() gets passed a size (and offset) argument, which should
> match the DMA mask, but I don't know if there is a way to find out the
> size from ACPI. Should we assume it's always 64-bit DMA capable?
Looking at the ACPI spec, it does have the _DMA object. IIUC, this can
be used to describe DMA properties of a particular bus.
Name(_DMA, ResourceTemplate()
{
    QWORDMemory(
        ResourceConsumer,
        PosDecode,      // _DEC
        MinFixed,       // _MIF
        MaxFixed,       // _MAF
        Prefetchable,   // _MEM
        ReadWrite,      // _RW
        0,              // _GRA
        0,              // _MIN
        0x1fffffff,     // _MAX
        0x200000000,    // _TRA
        0x20000000,     // _LEN
        , , ,
    )
})
I am not sure if this is an appropriate use for this object, but it
seems to be similar to the dma-ranges property for OF, and it can probably
be used to specify the base address and size information when calling
arch_setup_dma_ops().
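To make the base/size idea concrete, here is a sketch that derives both
values from the fields in the _DMA example. struct dma_window and the two
helpers are hypothetical names, not kernel API. It assumes that _TRA is
the offset added to a bus address to obtain the CPU physical address; how
the results would map onto the arguments of arch_setup_dma_ops() is not
settled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the QWordMemory fields of a _DMA range. */
struct dma_window {
	uint64_t min;	/* _MIN: lowest bus address */
	uint64_t max;	/* _MAX: highest bus address */
	uint64_t tra;	/* _TRA: translation offset */
	uint64_t len;	/* _LEN: length of the range */
};

/* CPU-side base, assuming _TRA is added to the bus address. */
uint64_t dma_window_cpu_base(const struct dma_window *w)
{
	return w->min + w->tra;
}

/* Size of the window; a fixed range must satisfy _LEN == _MAX - _MIN + 1. */
uint64_t dma_window_size(const struct dma_window *w)
{
	assert(w->max >= w->min);
	assert(w->len == w->max - w->min + 1);
	return w->len;
}
```

With the values from the example (_MIN 0, _MAX 0x1fffffff, _TRA
0x200000000, _LEN 0x20000000) this describes a 512MB window of bus
addresses starting at zero that lands at CPU address 0x200000000.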
> For legacy reasons, the default mask is probably best left at 32-bit,
> but drivers are expected to call dma_set_mask() if they can do 64-bit DMA,
> and that should fail based on the information provided by the platform
> if the bus is not capable of doing that.
>
> Arnd
>
However, on ARM64 the dma_base and size parameters of
arch_setup_dma_ops() are currently not used, and only the coherent flag is
used. We should probably look at this separately. For the moment, we can
probably say that if the _CCA object is missing when needed, the ACPI driver
won't set up dma_mask when creating the platform_device, which should be
equivalent to saying DMA is not supported.
Please let me know if this is acceptable, and I'll make the change in V2
accordingly.
Thanks,
Suravee
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-29 21:53 ` Suravee Suthikulpanit
@ 2015-04-30 8:23 ` Arnd Bergmann
-1 siblings, 0 replies; 102+ messages in thread
From: Arnd Bergmann @ 2015-04-30 8:23 UTC (permalink / raw)
To: linaro-acpi
Cc: Suravee Suthikulpanit, linux-arm-kernel, catalin.marinas, rjw,
linux-kernel, will.deacon, linux-acpi, lenb
On Wednesday 29 April 2015 16:53:10 Suravee Suthikulpanit wrote:
> On 4/29/15 11:25, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 08:44:09 Suravee Suthikulpanit wrote:
> >> diff --git a/drivers/acpi/acpi_platform.c b/drivers/acpi/acpi_platform.c
> >> index 4bf7559..a4db208 100644
> >> --- a/drivers/acpi/acpi_platform.c
> >> +++ b/drivers/acpi/acpi_platform.c
> >> @@ -108,9 +108,12 @@ struct platform_device *acpi_create_platform_device(struct acpi_device *adev)
> >> if (IS_ERR(pdev))
> >> dev_err(&adev->dev, "platform device creation failed: %ld\n",
> >> PTR_ERR(pdev));
> >> - else
> >> + else {
> >> + arch_setup_dma_ops(&pdev->dev, 0, 0, NULL,
> >> + adev->flags.is_coherent);
> >> dev_dbg(&adev->dev, "created platform device %s\n",
> >> dev_name(&pdev->dev));
> >> + }
> >>
> >> kfree(resources);
> >>
> >
> > Looking at this code in more detail, it seems that it unconditionally
> > sets pdevinfo.dma_mask = DMA_BIT_MASK(32), before calling
> > arch_setup_dma_ops().
>
> I think that's just the default legacy value assigned when the
> platform_device is first created from the acpi_device.
Understood. And on x86 there is no way to find out if a device supports
DMA or not, so it has to do this I guess.
> > This assignment should really be done inside of arch_setup_dma_ops()
> > instead, which means we should implement that
> > function on all architectures that support ACPI.
>
>
> > For the case where _CCA is missing (or coherency disabled, if you ask
> > me), we would not call that function.
>
> Actually, I agree that when _CCA is missing where it is needed, the ACPI
> driver probably should not make any assumption and should leave the
> decision to the underlying arch-specific default. Basically, it should
> not be calling arch_setup_dma_ops().
Ok.
> As for the case where _CCA=0, I think the ACPI driver should simply
> communicate that the HW is non-coherent, as described in the
> spec, and should be calling arch_setup_dma_ops(dev, false). It is true
> that this is probably less likely on ARM64 server platforms.
> However, I would think that the ACPI driver should not be making such
> an assumption.
Can you add a description to the ACPI spec then to describe in detail what
"non-coherent" is supposed to mean, and which action the OS is supposed to
take when accessing data from device or CPU?
As I explained, the way we handle it by default on ARM64 is what embedded
systems typically do, but that might be completely different on the imagined
server chips that are not coherent for some reason. Just saying a device
is not coherent is like saying the CPU has known bugs but not saying how
to prevent it from crashing.
Is there some AML method that the OS can call to synchronize the cache
controller for all DMA to/from a particular device?
> > On a related note, I'm not sure how to handle different DMA masks here.
> > arch_setup_dma_ops() gets passed a size (and offset) argument, which should
> > match the DMA mask, but I don't know if there is a way to find out the
> > size from ACPI. Should we assume it's always 64-bit DMA capable?
>
> Looking at the ACPI spec, it does have the _DMA object. IIUC, this can
> be used to describe DMA properties of a particular bus.
>
> Name(_DMA, ResourceTemplate()
> {
>     QWORDMemory(
>         ResourceConsumer,
>         PosDecode,      // _DEC
>         MinFixed,       // _MIF
>         MaxFixed,       // _MAF
>         Prefetchable,   // _MEM
>         ReadWrite,      // _RW
>         0,              // _GRA
>         0,              // _MIN
>         0x1fffffff,     // _MAX
>         0x200000000,    // _TRA
>         0x20000000,     // _LEN
>         , , ,
>     )
> })
>
> I am not sure if this is an appropriate use for this object, but it
> seems to be similar to the dma-ranges property for OF, and it can probably
> be used to specify the base address and size information when calling
> arch_setup_dma_ops().
Yes, that seems like a good idea. What is the expected behavior when that
object is absent? Do we assume that the parent device is not DMA capable?
Is this sufficient to describe the case where a device can only do DMA
to a specific address range that is not at bus address zero but that maps
to the beginning of physical RAM?
> > For legacy reasons, the default mask is probably best left at 32-bit,
> > but drivers are expected to call dma_set_mask() if they can do 64-bit DMA,
> > and that should fail based on the information provided by the platform
> > if the bus is not capable of doing that.
> >
>
> However, on ARM64 the dma_base and size parameters of
> arch_setup_dma_ops() are currently not used, and only the coherent flag is
> used.
We can hope that we won't need the dma_base setting here, but it's
good to have the option to pass it down if we need it.
Not passing the size is a bug that needs to be fixed ASAP. I believe
a number of folks have run into this, most recently with the APM X-Gene
MMC controller.
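The link between the size argument and the mask check can be sketched as
follows. This is a simplified model of the behavior asked for above (a
mask request fails when the bus cannot cover it), not the kernel's actual
dma_set_mask() implementation. bus_dma_limit() and bus_accepts_mask() are
hypothetical helpers; DMA_BIT_MASK() is redefined locally with the same
value as the kernel macro so that the block builds on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as the kernel's DMA_BIT_MASK(n), redefined to build standalone. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Highest bus address reachable through a window of `size` bytes at `base`. */
uint64_t bus_dma_limit(uint64_t base, uint64_t size)
{
	assert(size != 0);
	assert(base <= UINT64_MAX - (size - 1));	/* no wrap-around */
	return base + size - 1;
}

/* Hypothetical check: accept a driver's mask only if the bus can cover it. */
bool bus_accepts_mask(uint64_t mask, uint64_t limit)
{
	return mask <= limit;
}
```

For a bus that can only reach the first 4GB, the limit is 0xffffffff, so
a request for DMA_BIT_MASK(32) is accepted and one for DMA_BIT_MASK(64)
is rejected. Without a size being passed down, no such limit can be
derived at all.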
> We should probably look at this separately. For the moment, we can
> probably say that if the _CCA object is missing when needed, the ACPI driver
> won't set up dma_mask when creating the platform_device, which should be
> equivalent to saying DMA is not supported.
>
> Please let me know if this is acceptable, and I'll make the change in V2
> accordingly.
I would still ask that you treat non-coherent to mean "no DMA" until
we have come up with a way to sufficiently describe the kind of
non-coherency in ACPI.
Arnd
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 8:23 ` Arnd Bergmann
(?)
@ 2015-04-30 10:41 ` Will Deacon
-1 siblings, 0 replies; 102+ messages in thread
From: Will Deacon @ 2015-04-30 10:41 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
Hi Arnd,
On Thu, Apr 30, 2015 at 09:23:59AM +0100, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 16:53:10 Suravee Suthikulpanit wrote:
> > As for the case where _CCA=0, I think the ACPI driver should simply
> > communicate that the HW is non-coherent, as described in the
> > spec, and should be calling arch_setup_dma_ops(dev, false). It is true
> > that this is probably less likely on ARM64 server platforms.
> > However, I would think that the ACPI driver should not be making such
> > an assumption.
>
> Can you add a description to the ACPI spec then to describe in detail what
> "non-coherent" is supposed to mean, and which action the OS is supposed to
> take when accessing data from device or CPU?
You may be interested in the IORT ACPI companion spec here:
http://infocenter.arm.com/help/topic/com.arm.doc.den0049a/DEN0049A_IO_Remapping_Table.pdf
On CCA, it says:
`This value must match the value returned by the _CCA object defined in
the DSDT for the device represented by this node. The attribute can take
the following values:
- 0x1: The device is fully coherent. No cache maintenance[1] is required for
memory shared with the device which is mapped on CPUs as
Inner Write-Back (IWB), Outer Write-back (OWB), and Inner
shareable (ISH). In addition, during system initialization at cold
boot, or after wakeup from low-power state, if the cache
coherency requires an SMMU override or some specific device
configuration, the platform firmware has to ensure that this has
been done. Therefore the semantics represented by a value of
0x1 are always correct at the time of hand-off from firmware to
OS.
- 0x0: The device is not coherent. Therefore:
* Cache maintenance is required for memory shared with the
device that is mapped on CPUs as IWB-OWB-ISH.
* No cache maintenance is required for memory shared with the
device that is mapped on the CPU as device or Non-cacheable.
All other values are reserved.
[1] Note: Caching operations described in this document apply to the CPU
caches and any other caches in the system where device memory accesses
can hit.'
This aside, the document introduces some useful, related concepts such
as CPM (coherent path to memory) and DACS (device attributes are cacheable
and inner shareable) for describing different IO subsystems. It also has
mechanisms to describe ID repainting from PCI->SMMU->ITS.
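The CCA rules quoted above reduce to a small predicate. This is a sketch
only: enum cpu_mapping and needs_cache_maintenance() are hypothetical
names, and the quoted text does not cover uncached mappings for a
coherent device, so the function assumes that those need no maintenance
either.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the CPU mapping types the IORT text mentions. */
enum cpu_mapping {
	MAP_IWB_OWB_ISH,	/* Inner/Outer Write-Back, Inner Shareable */
	MAP_DEVICE,		/* Device memory */
	MAP_NON_CACHEABLE,	/* Normal Non-cacheable */
};

/* Returns true when the quoted CCA rules call for cache maintenance. */
bool needs_cache_maintenance(unsigned int cca, enum cpu_mapping map)
{
	assert(cca <= 1);	/* all other values are reserved */

	if (cca == 1)
		return false;	/* fully coherent device */

	/* Non-coherent device: only cacheable CPU mappings need maintenance. */
	return map == MAP_IWB_OWB_ISH;
}
```

Note that the predicate only says whether maintenance is needed, not
which operations to perform.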
Will
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 10:41 ` Will Deacon
(?)
@ 2015-04-30 10:47 ` Arnd Bergmann
-1 siblings, 0 replies; 102+ messages in thread
From: Arnd Bergmann @ 2015-04-30 10:47 UTC (permalink / raw)
To: Will Deacon
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
On Thursday 30 April 2015 11:41:02 Will Deacon wrote:
> Hi Arnd,
>
> On Thu, Apr 30, 2015 at 09:23:59AM +0100, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 16:53:10 Suravee Suthikulpanit wrote:
> > > As for the case where _CCA=0, I think the ACPI driver should simply
> > > communicate that the HW is non-coherent, as described in the
> > > spec, and should be calling arch_setup_dma_ops(dev, false). It is true
> > > that this is probably less likely on ARM64 server platforms.
> > > However, I would think that the ACPI driver should not be making such
> > > an assumption.
> >
> > Can you add a description to the ACPI spec then to describe in detail what
> > "non-coherent" is supposed to mean, and which action the OS is supposed to
> > take when accessing data from device or CPU?
>
> You may be interested in the IORT ACPI companion spec here:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.den0049a/DEN0049A_IO_Remapping_Table.pdf
>
> On CCA, it says:
>
> `This value must match the value returned by the _CCA object defined in
> the DSDT for the device represented by this node. The attribute can take
> the following values:
>
> - 0x1: The device is fully coherent. No cache maintenance[1] is required for
> memory shared with the device which is mapped on CPUs as
> Inner Write-Back (IWB), Outer Write-back (OWB), and Inner
> shareable (ISH). In addition, during system initialization at cold
> boot, or after wakeup from low-power state, if the cache
> coherency requires an SMMU override or some specific device
> configuration, the platform firmware has to ensure that this has
> been done. Therefore the semantics represented by a value of
> 0x1 are always correct at the time of hand-off from firmware to
> OS.
Ok, this part absolutely makes sense.
> - 0x0: The device is not coherent. Therefore:
> * Cache maintenance is required for memory shared with the
> device that is mapped on CPUs as IWB-OWB-ISH.
This still seems insufficient. I guess this excludes having to
synchronize external bridges or write buffers, but it does not specify
what cache maintenance is required. Should there be an "outer-flush"?
Should the CPU cache be invalidated or flushed (or both), and do
we need to care about caches inside of the device or just inside of
the CPU?
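For comparison, one convention commonly used for non-coherent devices on
embedded ARM systems picks the operations by transfer direction. The
sketch below models only that convention; enum dma_dir, struct cache_ops
and both functions are hypothetical names, and nothing in the quoted text
says that this is what _CCA=0 is meant to imply.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the usual DMA transfer directions. */
enum dma_dir { DIR_BIDIRECTIONAL, DIR_TO_DEVICE, DIR_FROM_DEVICE };

struct cache_ops {
	bool clean;		/* write dirty CPU cache lines back to memory */
	bool invalidate;	/* discard CPU cache lines */
};

/* Operations before the buffer is handed to the device. */
struct cache_ops ops_before_dma(enum dma_dir dir)
{
	struct cache_ops ops = { .clean = false, .invalidate = false };

	assert(dir == DIR_BIDIRECTIONAL || dir == DIR_TO_DEVICE ||
	       dir == DIR_FROM_DEVICE);

	/* The device will read memory: push dirty lines out first. */
	if (dir != DIR_FROM_DEVICE)
		ops.clean = true;
	/* The device will write memory: no cached line may shadow its data. */
	if (dir != DIR_TO_DEVICE)
		ops.invalidate = true;
	return ops;
}

/* Operations after the device is done and the CPU takes the buffer back. */
struct cache_ops ops_after_dma(enum dma_dir dir)
{
	struct cache_ops ops = { .clean = false, .invalidate = false };

	/* Drop lines the CPU may have fetched speculatively meanwhile. */
	if (dir != DIR_TO_DEVICE)
		ops.invalidate = true;
	return ops;
}
```

Outer caches, write buffers and caches inside the device are not modeled
here.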
> * No cache maintenance is required for memory shared with the
> device that is mapped on the CPU as device or Non-cacheable.
>
> All other values are reserved.
>
> [1] Note: Caching operations described in this document apply to the CPU
> caches and any other caches in the system where device memory accesses
> can hit.'
>
> This aside, the document introduces some useful, related concepts such
> as CPM (coherent path to memory) and DACS (device attributes are cacheable
> and inner shareable) for describing different IO subsystems. It also has
> mechanisms to describe ID repainting from PCI->SMMU->ITS.
Ah, good.
Arnd
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 10:47 ` Arnd Bergmann
(?)
@ 2015-04-30 11:07 ` Will Deacon
-1 siblings, 0 replies; 102+ messages in thread
From: Will Deacon @ 2015-04-30 11:07 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
On Thu, Apr 30, 2015 at 11:47:46AM +0100, Arnd Bergmann wrote:
> On Thursday 30 April 2015 11:41:02 Will Deacon wrote:
> > - 0x0: The device is not coherent. Therefore:
> > * Cache maintenance is required for memory shared with the
> > device that is mapped on CPUs as IWB-OWB-ISH.
>
> This still seems insufficient. I guess this excludes having to
> synchronize external bridges or write buffers, but it does not specify
> what cache maintenance is required. Should there be an "outer-flush"?
> Should the CPU cache be invalidated or flushed (or both), and do
> we need to care about caches inside of the device or just inside of
> the CPU?
See the note below:
> > [1] Note: Caching operations described in this document apply to the CPU
> > caches and any other caches in the system where device memory accesses
> > can hit.'
So for the CPU caches we'd do the usual clean to push dirty lines to the device
and (clean+)invalidate before reading data from the device. For the "other
caches in the system" we currently assume (for ARM64) that cache maintenance
will be broadcast and therefore I wouldn't anticipate doing anything extra.
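As a rough illustration of the rule in the paragraph above (clean before the device reads memory, clean+invalidate before the CPU reads what the device wrote), here is a minimal sketch in plain C. The names dma_dir, cache_op and maintenance_for() are hypothetical and exist only for this example; they are not kernel APIs, and the sketch assumes broadcast maintenance covers any other caches in the system.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not kernel definitions. */
enum dma_dir { DIR_TO_DEVICE, DIR_FROM_DEVICE };

enum cache_op {
	OP_NONE       = 0,
	OP_CLEAN      = 1 << 0,	/* write dirty lines back to memory */
	OP_INVALIDATE = 1 << 1,	/* discard possibly stale lines */
};

/*
 * Which CPU cache maintenance a buffer needs, given whether the device
 * is coherent (_CCA = 1) and the direction of the transfer.
 */
static unsigned int maintenance_for(bool coherent, enum dma_dir dir)
{
	if (coherent)
		return OP_NONE;		/* hardware keeps caches in sync */

	switch (dir) {
	case DIR_TO_DEVICE:
		return OP_CLEAN;	/* push dirty lines out to memory */
	case DIR_FROM_DEVICE:
		return OP_CLEAN | OP_INVALIDATE; /* drop stale lines */
	default:
		assert(0 && "unknown DMA direction");
		return OP_NONE;
	}
}
```

For a non-coherent device, maintenance_for(false, DIR_FROM_DEVICE) yields both operations, while any coherent device yields OP_NONE regardless of direction.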
If people want to build system caches that don't respect broadcast cache
maintenance and require explicit management (e.g. outer_flush), then I
consider that a broken system and we should try to disable the cache before
entering the kernel. ARMv8 explicitly prohibits this type of cache in the
architecture (type 1 below):
`Conceptually, three classes of system cache can be envisaged:
1. System caches which lie before the point of coherency and cannot
be managed by any cache maintenance instructions. Such systems
fundamentally undermine the concept of cache maintenance
instructions operating to the point of coherency, as they imply
the use of non-architecture mechanisms to manage coherency. The
use of such systems in the ARM architecture is explicitly
prohibited.
2. System caches which lie before the point of coherency and can be
managed by cache maintenance by address instructions that apply to
the point of coherency, but cannot be managed by cache maintenance
by set/way instructions. Where maintenance of the entirety of such
a cache must be performed, as in the case for power management, it
must be performed using non-architectural mechanisms.
3. System caches which lie beyond the point of coherency and so are
invisible to the software. The management of such caches is
outside the scope of the architecture.'
(sorry to keep throwing the book at you!)
Will
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 11:07 ` Will Deacon
(?)
@ 2015-04-30 11:24 ` Arnd Bergmann
-1 siblings, 0 replies; 102+ messages in thread
From: Arnd Bergmann @ 2015-04-30 11:24 UTC (permalink / raw)
To: Will Deacon
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
On Thursday 30 April 2015 12:07:18 Will Deacon wrote:
> On Thu, Apr 30, 2015 at 11:47:46AM +0100, Arnd Bergmann wrote:
> > On Thursday 30 April 2015 11:41:02 Will Deacon wrote:
> > > - 0x0: The device is not coherent. Therefore:
> > > * Cache maintenance is required for memory shared with the
> > > device that is mapped on CPUs as IWB-OWB-ISH.
> >
> > This still seems insufficient. I guess this excludes having to
> > synchronize external bridges or write buffers, but it does not specify
> > what cache maintenance is required. Should there be an "outer-flush"?
> > Should the CPU cache be invalidated or flushed (or both), and do
> > we need to care about caches inside of the device or just inside of
> > the CPU?
>
> See the note below:
>
> > > [1] Note: Caching operations described in this document apply to the CPU
> > > caches and any other caches in the system where device memory accesses
> > > can hit.'
>
> So for the CPU caches we'd do the usual clean to push dirty lines to the device
> and (clean+)invalidate before reading data from the device. For the "other
> caches in the system" we currently assume (for ARM64) that cache maintenance
> will be broadcast and therefore I wouldn't anticipate doing anything extra.
>
> If people want to build system caches that don't respect broadcast cache
> maintenance and require explicit management (e.g. outer_flush), then I
> consider that a broken system and we should try to disable the cache before
> entering the kernel. ARMv8 explicitly prohibits this type of cache in the
> architecture (type 1 below):
>
> `Conceptually, three classes of system cache can be envisaged:
>
> 1. System caches which lie before the point of coherency and cannot
> be managed by any cache maintenance instructions. Such systems
> fundamentally undermine the concept of cache maintenance
> instructions operating to the point of coherency, as they imply
> the use of non-architecture mechanisms to manage coherency. The
> use of such systems in the ARM architecture is explicitly
> prohibited.
Hmm, I thought this was what GPUs typically have, with their own
internal caches that are managed by the GPU rather than the normal
cache maintenance instructions. Does this prohibit the use of most
GPU devices with ARMv8, or did I misunderstand what they do?
> 2. System caches which lie before the point of coherency and can be
> managed by cache maintenance by address instructions that apply to
> the point of coherency, but cannot be managed by cache maintenance
> by set/way instructions. Where maintenance of the entirety of such
> a cache must be performed, as in the case for power management, it
> must be performed using non-architectural mechanisms.
That still doesn't define which cache maintenance instructions are
required for a device that is marked as not coherent using the _CCA
property.
Here, I know that I have a cache that I can flush or invalidate or sync
using architected instructions, but should I?
In particular, there are two common models that we support in Linux:
a) embedded ARM32 and others
dma_alloc_noncoherent() == dma_alloc_coherent() == alloc uncached
dma_cache_sync() == not supportable
dma_sync_{single,sg,page}_for_{device,cpu} == {flush, invalidate, ...}
b) NUMA servers (parisc, itanium) and others
dma_alloc_noncoherent() == alloc cached
dma_alloc_coherent() == alloc uncached
dma_sync_{single,sg,page}_for_{device,cpu} == dma_cache_sync() == cache sync
There are probably other models that could happen, but the patch
set seems to assume a) is the only possible model, while the
architecture description you cite seems to still allow both a) and
b), as well as some variations, and it's possible that we will
see b) on arm64 servers but not a).
You could also have a system that requires cache invalidation for
sending data from the device to memory, but does not require anything
for memory-to-device data, or you could have the opposite.
> 3. System caches which lie beyond the point of coherency and so are
> invisible to the software. The management of such caches is
> outside the scope of the architecture.'
>
> (sorry to keep throwing the book at you!)
That's fine, at least I don't have to read it cover-to-cover then ;-)
Arnd
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 11:24 ` Arnd Bergmann
(?)
@ 2015-04-30 11:46 ` Will Deacon
-1 siblings, 0 replies; 102+ messages in thread
From: Will Deacon @ 2015-04-30 11:46 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
On Thu, Apr 30, 2015 at 12:24:12PM +0100, Arnd Bergmann wrote:
> On Thursday 30 April 2015 12:07:18 Will Deacon wrote:
> > So for the CPU caches we'd do the usual clean to push dirty lines to the device
> > and (clean+)invalidate before reading data from the device. For the "other
> > caches in the system" we currently assume (for ARM64) that cache maintenance
> > will be broadcast and therefore I wouldn't anticipate doing anything extra.
> >
> > If people want to build system caches that don't respect broadcast cache
> > maintenance and require explicit management (e.g. outer_flush), then I
> > consider that a broken system and we should try to disable the cache before
> > entering the kernel. ARMv8 explicitly prohibits this type of cache in the
> > architecture (type 1 below):
> >
> > `Conceptually, three classes of system cache can be envisaged:
> >
> > 1. System caches which lie before the point of coherency and cannot
> > be managed by any cache maintenance instructions. Such systems
> > fundamentally undermine the concept of cache maintenance
> > instructions operating to the point of coherency, as they imply
> > the use of non-architecture mechanisms to manage coherency. The
> > use of such systems in the ARM architecture is explicitly
> > prohibited.
>
> Hmm, I thought this was what GPUs typically have, with their own
> internal caches that are managed by the GPU rather than the normal
> cache maintenance instructions. Does this prohibit the use of most
> GPU devices with ARMv8, or did I misunderstand what they do?
No, because it's the responsibility of the GPU/GPU driver to ensure
that the internal caches are not visible to the CPU. I guess you can
think of data in the GPU private cache like data sitting in a CPU's write
buffer (i.e. non-snoopable).
> > 2. System caches which lie before the point of coherency and can be
> > managed by cache maintenance by address instructions that apply to
> > the point of coherency, but cannot be managed by cache maintenance
> > by set/way instructions. Where maintenance of the entirety of such
> > a cache must be performed, as in the case for power management, it
> > must be performed using non-architectural mechanisms.
>
> That still doesn't define which cache maintenance instructions are
> required for a device that is marked as not coherent using the _CCA
> property.
>
> Here, I know that I have a cache that I can flush or invalidate or sync
> using architected instructions, but should I?
Table 15 in the IORT spec shows the 8 combinations of CCA/CPM/DACS,
the mapping requirements and whether or not maintenance is required.
The actual maintenance operations aren't described, but they would
correspond with what we currently do in the ARM and arm64 kernels (clean to
device, clean+inv from device).
> In particular, there are two common models that we support in Linux:
>
> a) embedded ARM32 and others
>
> dma_alloc_noncoherent() == dma_alloc_coherent() == alloc uncached
> dma_cache_sync() == not supportable
> dma_sync_{single,sg,page}_for_{device,cpu} == {flush, invalidate, ...}
>
> b) NUMA servers (parisc, itanium) and others
>
> dma_alloc_noncoherent() == alloc cached
This would lead to mismatched memory attributes on ARM/arm64.
> dma_alloc_coherent() == alloc uncached
> dma_sync_{single,sg,page}_for_{device,cpu} == dma_cache_sync() == cache sync
Cache sync doesn't exist in the ARM/arm64 architecture. What are the
semantics supposed to be? Maybe it's just DSB for us (complete all pending
maintenance).
> There are probably other models that could happen, but the patch
> set seems to assume a) is the only possible model, while the
> architecture description you cite seems to still allow both a) and
> b), as well as some variations, and it's possible that we will
> see b) on arm64 servers but not a)
Well, we should be careful not to confuse the ACPI spec with the ARM
architecture. The latter is more permissive, but does disallow system
caches that do not respect broadcast maintenance.
It's also worth pointing out that the architecture doesn't distinguish
between embedded and server machines using A-class processors.
> You could also have a system that requires cache invalidation for
> sending data from the device to memory, but does not require anything
> for memory-to-device data, or you could have the opposite.
You could theoretically build all sorts of strange devices, but that doesn't
mean we have to support them. In the case you describe, they'd have to put
up with the cost of redundant cache cleaning but it should at least function
correctly.
Will
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 11:46 ` Will Deacon
(?)
@ 2015-04-30 13:03 ` Arnd Bergmann
-1 siblings, 0 replies; 102+ messages in thread
From: Arnd Bergmann @ 2015-04-30 13:03 UTC (permalink / raw)
To: Will Deacon
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
On Thursday 30 April 2015 12:46:15 Will Deacon wrote:
> On Thu, Apr 30, 2015 at 12:24:12PM +0100, Arnd Bergmann wrote:
> > On Thursday 30 April 2015 12:07:18 Will Deacon wrote:
> > > So for the CPU caches we'd do the usual clean to push dirty lines to the device
> > > and (clean+)invalidate before reading data from the device. For the "other
> > > caches in the system" we currently assume (for ARM64) that cache maintenance
> > > will be broadcast and therefore I wouldn't anticipate doing anything extra.
> > >
> > > If people want to build system caches that don't respect broadcast cache
> > > maintenance and require explicit management (e.g outer_flush), then I
> > > consider that a broken system and we should try to disable the cache before
> > > entering the kernel. ARMv8 explicitly prohibits this type of cache in the
> > > architecture (type 1 below):
> > >
> > > `Conceptually, three classes of system cache can be envisaged:
> > >
> > > 1. System caches which lie before the point of coherency and cannot
> > > be managed by any cache maintenance instructions. Such systems
> > > fundamentally undermine the concept of cache maintenance
> > > instructions operating to the point of coherency, as they imply
> > > the use of non-architecture mechanisms to manage coherency. The
> > > use of such systems in the ARM architecture is explicitly
> > > prohibited.
> >
> > Hmm, I thought this was what GPUs typically have, with their own
> > internal caches that are managed by the GPU rather than the normal
> > cache maintenance instructions. Does this prohibit the use of most
> > GPU devices with ARMv8, or did I misunderstand what they do?
>
> No, because it's the responsibility of the GPU/GPU driver to ensure
> that the internal caches are not visible to the CPU. I guess you can
> think of data in the GPU private cache like data sitting in a CPU's write
> buffer (i.e. non-snoopable).
Ok.
> > In particular, there are two common models that we support in Linux:
> >
> > a) embedded ARM32 and others
> >
> > dma_alloc_non_coherent() == dma_alloc_coherent() == alloc uncached
> > dma_cache_sync() == not supportable
> > dma_sync_{single,sg,page}_for_{device,cpu} == {flush, invalidate, ...}
> >
> > b) NUMA servers (parisc, itanium) and others
> >
> > dma_alloc_noncoherent() == alloc cached
>
> This would lead to mismatched memory attributes on ARM/arm64.
How so? This is just what __dma_alloc() on arm64 does for
coherent devices:

	/* no need for non-cacheable mapping if coherent */
	if (coherent)
		return ptr;

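To spell out what that check implies, here is a minimal sketch. The enum
and function are hypothetical stand-ins for illustration, not the real
__dma_alloc():

```c
/* Hypothetical stand-ins for illustration; not the arm64 implementation. */
enum map_attr { ATTR_CACHEABLE, ATTR_NON_CACHEABLE };

/*
 * A coherent device can share the normal cacheable mapping with the CPU.
 * A non-coherent device gets a non-cacheable mapping instead, so the
 * CPU and the device always observe the same data in memory.
 */
enum map_attr dma_alloc_attr(int coherent)
{
	return coherent ? ATTR_CACHEABLE : ATTR_NON_CACHEABLE;
}
```

So "alloc cached" and "alloc uncached" are both possible outcomes of the
same allocator call, depending only on the device's coherency flag.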
> > dma_alloc_coherent() == alloc uncached
> > dma_sync_{single,sg,page}_for_{device,cpu} == dma_cache_sync() == cache sync
>
> Cache sync doesn't exist in the ARM/arm64 architecture, what are the
> semantics supposed to be? Maybe it's just DSB for us (complete all pending
> maintenance).
It ensures that the state of a buffer as observed by the CPU and the
device is identical. It's possible that we removed all platforms that did
something interesting here, so it's one of these:

a) On architectures that are mostly coherent, it's a barrier
   that is broadcast to all devices, like I assume DSB is. IA64
   currently does this for all machines, but IIRC it used to
   access some cluster interconnect at some point to enforce a
   flush.
   The ARM32-based ArmadaXP also falls into this model if the cache
   coherency fabric is enabled, as that needs to be synchronized.

b) On architectures where the device may not see the state of the cache,
   but the CPU is always aware of anything the device sends it,
   it flushes the cache. This seems to be the case on parisc,
   and in particular, there are some variants that do not support
   dma_alloc_coherent but only dma_alloc_noncoherent.

c) On architectures that need the synchronization both ways,
   it does (almost) the same invalidate/clean/flush thing as
   ARM, except it doesn't have to worry about cache lines from
   speculative prefetch, which make it impossible to implement on
   ARM.

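Put as a sketch, with made-up names and with the direction-dependent
choice in c) collapsed into a single set, the three cases look something
like this:

```c
/* Hypothetical names for illustration; not an existing kernel interface. */
enum sync_model {
	SYNC_MOSTLY_COHERENT,	/* a) barrier broadcast to all devices */
	SYNC_DEVICE_BLIND,	/* b) device may not see CPU cache state */
	SYNC_BOTH_WAYS,		/* c) maintenance needed in both directions */
};

#define ACT_BARRIER	(1u << 0)
#define ACT_FLUSH	(1u << 1)
#define ACT_INVALIDATE	(1u << 2)
#define ACT_CLEAN	(1u << 3)

/* Return the set of actions a cache sync may have to perform. */
unsigned int cache_sync_actions(enum sync_model model)
{
	switch (model) {
	case SYNC_MOSTLY_COHERENT:
		return ACT_BARRIER;
	case SYNC_DEVICE_BLIND:
		return ACT_FLUSH;
	case SYNC_BOTH_WAYS:
		return ACT_INVALIDATE | ACT_CLEAN | ACT_FLUSH;
	}
	return 0;	/* unknown model: nothing is known to be required */
}
```

A real implementation of c) would pick one of the three actions per
transfer direction rather than returning all of them at once.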
> > There are probably other models that could happen, but the patch
> > set seems to assume a) is the only possible model, while the
> > architecture description you cite seems to still allow both a) and
> > b), as well as some variations, and it's possible that we will
> > see b) on arm64 servers but not a)
>
> Well, we should be careful not to confuse the ACPI spec with the ARM
> architecture. The latter is more permissive, but does disallow system
> caches that do not respect broadcast maintenance.
>
> It's also worth pointing out that the architecture doesn't distinguish
> between embedded and server machines using A-class processors.
>
> > You could also have a system that requires cache invalidation for
> > sending data from the device to memory, but does not require anything
> > for memory-to-device data, or you could have the opposite.
>
> You could theoretically build all sorts of strange devices, but that doesn't
> mean we have to support them. In the case you describe, they'd have to put
> up with the cost of redundant cache cleaning but it should at least function
> correctly.
Which case would a variant of ArmadaXP with a 64-bit core fall into then?
Do I understand correctly that having to sync the coherency fabric
would make it noncompliant with ACPI but still architecturally compliant?
I guess we could handle that case as well, by requiring any ACPI based
firmware to turn off the coherency fabric on that system and just making
it dog slow.
Arnd
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 13:03 ` Arnd Bergmann
(?)
@ 2015-04-30 13:13 ` Will Deacon
-1 siblings, 0 replies; 102+ messages in thread
From: Will Deacon @ 2015-04-30 13:13 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
On Thu, Apr 30, 2015 at 02:03:00PM +0100, Arnd Bergmann wrote:
> On Thursday 30 April 2015 12:46:15 Will Deacon wrote:
> > On Thu, Apr 30, 2015 at 12:24:12PM +0100, Arnd Bergmann wrote:
> > > In particular, there are two common models that we support in Linux:
> > >
> > > a) embedded ARM32 and others
> > >
> > > dma_alloc_non_coherent() == dma_alloc_coherent() == alloc uncached
> > > dma_cache_sync() == not supportable
> > > dma_sync_{single,sg,page}_for_{device,cpu} == {flush, invalidate, ...}
> > >
> > > b) NUMA servers (parisc, itanium) and others
> > >
> > > dma_alloc_noncoherent() == alloc cached
> >
> > This would lead to mismatched memory attributes on ARM/arm64.
>
> How so? This is just what __dma_alloc() on arm64 does for
> coherent devices:
>
> /* no need for non-cacheable mapping if coherent */
> if (coherent)
> return ptr;
Ok, I thought that you were only describing the cases when the device is
non-coherent (_CCA=0). Otherwise, your assertion above that
dma_alloc_coherent == alloc uncached isn't true for coherent devices.
So now I'm confused...
> > > dma_alloc_coherent() == alloc uncached
> > > dma_sync_{single,sg,page}_for_{device,cpu} == dma_cache_sync() == cache sync
> >
> > Cache sync doesn't exist in the ARM/arm64 architecture, what are the
> > semantics supposed to be? Maybe it's just DSB for us (complete all pending
> > maintenance).
>
> It ensures that a state of a buffer as observed by CPU and device is
> identical. It's possible that we removed all platforms that did something
> interesting here, so it's one of these:
>
> a) On architectures that are mostly coherent, it's a barrier
> that is broadcast to all devices, like I assume DSB is. IA64
> currently does this for all machines, but IIRC it used to
> access some cluster interconnect at some point to enforce a
> flush.
> The ARM32 based ArmadaXP also falls into this model if the cache
> coherency fabric is enabled, as that needs to be synchronized
> b) On architectures where the device may not see the state of the cache,
> but the CPU is always aware of anything the device sends it,
> it flushes the cache. This seems to be the case on parisc,
> and in particular, there are some variants that do not support
> dma_alloc_coherent but only dma_alloc_noncoherent.
> c) On architectures that need the synchronization both ways,
> it does (almost) the same invalidate/clean/flush thing as
> ARM, except it doesn't have to worry about cache lines from
> speculative prefetch which make it impossible to implement on
> ARM.
Okey doke, thanks for the explanation. It sounds like we can just build
the primitive out of the existing cache maintenance routines if we need
to implement it.
> > > There are probably other models that could happen, but the patch
> > > set seems to assume a) is the only possible model, while the
> > > architecture description you cite seems to still allow both a) and
> > > b), as well as some variations, and it's possible that we will
> > > see b) on arm64 servers but not a)
> >
> > Well, we should be careful not to confuse the ACPI spec with the ARM
> > architecture. The latter is more permissive, but does disallow system
> > caches that do not respect broadcast maintenance.
> >
> > It's also worth pointing out that the architecture doesn't distinguish
> > between embedded and server machines using A-class processors.
> >
> > > You could also have a system that requires cache invalidation for
> > > sending data from the device to memory, but does not require anything
> > > for memory-to-device data, or you could have the opposite.
> >
> > You could theoretically build all sorts of strange devices, but that doesn't
> > mean we have to support them. In the case you describe, they'd have to put
> > up with the cost of redundant cache cleaning but it should at least function
> > correctly.
>
> Which case would a variant of ArmadaXP with a 64-bit core fall into then?
> Do I understand it right that having to sync the coherency fabric
> would make it noncompliant with ACPI but still architecturally compliant?
I would say that the ArmadaXP coherency fabric is not compliant with ARMv8
as it requires additional steps over those cache maintenance instructions
described by the architecture (i.e. it falls into class (1) of the three
classes of system cache in the architecture).
> I guess we could handle that case as well, by requiring any ACPI based
> firmware to turn off the coherency fabric on that system and just making
> it dog slow.
We already require something similar in Documentation/arm64/booting.txt:
`System caches which do not respect architected cache maintenance by VA
operations (not recommended) must be configured and disabled.'
Will
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 13:13 ` Will Deacon
(?)
@ 2015-04-30 13:52 ` Arnd Bergmann
-1 siblings, 0 replies; 102+ messages in thread
From: Arnd Bergmann @ 2015-04-30 13:52 UTC (permalink / raw)
To: Will Deacon
Cc: linaro-acpi, suravee.suthikulpanit, linux-arm-kernel,
Catalin Marinas, rjw, linux-kernel, linux-acpi, lenb
On Thursday 30 April 2015 14:13:45 Will Deacon wrote:
> On Thu, Apr 30, 2015 at 02:03:00PM +0100, Arnd Bergmann wrote:
> > On Thursday 30 April 2015 12:46:15 Will Deacon wrote:
> > > On Thu, Apr 30, 2015 at 12:24:12PM +0100, Arnd Bergmann wrote:
> > > > In particular, there are two common models that we support in Linux:
> > > >
> > > > a) embedded ARM32 and others
> > > >
> > > > dma_alloc_noncoherent() == dma_alloc_coherent() == alloc uncached
> > > > dma_cache_sync() == not supportable
> > > > dma_sync_{single,sg,page}_for_{device,cpu} == {flush, invalidate, ...}
> > > >
> > > > b) NUMA servers (parisc, itanium) and others
> > > >
> > > > dma_alloc_noncoherent() == alloc cached
> > >
> > > This would lead to mismatched memory attributes on ARM/arm64.
> >
> > How so? This is just what __dma_alloc() on arm64 does for
> > coherent devices:
> >
> > /* no need for non-cacheable mapping if coherent */
> > if (coherent)
> > return ptr;
>
> Ok, I thought that you were only describing the cases when the device is
> non-coherent (_CCA=0). Otherwise, your assertion above that
> dma_alloc_coherent == alloc uncached isn't true for coherent devices.
>
> So now I'm confused...
What I was describing here is a device that is not fully coherent,
but instead requires some operation other than a cache flush/invalidate
to complete before the memory can be accessed.
> > > > dma_alloc_coherent() == alloc uncached
> > > > dma_sync_{single,sg,page}_for_{device,cpu} == dma_cache_sync() == cache sync
> > >
> > > > Cache sync doesn't exist in the ARM/arm64 architecture, what are the
> > > semantics supposed to be? Maybe it's just DSB for us (complete all pending
> > > maintenance).
> >
> > It ensures that a state of a buffer as observed by CPU and device is
> > identical. It's possible that we removed all platforms that did something
> > interesting here, so it's one of these:
> >
> > a) On architectures that are mostly coherent, it's a barrier
> > that is broadcast to all devices, like I assume DSB is. IA64
> > currently does this for all machines, but IIRC it used to
> > access some cluster interconnect at some point to enforce a
> > flush.
> > The ARM32 based ArmadaXP also falls into this model if the cache
> > coherency fabric is enabled, as that needs to be synchronized
> > b) On architectures where the device may not see the state of the cache,
> > but the CPU is always aware of anything the device sends it,
> > it flushes the cache. This seems to be the case on parisc,
> > and in particular, there are some variants that do not support
> > dma_alloc_coherent but only dma_alloc_noncoherent.
> > c) On architectures that need the synchronization both ways,
> > it does (almost) the same invalidate/clean/flush thing as
> > ARM, except it doesn't have to worry about cache lines from
> > speculative prefetch which make it impossible to implement on
> > ARM.
>
> Okey doke, thanks for the explanation. It sounds like we can just build
> the primitive out of the existing cache maintenance routines if we need
> to implement it.
Cases a) and b) yes, but not c), otherwise we could simplify
the ARM dma-mapping implementation and just merge __dma_page_cpu_to_dev
and __dma_page_dev_to_cpu into one function.
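To illustrate why the two helpers cannot simply be merged, here is a
minimal userspace sketch of direction-dependent maintenance. It assumes a
simplified model of the ARM32 noncoherent path; sync_for_device(),
sync_for_cpu() and the OP_* flags are made-up names for illustration, not
the kernel's functions.

```c
/* Transfer directions, ordered like the kernel's enum dma_data_direction. */
enum dma_dir { DIR_BIDIRECTIONAL, DIR_TO_DEVICE, DIR_FROM_DEVICE };

/* Hypothetical flags for the maintenance a sync step would perform. */
enum { OP_NONE = 0, OP_CLEAN = 1, OP_INVALIDATE = 2 };

/* Before the device accesses the buffer (the cpu_to_dev step). */
int sync_for_device(enum dma_dir dir)
{
	/*
	 * Device will write: drop cached lines so that no dirty line is
	 * evicted on top of the incoming data.
	 */
	if (dir == DIR_FROM_DEVICE)
		return OP_INVALIDATE;

	/* Device will read: write dirty lines back to memory. */
	return OP_CLEAN;
}

/* After the device is done with the buffer (the dev_to_cpu step). */
int sync_for_cpu(enum dma_dir dir)
{
	/*
	 * The CPU may have speculatively prefetched stale lines while the
	 * DMA was in flight, so invalidate again unless the device only
	 * read from the buffer.
	 */
	if (dir != DIR_TO_DEVICE)
		return OP_INVALIDATE;

	return OP_NONE;
}
```

In this model the two steps return different operations for the same
direction, which is the asymmetry that speculative prefetch forces.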
And a) and b) are both for systems that are more coherent than what
our noncoherent dma_map_ops implement, but less coherent than what
the coherent dma_map_ops do, and that is specifically what the ACPI
binding cannot describe, unless you argue that either ACPI or ARMv8
forbids both of these models.
> > Which case would a variant of ArmadaXP with a 64-bit core fall into then?
> > Do I understand it right that having to sync the coherency fabric
> > would make it noncompliant with ACPI but still architecturally compliant?
>
> I would say that the ArmadaXP coherency fabric is not compliant with ARMv8
> as it requires additional steps over those cache maintenance instructions
> described by the architecture (i.e. it falls into class (1) of the three
> classes of system cache in the architecture).
>
> > I guess we could handle that case as well, by requiring any ACPI based
> > firmware to turn off the coherency fabric on that system and just making
> > it dog slow.
>
> We already require something similar in Documentation/arm64/booting.txt:
>
> `System caches which do not respect architected cache maintenance by VA
> operations (not recommended) must be configured and disabled.'
Hmm, does that rule really get violated here? I think it fully respects
the cache maintenance (flush/invalidate/clean) operations, but it does
not fully respect the dsb/dmb instructions, which is something else.
Arnd
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 13:52 ` Arnd Bergmann
(?)
@ 2015-04-30 15:55 ` Catalin Marinas
-1 siblings, 0 replies; 102+ messages in thread
From: Catalin Marinas @ 2015-04-30 15:55 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Will Deacon, linaro-acpi, rjw, linux-kernel, linux-acpi,
suravee.suthikulpanit, linux-arm-kernel, lenb
On Thu, Apr 30, 2015 at 03:52:17PM +0200, Arnd Bergmann wrote:
> On Thursday 30 April 2015 14:13:45 Will Deacon wrote:
> > On Thu, Apr 30, 2015 at 02:03:00PM +0100, Arnd Bergmann wrote:
> > > On Thursday 30 April 2015 12:46:15 Will Deacon wrote:
> > > > Cache sync doesn't exist in the ARM/arm64 architecture, what are the
> > > > semantics supposed to be? Maybe it's just DSB for us (complete all pending
> > > > maintenance).
> > >
> > > It ensures that a state of a buffer as observed by CPU and device is
> > > identical. It's possible that we removed all platforms that did something
> > > interesting here, so it's one of these:
> > >
> > > a) On architectures that are mostly coherent, it's a barrier
> > > that is broadcast to all devices, like I assume DSB is. IA64
> > > currently does this for all machines, but IIRC it used to
> > > access some cluster interconnect at some point to enforce a
> > > flush.
> > > The ARM32 based ArmadaXP also falls into this model if the cache
> > > coherency fabric is enabled, as that needs to be synchronized
I'm getting confused by the ArmadaXP case. IIRC, the point of the
arm,io-coherent property for the PL310 was precisely to make
outer_sync a no-op when coherency is enabled. So basically an mb()
would only issue a DSB on such a platform, without the PL310 cache sync.
On coherent systems, devices usually snoop the inner/CPU cache and not
the system cache, which is further down the line. So a DSB would ensure
visibility at the coherent interconnect level, before the system
cache. I don't think it needs to be broadcast all the way to devices.
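As a rough illustration of the first point, here is a userspace model of
an mb() that issues a barrier and then an optional outer-cache sync hook.
It assumes the hook is simply left unset when the cache is described as
I/O coherent. model_mb(), model_outer_sync(), model_init_outer_cache() and
the counters are hypothetical names; the real DSB and the PL310 sync are
replaced by counters so the sketch can run anywhere.

```c
#include <stddef.h>

/* Counters standing in for the real DSB and the PL310 cache sync. */
int dsb_count;
int l2_sync_count;

/* Stub for the outer-cache sync operation. */
static void pl310_sync_stub(void)
{
	l2_sync_count++;
}

/* Optional outer-cache hook; NULL means no outer sync is needed. */
static void (*outer_sync_hook)(void);

/* Pick the hook at init time from the (assumed) io-coherent flag. */
void model_init_outer_cache(int io_coherent)
{
	outer_sync_hook = io_coherent ? NULL : pl310_sync_stub;
}

/* Call the outer-cache sync only if one was registered. */
static void model_outer_sync(void)
{
	if (outer_sync_hook)
		outer_sync_hook();
}

/* Model of mb(): barrier first, then the optional outer sync. */
void model_mb(void)
{
	dsb_count++;		/* stands in for the DSB instruction */
	model_outer_sync();
}
```

With the flag set, model_mb() only bumps the barrier counter, which is the
"DSB only" behaviour described above.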
> > > b) On architectures where the device may not see the state of the cache,
> > > but the CPU is always aware of anything the device sends it,
> > > it flushes the cache. This seems to be the case on parisc,
> > > and in particular, there are some variants that do not support
> > > dma_alloc_coherent but only dma_alloc_noncoherent.
> > > c) On architectures that need the synchronization both ways,
> > > it does (almost) the same invalidate/clean/flush thing as
> > > ARM, except it doesn't have to worry about cache lines from
> > > speculative prefetch which make it impossible to implement on
> > > ARM.
> >
> > Okey doke, thanks for the explanation. It sounds like we can just build
> > the primitive out of the existing cache maintenance routines if we need
> > to implement it.
>
> Cases a) and b) yes, but not c), otherwise we could simplify
> the ARM dma-mapping implementation and just merge __dma_page_cpu_to_dev
> and __dma_page_dev_to_cpu into one function.
I don't fully understand c) or b). Wouldn't the non-coherent ops cover
them both, though potentially not as efficiently?
> And a) and b) are both for systems that are more coherent than what
> our noncoherent dma_map_ops implement, but less coherent than what
> the coherent dma_map_ops do, and that is specifically what the ACPI
> binding cannot describe, unless you argue that either ACPI or ARMv8
> forbids both of these models.
In general, a DSB should work as described in the ARM ARM without the
need to poke additional devices (PL310 is an example not to follow).
> > > I guess we could handle that case as well, by requiring any ACPI based
> > > firmware to turn off the coherency fabric on that system and just making
> > > it dog slow.
> >
> > We already require something similar in Documentation/arm64/booting.txt:
> >
> > `System caches which do not respect architected cache maintenance by VA
> > operations (not recommended) must be configured and disabled.'
>
> Hmm, does that rule really get violated here? I think it fully respects
> the cache maintenance (flush/invalidate/clean) operations, but it does
> not fully respect the dsb/dmb instructions, which is something else.
If it fully respects the cache maintenance, it should also respect the
completion and ordering requirements of the cache maintenance
operations. That means that a DSB guarantees completion of such
operations.
--
Catalin
^ permalink raw reply [flat|nested] 102+ messages in thread
* [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
@ 2015-04-30 15:55 ` Catalin Marinas
0 siblings, 0 replies; 102+ messages in thread
From: Catalin Marinas @ 2015-04-30 15:55 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Apr 30, 2015 at 03:52:17PM +0200, Arnd Bergmann wrote:
> On Thursday 30 April 2015 14:13:45 Will Deacon wrote:
> > On Thu, Apr 30, 2015 at 02:03:00PM +0100, Arnd Bergmann wrote:
> > > On Thursday 30 April 2015 12:46:15 Will Deacon wrote:
> > > > Cache sync doesn't exist in the ARM/arm64 architecture, what are the
> > > > semantics supposed to be? Maybe it's just DSB for us (complete all pending
> > > > maintenance).
> > >
> > > It ensures that a state of a buffer as observed by CPU and device is
> > > identical. It's possible that we removed all platforms that did something
> > > interesting here, so it's one of these:
> > >
> > > a) On architectures that are mostly coherent, it's a barrier
> > > that is broadcast to all devices, like I assume DSB is. IA64
> > > currently does this for all machines, but IIRC it used to
> > > access some cluster interconnect at some point to enforce a
> > > flush.
> > > The ARM32 based ArmadaXP also falls into this model if the cache
> > > coherency fabric is enabled, as that needs to be synchronized
I'm getting confused by the ArmadaXP case. IIRC, the point of the
arm,io-coherent property to the PL310 was precisely to make the
outer_sync a no-op when the coherency is enabled. So basically an mb()
would only issue a DSB on such platform without the PL310 cache sync.
On coherent systems, devices usually snoop the inner/CPU cache and not
the system cache, that's further down the line. So a DSB would ensure
the visibility at the coherent interconnect level before the system
cache. I don't think it needs to be broadcast all the way to devices.
> > > b) On architectures where the device may not see the state of the cache,
> > > but the CPU is always aware of anything the device sends it,
> > > it flushes the cache. This seems to be the case on parisc,
> > > and in particular, there are some variants that do not support
> > > dma_alloc_coherent but only dma_alloc_noncoherent.
> > > c) On architectures that need the synchronization both ways,
> > > it does (almost) the same invalidate/clean/flush thing as
> > > ARM, except it doesn't have to worry about cache lines from
> > > speculative prefetch which make it impossible to implement on
> > > ARM.
> >
> > Okey doke, thanks for the explanation. It sounds like we can just build
> > the primitive out of the existing cache maintenance routines if we need
> > to implement it.
>
> Cases a) and b) yes, but not c), otherwise we could simplify
> the ARM dma-mapping implementation and just merge __dma_page_cpu_to_dev
> and __dma_page_dev_to_cpu into one function.
I don't fully understand c) or b). Wouldn't the non-coherent ops cover
them both, though potentially not as efficient?
> And a) and b) are both for systems that are more coherent than what
> our noncoherent dma_map_ops implement, but less coherent than what
> the coherent dma_map_ops do, and that is specifically what the ACPI
> binding cannot describe, unless you argue that either ACPI or ARMv8
> forbids both of these models.
In general, a DSB should work as described in the ARM ARM without the
need to poke additional devices (PL310 is an example not to follow).
> > > I guess we could handle that case as well, by requiring any ACPI based
> > > firmware to turn off the coherency fabric on that system and just making
> > > it dog slow.
> >
> > We already require something similar in Documentation/arm64/booting.txt:
> >
> > `System caches which do not respect architected cache maintenance by VA
> > operations (not recommended) must be configured and disabled.'
>
> Hmm, does that rule really get violated here? I think it fully respects
> the cache maintenance (flush/invalidate/clean) operations, but it does
> not fully respect the dsb/dmb instructions, which is something else.
If it fully respects the cache maintenance, it should also respect the
completion and ordering requirements of the cache maintenance
operations. That means that a DSB guarantees completion of such
operations.
--
Catalin
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 15:55 ` Catalin Marinas
(?)
@ 2015-05-08 14:01 ` Arnd Bergmann
-1 siblings, 0 replies; 102+ messages in thread
From: Arnd Bergmann @ 2015-05-08 14:01 UTC (permalink / raw)
To: linaro-acpi
Cc: Catalin Marinas, Will Deacon, rjw, linux-kernel, linux-acpi,
linux-arm-kernel, lenb
On Thursday 30 April 2015 16:55:14 Catalin Marinas wrote:
> On Thu, Apr 30, 2015 at 03:52:17PM +0200, Arnd Bergmann wrote:
> > On Thursday 30 April 2015 14:13:45 Will Deacon wrote:
> > > On Thu, Apr 30, 2015 at 02:03:00PM +0100, Arnd Bergmann wrote:
> > > > On Thursday 30 April 2015 12:46:15 Will Deacon wrote:
> > > > > Cache sync doesn't exist in the ARM/arm64 architecture, what are the
> > > > > semantics supposed to be? Maybe it's just DSB for us (complete all pending
> > > > > maintenance).
> > > >
> > > > It ensures that a state of a buffer as observed by CPU and device is
> > > > identical. It's possible that we removed all platforms that did something
> > > > interesting here, so it's one of these:
> > > >
> > > > a) On architectures that are mostly coherent, it's a barrier
> > > > that is broadcast to all devices, like I assume DSB is. IA64
> > > > currently does this for all machines, but IIRC it used to
> > > > access some cluster interconnect at some point to enforce a
> > > > flush.
> > > > The ARM32 based ArmadaXP also falls into this model if the cache
> > > > coherency fabric is enabled, as that needs to be synchronized
>
> I'm getting confused by the ArmadaXP case. IIRC, the point of the
> arm,io-coherent property to the PL310 was precisely to make the
> outer_sync a no-op when the coherency is enabled. So basically an mb()
> would only issue a DSB on such platform without the PL310 cache sync.
>
> On coherent systems, devices usually snoop the inner/CPU cache and not
> the system cache, that's further down the line. So a DSB would ensure
> the visibility at the coherent interconnect level before the system
> cache. I don't think it needs to be broadcast all the way to devices.
Sorry for the late reply. IIRC, the sync on Armada XP was not required
for the cache controller, but rather for the bus fabric, to ensure
that a DMA has made it into the memory controller.
> > > > b) On architectures where the device may not see the state of the cache,
> > > > but the CPU is always aware of anything the device sends it,
> > > > it flushes the cache. This seems to be the case on parisc,
> > > > and in particular, there are some variants that do not support
> > > > dma_alloc_coherent but only dma_alloc_noncoherent.
> > > > c) On architectures that need the synchronization both ways,
> > > > it does (almost) the same invalidate/clean/flush thing as
> > > > ARM, except it doesn't have to worry about cache lines from
> > > > speculative prefetch which make it impossible to implement on
> > > > ARM.
> > >
> > > Okey doke, thanks for the explanation. It sounds like we can just build
> > > the primitive out of the existing cache maintenance routines if we need
> > > to implement it.
> >
> > Cases a) and b) yes, but not c), otherwise we could simplify
> > the ARM dma-mapping implementation and just merge __dma_page_cpu_to_dev
> > and __dma_page_dev_to_cpu into one function.
>
> I don't fully understand c) or b). Wouldn't the non-coherent ops cover
> them both, though potentially not as efficient?
Turning off caches usually makes everything coherent, but the performance
cost can be gigantic. Also, it might not help if the problem with coherency
is the completion of the DMA as opposed to the caching.
> > > > I guess we could handle that case as well, by requiring any ACPI based
> > > > firmware to turn off the coherency fabric on that system and just making
> > > > it dog slow.
> > >
> > > We already require something similar in Documentation/arm64/booting.txt:
> > >
> > > `System caches which do not respect architected cache maintenance by VA
> > > operations (not recommended) must be configured and disabled.'
> >
> > Hmm, does that rule really get violated here? I think it fully respects
> > the cache maintenance (flush/invalidate/clean) operations, but it does
> > not fully respect the dsb/dmb instructions, which is something else.
>
> If it fully respects the cache maintenance, it should also respect the
> completion and ordering requirements of the cache maintenance
> operations. That means that a DSB guarantees completion of such
> operations.
Ok.
Arnd
^ permalink raw reply [flat|nested] 102+ messages in thread
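The reason __dma_page_cpu_to_dev and __dma_page_dev_to_cpu cannot simply be merged, as mentioned in the message above, can be illustrated with a small user-space model. This is only a sketch: the enum names and the two op_*() helpers are hypothetical, and the per-direction choices are an assumption based on how the 32-bit ARM streaming DMA path is usually described (clean or invalidate before the transfer, invalidate again afterwards because speculative prefetch may have pulled stale lines back into the cache).

```c
/* Hypothetical stand-ins for the kernel's DMA direction values. */
enum dma_dir { DIR_BIDIRECTIONAL, DIR_TO_DEVICE, DIR_FROM_DEVICE };

/* Cache maintenance the CPU would perform on the buffer. */
enum cache_op { OP_NONE, OP_CLEAN, OP_INVALIDATE };

/*
 * Before the device touches the buffer (models __dma_page_cpu_to_dev):
 * write dirty lines back so the device sees them, or drop lines that
 * the device is about to overwrite.
 */
static enum cache_op op_cpu_to_dev(enum dma_dir dir)
{
	return dir == DIR_FROM_DEVICE ? OP_INVALIDATE : OP_CLEAN;
}

/*
 * After the device is done (models __dma_page_dev_to_cpu): anything the
 * device may have written must be invalidated again, because lines could
 * have been speculatively fetched while the transfer was in flight.
 */
static enum cache_op op_dev_to_cpu(enum dma_dir dir)
{
	return dir == DIR_TO_DEVICE ? OP_NONE : OP_INVALIDATE;
}
```

In this model the two helpers disagree for every direction, which is the asymmetry that keeps the two functions separate.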
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
2015-04-30 8:23 ` Arnd Bergmann
(?)
@ 2015-04-30 23:39 ` Suravee Suthikulanit
-1 siblings, 0 replies; 102+ messages in thread
From: Suravee Suthikulanit @ 2015-04-30 23:39 UTC (permalink / raw)
To: Arnd Bergmann, linaro-acpi
Cc: linux-arm-kernel, catalin.marinas, rjw, linux-kernel,
will.deacon, linux-acpi, lenb
On 4/30/2015 3:23 AM, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 16:53:10 Suravee Suthikulpanit wrote:
>> On 4/29/15 11:25, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 08:44:09 Suravee Suthikulpanit wrote:
>> [...]
>> As for the case where _CCA=0, I think the ACPI driver should essentially
>> communicate the information as HW is non-coherent as described in the
>> spec, and should be calling arch_setup_dma_ops(dev, false). It is true
>> that this is probably less likely for the ARM64 server platforms.
>> However, I would think that the ACPI driver should not be making such
>> an assumption.
>
> Can you add a description to the ACPI spec then to describe in detail what
> "non-coherent" is supposed to mean, and which action the OS is supposed to
> take when accessing data from device or CPU?
I believe Will has already provided this, and we have already discussed
this in separate emails in this thread.
>>>[...]
>>> On a related note, I'm not sure how to handle different DMA masks here.
>>> arch_setup_dma_ops() gets passed a size (and offset) argument, which should
>>> match the DMA mask, but I don't know if there is a way to find out the
>>> size from ACPI. Should we assume it's always 64-bit DMA capable?
>>
>> Looking at the ACPI spec, it does have the _DMA object. IIUC, this can
>> be used to describe DMA properties of a particular bus.
>>
>> Name (_DMA, ResourceTemplate ()
>> {
>>     QWordMemory (
>>         ResourceConsumer,
>>         PosDecode,       // _DEC
>>         MinFixed,        // _MIF
>>         MaxFixed,        // _MAF
>>         Prefetchable,    // _MEM
>>         ReadWrite,       // _RW
>>         0,               // _GRA
>>         0,               // _MIN
>>         0x1fffffff,      // _MAX
>>         0x200000000,     // _TRA
>>         0x20000000,      // _LEN
>>         , , ,
>>     )
>> })
>>
>> I am not sure if this is an appropriate use for this object, but this
>> seems to be similar to the dma-ranges property for OF, and probably can
>> be used to specify baseaddr and size information when calling
>> arch_setup_dma_ops().
>
> Yes, that seems like a good idea. What is the expected behavior when that
> object is absent? Do we assume that the parent device is not DMA capable?
From the spec:
If the _DMA object is not present for a bus device, the OS assumes that
any address placed on a bus by a child device will be decoded either by
a device on the bus or by the bus itself, (in other words, all address
ranges can be used for DMA).
The issue is that, since this object is optional, I don't know how often
firmware actually provides this info.
> Is this sufficient to describe the case where a device can only do DMA
> to a specific address range that is not at bus address zero but that maps
> to the beginning of physical RAM?
I believe that's what _MIN (Minimum Base Address) is for.
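As a rough illustration of how the _DMA fields quoted above could feed the base and size passed to arch_setup_dma_ops(), here is a minimal user-space sketch. The struct and both helpers are hypothetical names, not kernel APIs, and the sketch assumes the usual reading of _TRA as the offset added to a child (secondary-side) bus address to obtain the parent (primary-side) address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the QWordMemory fields in the _DMA example. */
struct dma_window {
	uint64_t min; /* _MIN: lowest bus address in the window */
	uint64_t max; /* _MAX: highest bus address in the window */
	uint64_t tra; /* _TRA: offset from bus address to parent-side address */
	uint64_t len; /* _LEN: window size in bytes */
};

/* A fixed window is self-consistent when _LEN == _MAX - _MIN + 1. */
static bool dma_window_valid(const struct dma_window *w)
{
	return w->len != 0 && w->max >= w->min &&
	       w->max - w->min + 1 == w->len;
}

/*
 * Translate a child bus address to the parent side.
 * Returns false if the window is inconsistent or the address is outside it.
 */
static bool dma_window_to_parent(const struct dma_window *w,
				 uint64_t bus, uint64_t *parent)
{
	if (!dma_window_valid(w) || bus < w->min || bus > w->max)
		return false;
	*parent = bus + w->tra;
	return true;
}
```

With the values from the example, bus addresses 0 through 0x1fffffff would land at 0x200000000 and up on the parent side, and the window size (0x20000000) is what a caller might pass down as the DMA size.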
>>> For legacy reasons, the default mask is probably best left at 32-bit,
>>> but drivers are expected to call dma_set_mask() if they can do 64-bit DMA,
>>> and that should fail based on the information provided by the platform
>>> if the bus is not capable of doing that.
>>>
>> However, on ARM64 the dma_base and size parameters for
>> arch_setup_dma_ops() are currently not used; only the coherent flag is
>> used.
>
> We can hope that we won't need the dma_base setting here, but it's
> good to have the option to pass it down if we need it.
>
> Not passing the size is a bug that needs to be fixed ASAP; I believe
> a number of folks have run into this, most recently with the APM X-Gene
> MMC controller.
>
Ok. I'll look at this separately.
>> We probably should look at this separately. For the moment, we can
>> probably say that if the _CCA object is missing when needed, the ACPI
>> driver won't set up dma_mask when creating the platform_device, which
>> should be equivalent to saying DMA is not supported.
>>
>> Please let me know if this is acceptable, and I'll make the change in V2
>> accordingly.
>
> I would still ask that you treat non-coherent to mean "no DMA" until
> we have come up with a way to sufficiently describe the kind of
> non-coherency in ACPI.
>
> Arnd
Ok. In V2, when _CCA=0, I will also default to not calling
arch_setup_dma_ops() and fall back to the arch-specific default, since we
are not aware of any ARM64 systems using ACPI that work under that
assumption. We can revisit this later once we need to support such a case.
Thanks,
Suravee
^ permalink raw reply [flat|nested] 102+ messages in thread
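The _CCA handling settled on for V2 in the message above can be summarised in a small C sketch. This is a user-space model of the decision only, not the patch itself: the enum names and cca_to_dma_setup() are hypothetical, and only arch_setup_dma_ops() is taken from the thread.

```c
/* Hypothetical model of what firmware reported for _CCA on a device. */
enum cca_state {
	CCA_MISSING,     /* no _CCA object found */
	CCA_NONCOHERENT, /* _CCA evaluated to 0 */
	CCA_COHERENT,    /* _CCA evaluated to 1 */
};

/* What the ACPI platform code would do for the new platform device. */
enum dma_setup {
	DMA_SETUP_ARCH_DEFAULT, /* do not call arch_setup_dma_ops() */
	DMA_SETUP_COHERENT,     /* call arch_setup_dma_ops() with coherent = true */
};

/*
 * Only an explicit _CCA=1 sets up coherent DMA ops. A missing object and
 * _CCA=0 both leave the decision to the arch-specific default for now.
 */
static enum dma_setup cca_to_dma_setup(enum cca_state cca)
{
	switch (cca) {
	case CCA_COHERENT:
		return DMA_SETUP_COHERENT;
	case CCA_NONCOHERENT:
	case CCA_MISSING:
	default:
		return DMA_SETUP_ARCH_DEFAULT;
	}
}
```

Keeping _CCA=0 and the missing case on the same branch makes it easy to split them later, once a way to describe the kind of non-coherency exists.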
* Re: [Linaro-acpi] [PATCH 2/2] ACPI / scan: Parse _CCA and setup device coherency
@ 2015-04-30 23:39 ` Suravee Suthikulanit
0 siblings, 0 replies; 102+ messages in thread
From: Suravee Suthikulanit @ 2015-04-30 23:39 UTC (permalink / raw)
To: Arnd Bergmann, linaro-acpi
Cc: linux-arm-kernel, catalin.marinas, rjw, linux-kernel,
will.deacon, linux-acpi, lenb
On 4/30/2015 3:23 AM, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 16:53:10 Suravee Suthikulpanit wrote:
>> On 4/29/15 11:25, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 08:44:09 Suravee Suthikulpanit wrote:
>> [...]
>> As for the case where _CCA=0, I think the ACPI driver should essentially
>> communicate the information as HW is non-coherent as described in the
>> spec, and should be calling arch_setup_dma_ops(dev, false). It is true
>> that this is probably less likely for the ARM64 server platforms.
>> However, I would think that the ACPI driver should not be making such
>> an assumption.
>
> Can you add a description to the ACPI spec then to describe in detail what
> "non-coherent" is supposed to mean, and which action the OS is supposed to
> take when accessing data from device or CPU?
I believe Will has already provided this, and we have already discussed
this in separate emails in this thread.
>>>[...]
>>> On a related note, I'm not sure how to handle different DMA masks here.
>>> arch_setup_dma_ops() gets passed a size (and offset) argument, which should
>>> match the DMA mask, but I don't know if there is a way to find out the
>>> size from ACPI. Should we assume it's always 64-bit DMA capable?
>>
>> Looking at the ACPI spec, it does have the _DMA object. IIUC, this can
>> be used to describe DMA properties of a particular bus.
>>
>> Name (_DMA, ResourceTemplate ()
>> {
>>     QWordMemory (
>>         ResourceConsumer,
>>         PosDecode,       // _DEC
>>         MinFixed,        // _MIF
>>         MaxFixed,        // _MAF
>>         Prefetchable,    // _MEM
>>         ReadWrite,       // _RW
>>         0,               // _GRA
>>         0,               // _MIN
>>         0x1fffffff,      // _MAX
>>         0x200000000,     // _TRA
>>         0x20000000,      // _LEN
>>         , , ,
>>     )
>> })
>>
>> I am not sure if this is an appropriate use for this object, but this
>> seems to be similar to the dma-ranges property for OF, and probably can
>> be used to specify baseaddr and size information when calling
>> arch_setup_dma_ops().
>
> Yes, that seems like a good idea. What is the expected behavior when that
> object is absent? Do we assume that the parent device is not DMA capable?
From the spec:
If the _DMA object is not present for a bus device, the OS assumes that
any address placed on a bus by a child device will be decoded either by
a device on the bus or by the bus itself, (in other words, all address
ranges can be used for DMA).
The issue is that, since this object is optional, I don't know how often
FW actually provides this info.
> Is this sufficient to describe the case where a device can only do DMA
> to a specific address range that is not at bus address zero but that maps
> to the beginning of physical RAM?
I believe that's what the _MIN (Minimum Base Address) field is for.
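To make the _DMA discussion concrete, here is a minimal standalone C
sketch of how one range like the QWORDMemory example quoted above might
be turned into a base address and size. Everything in it is
hypothetical: struct dma_range, struct dma_window and
dma_window_from_range() are illustration-only names, not kernel or
ACPICA API. It also assumes that the CPU address is the bus address plus
_TRA, which should be checked against the spec, and it models an absent
_DMA object as "all addresses usable" per the spec text quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one QWORDMemory entry from _DMA. */
struct dma_range {
	uint64_t min;	/* _MIN: lowest bus address in the range */
	uint64_t max;	/* _MAX: highest bus address in the range */
	uint64_t tra;	/* _TRA: translation offset */
};

/* Hypothetical result that a caller could pass on to the arch code. */
struct dma_window {
	uint64_t bus_base;	/* first usable bus address */
	uint64_t cpu_base;	/* CPU address that bus_base maps to */
	uint64_t size;		/* 0 means "the whole 64-bit space" */
};

static struct dma_window dma_window_from_range(const struct dma_range *r)
{
	struct dma_window w = { 0, 0, 0 };

	/* No _DMA object: assume every address can be used for DMA. */
	if (r == NULL)
		return w;

	w.bus_base = r->min;
	/* Assumption: CPU address = bus address + _TRA. */
	w.cpu_base = r->min + r->tra;
	/* A range covering the full 64-bit space wraps to 0 here. */
	w.size = r->max - r->min + 1;
	return w;
}
```

With the values from the example (_MIN = 0, _MAX = 0x1fffffff,
_TRA = 0x200000000), this yields a 0x20000000-byte window whose bus
address 0 corresponds to CPU address 0x200000000 under the assumption
above.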
>>> For legacy reasons, the default mask is probably best left at 32-bit,
>>> but drivers are expected to call dma_set_mask() if they can do 64-bit DMA,
>>> and that should fail based on the information provided by the platform
>>> if the bus is not capable of doing that.
>>>
>> However, on ARM64 the dma_base and size parameter for
>> arch_setup_dma_ops() is currently not used, and only coherent flag is
>> used.
>
> We can hope that we won't need the dma_base setting here, but it's
> good to have the option to pass it down if we need it.
>
> Not passing the size is a bug that needs to be fixed ASAP, I believe
> a number of folks have run into this, most recently the APM X-Gene
> MMC controller
>
Ok. I'll look at this separately.
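As a rough illustration of the mask behaviour described in the quoted
text (32-bit default, with a wider mask rejected when the bus cannot do
it), here is a minimal standalone C sketch. struct mask_dev,
sketch_dma_set_mask() and dma_size_from_mask() are hypothetical names
and do not reflect the real dma_set_mask() implementation; the
bus_limit field is an assumed stand-in for whatever limit the platform
description would provide.

```c
#include <errno.h>
#include <stdint.h>

/* Same definition as the kernel's DMA_BIT_MASK() helper. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical device model, for illustration only. */
struct mask_dev {
	uint64_t bus_limit;	/* assumed: highest address the bus can route */
	uint64_t dma_mask;	/* current mask, 32-bit by default */
};

/* Reject a mask wider than the bus limit, otherwise record it. */
static int sketch_dma_set_mask(struct mask_dev *dev, uint64_t mask)
{
	if (mask > dev->bus_limit)
		return -EIO;
	dev->dma_mask = mask;
	return 0;
}

/* Size that corresponds to a mask; wraps to 0 for a full 64-bit mask. */
static uint64_t dma_size_from_mask(uint64_t mask)
{
	return mask + 1;
}
```

A driver that asks for DMA_BIT_MASK(64) on a bus limited to 32 bits
would get -EIO in this model and keep its 32-bit default.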
>> We probably should look at this separately. For the moment, we can
>> probably say that if _CCA object is missing when needed, the ACPI driver
>> won't set up dma_mask when creating platform_device, which should be
>> equivalent to saying DMA is not supported.
>>
>> Please let me know if this is acceptable, and I'll make change in V2
>> accordingly.
>
> I would still ask that you treat non-coherent to mean "no DMA" until
> we have come up with a way to sufficiently describe the kind of
> non-coherency in ACPI.
>
> Arnd
Ok. In V2, when _CCA=0, since we are not aware of any ARM64 systems that
work under such an assumption with ACPI, I will also default to not
calling arch_setup_dma_ops() and fall back to the arch-specific default.
We can revisit this later once we need to support such a case.
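The V2 behaviour described above could be sketched roughly as follows.
This is a standalone model, not the actual patch: enum cca_state,
struct cca_dev, sketch_arch_setup_dma_ops() and setup_coherency() are
hypothetical names, and the stub only records whether it was called
instead of doing what the real arch_setup_dma_ops() does.

```c
#include <stdbool.h>

/* Hypothetical tri-state for the result of evaluating _CCA. */
enum cca_state { CCA_MISSING, CCA_NONCOHERENT, CCA_COHERENT };

/* Hypothetical device model, for illustration only. */
struct cca_dev {
	bool dma_ops_set;	/* true once the arch hook has been called */
	bool coherent;
};

/* Stand-in for arch_setup_dma_ops(); it only records the call. */
static void sketch_arch_setup_dma_ops(struct cca_dev *dev, bool coherent)
{
	dev->dma_ops_set = true;
	dev->coherent = coherent;
}

static void setup_coherency(struct cca_dev *dev, enum cca_state cca)
{
	/* Only _CCA=1 triggers the arch hook. */
	if (cca == CCA_COHERENT)
		sketch_arch_setup_dma_ops(dev, true);
	/* _CCA missing or _CCA=0: leave the arch-specific default alone. */
}
```

Keeping the missing and _CCA=0 cases on the same path means
non-coherent handling can be added later by changing one branch.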
Thanks,
Suravee