All of lore.kernel.org
 help / color / mirror / Atom feed
* USB mass storage and ARM cache coherency
@ 2010-01-29 14:34 Catalin Marinas
  2010-01-29 16:10 ` Oliver Neukum
  2010-01-29 16:23 ` Ming Lei
  0 siblings, 2 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-01-29 14:34 UTC (permalink / raw)
  To: Matthew Dharm; +Cc: linux-usb, linux-kernel

Hi Matthew,

I've been trying for some time to use a rootfs (ext2) on a USB memory
stick on ARM platforms but without any success. The USB HCD driver is
ISP1760 which doesn't use DMA.

ARM has a Harvard cache architecture and what I get is incoherency
between the I and D caches. The CPU I'm using (ARM11MPCore) has PIPT
caches with D-cache lines allocation on write.

Basically, when user space tries to execute from a new page, it faults
and the data is requested via the VFS layer, SCSI block device and USB
mass storage from the ISP1760 driver. The page is then mapped into user
space and update_mmu_cache() called.

However, since the driver is PIO, the data copied from the USB device
into RAM gets stuck in the D-cache. On the above page requesting path
there is no call to flush_dcache_page() to handle D-cache maintenance
(for DMA drivers, that's handled by the DMA API).

Since the USB mass storage code has the information about the USB driver
capabilities (DMA or PIO), it looks like the best place to call
flush_dcache_page(). But I got lost in the SCSI emulation and all my
attempts failed to get a working rootfs.

Adding flush_dcache_page() higher up in mpage_end_io_read() solves the
problem but that's not the correct fix as it has wider implications and
it's not needed for DMA-capable devices.

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 14:34 USB mass storage and ARM cache coherency Catalin Marinas
@ 2010-01-29 16:10 ` Oliver Neukum
  2010-01-29 16:23 ` Ming Lei
  1 sibling, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-01-29 16:10 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Matthew Dharm, linux-usb, linux-kernel

Am Freitag, 29. Januar 2010 15:34:15 schrieb Catalin Marinas:
> Basically, when user space tries to execute from a new page, it faults
> and the data is requested via the VFS layer, SCSI block device and USB
> mass storage from the ISP1760 driver. The page is then mapped into user
> space and update_mmu_cache() called.
> 
> However, since the driver is PIO, the data copied from the USB device
> into RAM gets stuck in the D-cache. On the above page requesting path
> there is no call to flush_dcache_page() to handle D-cache maintenance
> (for DMA drivers, that's handled by the DMA API).
> 
> Since the USB mass storage code has the information about the USB driver
> capabilities (DMA or PIO), it looks like the best place to call
> flush_dcache_page(). But I got lost in the SCSI emulation and all my
> attempts failed to get a working rootfs.

No, that would be a very bad place in the layering to do this.
The problem would happen with ub and storage. It might also
happen with any other driver.
Please add this to the HCD driver.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 14:34 USB mass storage and ARM cache coherency Catalin Marinas
  2010-01-29 16:10 ` Oliver Neukum
@ 2010-01-29 16:23 ` Ming Lei
  2010-01-29 16:34   ` Catalin Marinas
  1 sibling, 1 reply; 352+ messages in thread
From: Ming Lei @ 2010-01-29 16:23 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Matthew Dharm, linux-usb, linux-kernel

2010/1/29 Catalin Marinas <catalin.marinas@arm.com>:
> Hi Matthew,
>
> I've been trying for some time to use a rootfs (ext2) on a USB memory
> stick on ARM platforms but without any success. The USB HCD driver is
> ISP1760 which doesn't use DMA.
>
> ARM has a Harvard cache architecture and what I get is incoherency
> between the I and D caches. The CPU I'm using (ARM11MPCore) has PIPT
> caches with D-cache lines allocation on write.
>
> Basically, when user space tries to execute from a new page, it faults
> and the data is requested via the VFS layer, SCSI block device and USB
> mass storage from the ISP1760 driver. The page is then mapped into user
> space and update_mmu_cache() called.
>
> However, since the driver is PIO, the data copied from the USB device
> into RAM gets stuck in the D-cache. On the above page requesting path
> there is no call to flush_dcache_page() to handle D-cache maintenance
> (for DMA drivers, that's handled by the DMA API).
>
> Since the USB mass storage code has the information about the USB driver

Sorry,  I am a little confused that usb mass storage has what information
about DMA or PIO of low level usb transfer?

Thanks,

-- 
Lei Ming

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 16:23 ` Ming Lei
@ 2010-01-29 16:34   ` Catalin Marinas
  2010-01-29 16:41     ` Oliver Neukum
  2010-01-29 17:51     ` Sergei Shtylyov
  0 siblings, 2 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-01-29 16:34 UTC (permalink / raw)
  To: Ming Lei; +Cc: Matthew Dharm, linux-usb, linux-kernel

On Fri, 2010-01-29 at 16:23 +0000, Ming Lei wrote:
> 2010/1/29 Catalin Marinas <catalin.marinas@arm.com>:
> > I've been trying for some time to use a rootfs (ext2) on a USB memory
> > stick on ARM platforms but without any success. The USB HCD driver is
> > ISP1760 which doesn't use DMA.
> >
> > ARM has a Harvard cache architecture and what I get is incoherency
> > between the I and D caches. The CPU I'm using (ARM11MPCore) has PIPT
> > caches with D-cache lines allocation on write.
> >
> > Basically, when user space tries to execute from a new page, it faults
> > and the data is requested via the VFS layer, SCSI block device and USB
> > mass storage from the ISP1760 driver. The page is then mapped into user
> > space and update_mmu_cache() called.
> >
> > However, since the driver is PIO, the data copied from the USB device
> > into RAM gets stuck in the D-cache. On the above page requesting path
> > there is no call to flush_dcache_page() to handle D-cache maintenance
> > (for DMA drivers, that's handled by the DMA API).
> >
> > Since the USB mass storage code has the information about the USB driver
> 
> Sorry,  I am a little confused that usb mass storage has what information
> about DMA or PIO of low level usb transfer?

I was thinking about checking dev->bus->controller->dma_mask which the
code (though not the storage one) seems to imply that if the dma_mask is
0, the HCD driver is only capable of PIO.

That would be a more general solution rather than going through each HCD
driver since my understanding is that flush_dcache_page() is only needed
together with the mass storage support.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 16:34   ` Catalin Marinas
@ 2010-01-29 16:41     ` Oliver Neukum
  2010-01-29 17:14       ` Catalin Marinas
  2010-01-29 17:51     ` Sergei Shtylyov
  1 sibling, 1 reply; 352+ messages in thread
From: Oliver Neukum @ 2010-01-29 16:41 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ming Lei, Matthew Dharm, linux-usb, linux-kernel

Am Freitag, 29. Januar 2010 17:34:03 schrieb Catalin Marinas:

> I was thinking about checking dev->bus->controller->dma_mask which the
> code (though not the storage one) seems to imply that if the dma_mask is
> 0, the HCD driver is only capable of PIO.

That a HCD is capable of DMA need not imply that DMA is used for every
transfer.
 
> That would be a more general solution rather than going through each HCD
> driver since my understanding is that flush_dcache_page() is only needed
> together with the mass storage support.

What about ub, nfs or nbd over a USB<->ethernet converter?
This, I am afraid is best solved at the HCD or glue layer.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 16:41     ` Oliver Neukum
@ 2010-01-29 17:14       ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-01-29 17:14 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Ming Lei, Matthew Dharm, linux-usb, linux-kernel

On Fri, 2010-01-29 at 16:41 +0000, Oliver Neukum wrote:
> Am Freitag, 29. Januar 2010 17:34:03 schrieb Catalin Marinas:
> 
> > I was thinking about checking dev->bus->controller->dma_mask which the
> > code (though not the storage one) seems to imply that if the dma_mask is
> > 0, the HCD driver is only capable of PIO.
> 
> That a HCD is capable of DMA need not imply that DMA is used for every
> transfer.

Actually the DMA drivers are safe in this respect only if the transfer
happens directly to a page cache page that may be (later) mapped into
user space. I'm not familiar with the USB drivers to fully understand
the data flow, so any help would be appreciated.

> > That would be a more general solution rather than going through each HCD
> > driver since my understanding is that flush_dcache_page() is only needed
> > together with the mass storage support.
> 
> What about ub, nfs or nbd over a USB<->ethernet converter?
> This, I am afraid is best solved at the HCD or glue layer.

NFS handles the cache flushing itself, so in this case there is no need
to duplicate the cache flushing at the HCD level. AFAICT, the HCD driver
may be used in several cases and it's only the storage case (via either
ub, mass storage etc.) that requires cache flushing. Is there a way to
differentiate between these at the HCD driver level?

Regarding nbd, is there any copying happening between the HCD driver
receiving the network packet from the USB-ethernet converter and the nbd
bio_vec buffers (most likely during the TCP/IP stack flow)? In this case
it would be for the nbd driver (doesn't seem to be the case now) to
flush the D-cache as the HCD flushing is not necessary as long as it
doesn't write directly to the page cache page.

The ub case is similar to the USB mass storage one, so they could both
benefit from flushing at the HCD driver level. But is this possible
without duplicating the flushing in the nfs case?

Regards.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 16:34   ` Catalin Marinas
  2010-01-29 16:41     ` Oliver Neukum
@ 2010-01-29 17:51     ` Sergei Shtylyov
  2010-01-29 18:54       ` Matthew Dharm
  1 sibling, 1 reply; 352+ messages in thread
From: Sergei Shtylyov @ 2010-01-29 17:51 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ming Lei, Matthew Dharm, linux-usb, linux-kernel

Hello.

Catalin Marinas wrote:

>>> I've been trying for some time to use a rootfs (ext2) on a USB memory
>>> stick on ARM platforms but without any success. The USB HCD driver is
>>> ISP1760 which doesn't use DMA.
>>>
>>> ARM has a Harvard cache architecture and what I get is incoherency
>>> between the I and D caches. The CPU I'm using (ARM11MPCore) has PIPT
>>> caches with D-cache lines allocation on write.
>>>
>>> Basically, when user space tries to execute from a new page, it faults
>>> and the data is requested via the VFS layer, SCSI block device and USB
>>> mass storage from the ISP1760 driver. The page is then mapped into user
>>> space and update_mmu_cache() called.
>>>
>>> However, since the driver is PIO, the data copied from the USB device
>>> into RAM gets stuck in the D-cache. On the above page requesting path
>>> there is no call to flush_dcache_page() to handle D-cache maintenance
>>> (for DMA drivers, that's handled by the DMA API).
>>>
>>> Since the USB mass storage code has the information about the USB driver
>>>       
>> Sorry,  I am a little confused that usb mass storage has what information
>> about DMA or PIO of low level usb transfer?
>>     
>
> I was thinking about checking dev->bus->controller->dma_mask which the
> code (though not the storage one) seems to imply that if the dma_mask is
> 0, the HCD driver is only capable of PIO.
>
> That would be a more general solution rather than going through each HCD
> driver since my understanding is that flush_dcache_page() is only needed
> together with the mass storage support.

   Note that DMA capable driver can be doing some transfers in PIO mode 
or falling back to PIO mode if DMA mode transfer is unsuccessful (the 
musb driver is an example of the latter and if the DMA rewrite patches 
will get accepted, it'll do short transfers in PIO mode).

MBR, Sergei


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 17:51     ` Sergei Shtylyov
@ 2010-01-29 18:54       ` Matthew Dharm
  2010-01-29 19:35         ` Greg KH
                           ` (2 more replies)
  0 siblings, 3 replies; 352+ messages in thread
From: Matthew Dharm @ 2010-01-29 18:54 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: Catalin Marinas, Ming Lei, linux-usb, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1739 bytes --]

On Fri, Jan 29, 2010 at 08:51:47PM +0300, Sergei Shtylyov wrote:
> Catalin Marinas wrote:
> 
> >That would be a more general solution rather than going through each HCD
> >driver since my understanding is that flush_dcache_page() is only needed
> >together with the mass storage support.
> 
>   Note that DMA capable driver can be doing some transfers in PIO mode 
> or falling back to PIO mode if DMA mode transfer is unsuccessful (the 
> musb driver is an example of the latter and if the DMA rewrite patches 
> will get accepted, it'll do short transfers in PIO mode).

Given that an HCD can choose, on the fly, if it's using DMA or PIO, the HCD
driver is the only place to reasonably put any cache-synchronization code.

That said, what do the other SCSI HCDs do?  I'm guessing the question gets
kinda muddy there, since the other SCSI HCDs all talk directly to some
piece of hardware, and thus are responsible for the cache management
themselves.

Based on that, one could argue that ub and usb-storage should be doing
this.

HOWEVER, I firmly believe that the cache-management functions belong with
the driver that actually talks to the low-level hardware, as that's the
only place where you can be 100% certain of what cache operations are
needed.  After all, I think someone is working on a USB-over-IP transport,
and trying to manage cache at the usb-storage level in that scenario is
just silly.

So, let's put this in the HCD drivers and be done with it.

Matt

-- 
Matthew Dharm                              Home: mdharm-usb@one-eyed-alien.net 
Maintainer, Linux USB Mass Storage Driver

I see you've been reading alt.sex.chubby.sheep voraciously.
					-- Tanya
User Friendly, 11/24/97

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 18:54       ` Matthew Dharm
@ 2010-01-29 19:35         ` Greg KH
  2010-02-01 13:49         ` Catalin Marinas
  2010-02-01 17:29         ` Catalin Marinas
  2 siblings, 0 replies; 352+ messages in thread
From: Greg KH @ 2010-01-29 19:35 UTC (permalink / raw)
  To: Sergei Shtylyov, Catalin Marinas, Ming Lei, linux-usb, linux-kernel

On Fri, Jan 29, 2010 at 10:54:34AM -0800, Matthew Dharm wrote:
> On Fri, Jan 29, 2010 at 08:51:47PM +0300, Sergei Shtylyov wrote:
> > Catalin Marinas wrote:
> > 
> > >That would be a more general solution rather than going through each HCD
> > >driver since my understanding is that flush_dcache_page() is only needed
> > >together with the mass storage support.
> > 
> >   Note that DMA capable driver can be doing some transfers in PIO mode 
> > or falling back to PIO mode if DMA mode transfer is unsuccessful (the 
> > musb driver is an example of the latter and if the DMA rewrite patches 
> > will get accepted, it'll do short transfers in PIO mode).
> 
> Given that an HCD can choose, on the fly, if it's using DMA or PIO, the HCD
> driver is the only place to reasonably put any cache-synchronization code.
> 
> That said, what do the other SCSI HCDs do?  I'm guessing the question gets
> kinda muddy there, since the other SCSI HCDs all talk directly to some
> piece of hardware, and thus are responsible for the cache management
> themselves.
> 
> Based on that, one could argue that ub and usb-storage should be doing
> this.
> 
> HOWEVER, I firmly believe that the cache-management functions belong with
> the driver that actually talks to the low-level hardware, as that's the
> only place where you can be 100% certain of what cache operations are
> needed.  After all, I think someone is working on a USB-over-IP transport,
> and trying to manage cache at the usb-storage level in that scenario is
> just silly.
> 
> So, let's put this in the HCD drivers and be done with it.

I agree, that's the place to fix this issue.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 18:54       ` Matthew Dharm
  2010-01-29 19:35         ` Greg KH
@ 2010-02-01 13:49         ` Catalin Marinas
  2010-02-01 17:29         ` Catalin Marinas
  2 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-01 13:49 UTC (permalink / raw)
  To: Matthew Dharm; +Cc: Sergei Shtylyov, Ming Lei, linux-usb, linux-kernel

On Fri, 2010-01-29 at 18:54 +0000, Matthew Dharm wrote:
> On Fri, Jan 29, 2010 at 08:51:47PM +0300, Sergei Shtylyov wrote:
> > Catalin Marinas wrote:
> > 
> > >That would be a more general solution rather than going through each HCD
> > >driver since my understanding is that flush_dcache_page() is only needed
> > >together with the mass storage support.
> > 
> >   Note that DMA capable driver can be doing some transfers in PIO mode 
> > or falling back to PIO mode if DMA mode transfer is unsuccessful (the 
> > musb driver is an example of the latter and if the DMA rewrite patches 
> > will get accepted, it'll do short transfers in PIO mode).
> 
> Given that an HCD can choose, on the fly, if it's using DMA or PIO, the HCD
> driver is the only place to reasonably put any cache-synchronization code.
> 
> That said, what do the other SCSI HCDs do?  I'm guessing the question gets
> kinda muddy there, since the other SCSI HCDs all talk directly to some
> piece of hardware, and thus are responsible for the cache management
> themselves.
> 
> Based on that, one could argue that ub and usb-storage should be doing
> this.
> 
> HOWEVER, I firmly believe that the cache-management functions belong with
> the driver that actually talks to the low-level hardware, as that's the
> only place where you can be 100% certain of what cache operations are
> needed.  After all, I think someone is working on a USB-over-IP transport,
> and trying to manage cache at the usb-storage level in that scenario is
> just silly.
> 
> So, let's put this in the HCD drivers and be done with it.

Doing this (flush_dcache_page) in the HCD driver (ISP1760) solves my
problem. I'll post a patch and also cc the driver maintainer.

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-01-29 18:54       ` Matthew Dharm
  2010-01-29 19:35         ` Greg KH
  2010-02-01 13:49         ` Catalin Marinas
@ 2010-02-01 17:29         ` Catalin Marinas
  2010-02-01 20:14           ` Alan Stern
                             ` (5 more replies)
  2 siblings, 6 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-01 17:29 UTC (permalink / raw)
  To: Matthew Dharm
  Cc: Sergei Shtylyov, Ming Lei, linux-usb, linux-kernel,
	Sebastian Siewior, Greg KH

On Fri, 2010-01-29 at 18:54 +0000, Matthew Dharm wrote:
> HOWEVER, I firmly believe that the cache-management functions belong with
> the driver that actually talks to the low-level hardware, as that's the
> only place where you can be 100% certain of what cache operations are
> needed.  After all, I think someone is working on a USB-over-IP transport,
> and trying to manage cache at the usb-storage level in that scenario is
> just silly.
> 
> So, let's put this in the HCD drivers and be done with it.

The patch below is what fixes the I-D cache incoherency issues on ARM. I
don't particularly like the solution but it seems to be the only one
available.

IMHO, Linux should have functions similar to the DMA API but for PIO
drivers (e.g. pio_map_single/pio_unmap_single) that non-coherent
architectures can define, otherwise being no-ops. Any thoughts?

Thanks.



isp1760: Flush the D-cache for the pipe-in transfer buffers

From: Catalin Marinas <catalin.marinas@arm.com>

When the HDC driver writes the data to the transfer buffers it pollutes
the D-cache (unlike DMA drivers where the device writes the data). If
the corresponding pages get mapped into user space, there are no
additional cache flushing operations performed and this causes random
user space faults on architectures with separate I and D caches
(Harvard) or those with aliasing D-cache.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matthew Dharm <mdharm-kernel@one-eyed-alien.net>
Cc: Greg KH <greg@kroah.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
---
 drivers/usb/host/isp1760-hcd.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/host/isp1760-hcd.c b/drivers/usb/host/isp1760-hcd.c
index 27b8f7c..4d3eeee 100644
--- a/drivers/usb/host/isp1760-hcd.c
+++ b/drivers/usb/host/isp1760-hcd.c
@@ -18,6 +18,8 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <asm/unaligned.h>
+#include <asm/cacheflush.h>
+#include <asm/memory.h>
 
 #include "../core/hcd.h"
 #include "isp1760-hcd.h"
@@ -904,6 +906,14 @@ __acquires(priv->lock)
 			status = 0;
 	}
 
+	if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
+		void *ptr;
+		for (ptr = urb->transfer_buffer;
+		     ptr < urb->transfer_buffer + urb->transfer_buffer_length;
+		     ptr += PAGE_SIZE)
+			flush_dcache_page(virt_to_page(ptr));
+	}
+
 	/* complete() can reenter this HCD */
 	usb_hcd_unlink_urb_from_ep(priv_to_hcd(priv), urb);
 	spin_unlock(&priv->lock);

-- 
Catalin


^ permalink raw reply related	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 17:29         ` Catalin Marinas
@ 2010-02-01 20:14           ` Alan Stern
  2010-02-02  4:24             ` Paul Mundt
  2010-02-01 22:30           ` Andreas Mohr
                             ` (4 subsequent siblings)
  5 siblings, 1 reply; 352+ messages in thread
From: Alan Stern @ 2010-02-01 20:14 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Mon, 1 Feb 2010, Catalin Marinas wrote:

> On Fri, 2010-01-29 at 18:54 +0000, Matthew Dharm wrote:
> > HOWEVER, I firmly believe that the cache-management functions belong with
> > the driver that actually talks to the low-level hardware, as that's the
> > only place where you can be 100% certain of what cache operations are
> > needed.  After all, I think someone is working on a USB-over-IP transport,
> > and trying to manage cache at the usb-storage level in that scenario is
> > just silly.
> > 
> > So, let's put this in the HCD drivers and be done with it.
> 
> The patch below is what fixes the I-D cache incoherency issues on ARM. I
> don't particularly like the solution but it seems to be the only one
> available.
> 
> IMHO, Linux should have functions similar to the DMA API but for PIO
> drivers (e.g. pio_map_single/pio_unmap_single) that non-coherent
> architectures can define, otherwise being no-ops. Any thoughts?

You should bring this up on the linux-arm-kernel mailing list and CC:  
the ARM maintainer.  They are the ones most directly affected.

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 17:29         ` Catalin Marinas
  2010-02-01 20:14           ` Alan Stern
@ 2010-02-01 22:30           ` Andreas Mohr
  2010-02-02  6:58             ` Oliver Neukum
  2010-02-02  6:39           ` Paul Mundt
                             ` (3 subsequent siblings)
  5 siblings, 1 reply; 352+ messages in thread
From: Andreas Mohr @ 2010-02-01 22:30 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH, Takashi Iwai

[CC Takashi]

On Mon, Feb 01, 2010 at 05:29:14PM +0000, Catalin Marinas wrote:
> On Fri, 2010-01-29 at 18:54 +0000, Matthew Dharm wrote:
> > HOWEVER, I firmly believe that the cache-management functions belong with
> > the driver that actually talks to the low-level hardware, as that's the
> > only place where you can be 100% certain of what cache operations are
> > needed.  After all, I think someone is working on a USB-over-IP transport,
> > and trying to manage cache at the usb-storage level in that scenario is
> > just silly.
> > 
> > So, let's put this in the HCD drivers and be done with it.
> 
> The patch below is what fixes the I-D cache incoherency issues on ARM. I
> don't particularly like the solution but it seems to be the only one
> available.

Thanks very much for working on this amazingly large problem!

I took some time to add your patch to ehci-q.c / ohci-q.c
(for my *hci-ssb.c ASUS WL-500gP v2), on my now _heavily_ patched-up 2.6.31.9,
but _UNFORTUNATELY_ it kept locking up the same way as always when stopping
playback despite being damn sure this time that this patch could have
the potential to finally fix it ;)
(I had to replace memory.h with page.h on my arch though, to fix the build)

This is on MIPSEL (not one of my many ARM devices, unfortunately ;),
with usb-audio, and the madplay process crashes in __bzero(),
which strongly indicates cache coherency issues (other subsequent backtraces
have lots of mmap and vma listed, see also my "snd_usb_audio OOPS on MIPSEL -
is that the mmap issue?").

Next thing I'll do is fire up gdb and get a good backtrace of the
__bzero() address to find out which page handling in mpd exactly
is hampered with crashes. This is now ~ the third patch that I applied
on-the-go and that didn't help, so it's probably time to do
some earnest analysis on what's really going on locally.

Note that usb-storage itself does work on this platform though.

Rather annoying to be so close (sound works) yet so far away,
especially after all that USB host trouble I already had.

Thanks a lot again,

Andreas Mohr

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 20:14           ` Alan Stern
@ 2010-02-02  4:24             ` Paul Mundt
  2010-02-02  9:58               ` Catalin Marinas
  0 siblings, 1 reply; 352+ messages in thread
From: Paul Mundt @ 2010-02-02  4:24 UTC (permalink / raw)
  To: Alan Stern
  Cc: Catalin Marinas, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	linux-usb, linux-kernel, Sebastian Siewior, Greg KH

On Mon, Feb 01, 2010 at 03:14:04PM -0500, Alan Stern wrote:
> On Mon, 1 Feb 2010, Catalin Marinas wrote:
> 
> > On Fri, 2010-01-29 at 18:54 +0000, Matthew Dharm wrote:
> > > HOWEVER, I firmly believe that the cache-management functions belong with
> > > the driver that actually talks to the low-level hardware, as that's the
> > > only place where you can be 100% certain of what cache operations are
> > > needed.  After all, I think someone is working on a USB-over-IP transport,
> > > and trying to manage cache at the usb-storage level in that scenario is
> > > just silly.
> > > 
> > > So, let's put this in the HCD drivers and be done with it.
> > 
> > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > don't particularly like the solution but it seems to be the only one
> > available.
> > 
> > IMHO, Linux should have functions similar to the DMA API but for PIO
> > drivers (e.g. pio_map_single/pio_unmap_single) that non-coherent
> > architectures can define, otherwise being no-ops. Any thoughts?
> 
> You should bring this up on the linux-arm-kernel mailing list and CC:  
> the ARM maintainer.  They are the ones most directly affected.
> 
No, this belongs on linux-arch, as it's something that impacts a lot of
people besides ARM.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 17:29         ` Catalin Marinas
  2010-02-01 20:14           ` Alan Stern
  2010-02-01 22:30           ` Andreas Mohr
@ 2010-02-02  6:39           ` Paul Mundt
  2010-02-02 11:05             ` Catalin Marinas
  2010-02-02  9:11           ` Sebastian Andrzej Siewior
                             ` (2 subsequent siblings)
  5 siblings, 1 reply; 352+ messages in thread
From: Paul Mundt @ 2010-02-02  6:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Mon, Feb 01, 2010 at 05:29:14PM +0000, Catalin Marinas wrote:
> On Fri, 2010-01-29 at 18:54 +0000, Matthew Dharm wrote:
> > HOWEVER, I firmly believe that the cache-management functions belong with
> > the driver that actually talks to the low-level hardware, as that's the
> > only place where you can be 100% certain of what cache operations are
> > needed.  After all, I think someone is working on a USB-over-IP transport,
> > and trying to manage cache at the usb-storage level in that scenario is
> > just silly.
> > 
> > So, let's put this in the HCD drivers and be done with it.
> 
> The patch below is what fixes the I-D cache incoherency issues on ARM. I
> don't particularly like the solution but it seems to be the only one
> available.
> 
> IMHO, Linux should have functions similar to the DMA API but for PIO
> drivers (e.g. pio_map_single/pio_unmap_single) that non-coherent
> architectures can define, otherwise being no-ops. Any thoughts?
> 
I would certainly be in favour of such a thing, particularly since on SH
we often find ourselves with coherent PIO and non-coherent MMIO.

This is however something that should be prototyped and submitted to
linux-arch for discussion.

> diff --git a/drivers/usb/host/isp1760-hcd.c b/drivers/usb/host/isp1760-hcd.c
> index 27b8f7c..4d3eeee 100644
> --- a/drivers/usb/host/isp1760-hcd.c
> +++ b/drivers/usb/host/isp1760-hcd.c
> @@ -18,6 +18,8 @@
>  #include <linux/uaccess.h>
>  #include <linux/io.h>
>  #include <asm/unaligned.h>
> +#include <asm/cacheflush.h>
> +#include <asm/memory.h>
>  
asm/memory.h isn't a portable header. If you are including it for
virt_to_page(), linux/io.h should already bring that in via asm/io.h.
If arm doesn't bring in virt_to_page() through its asm/io.h, then fix the
headers there please.

FWIW I used the same fix you came up with on r8a66597_hcd and it fixed up
crashes we were seeing there, too.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 22:30           ` Andreas Mohr
@ 2010-02-02  6:58             ` Oliver Neukum
  2010-02-02  9:31               ` Florian Fainelli
  0 siblings, 1 reply; 352+ messages in thread
From: Oliver Neukum @ 2010-02-02  6:58 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Catalin Marinas, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	linux-usb, linux-kernel, Sebastian Siewior, Greg KH,
	Takashi Iwai

Am Montag, 1. Februar 2010 23:30:01 schrieb Andreas Mohr:
> I took some time to add your patch to ehci-q.c / ohci-q.c
> (for my *hci-ssb.c ASUS WL-500gP v2), on my now heavily patched-up 2.6.31.9,
> but UNFORTUNATELY it kept locking up the same way as always when stopping
> playback despite being damn sure this time that this patch could have
> the potential to finally fix it ;)
> (I had to replace memory.h with page.h on my arch though, to fix the build)

A moment please. You are using ehci and ohci. Both are using dma.
Why does this issue arise?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 17:29         ` Catalin Marinas
                             ` (2 preceding siblings ...)
  2010-02-02  6:39           ` Paul Mundt
@ 2010-02-02  9:11           ` Sebastian Andrzej Siewior
  2010-02-02 11:09             ` Catalin Marinas
  2010-02-02 11:48           ` Oliver Neukum
  2010-02-08  6:55             ` Pavel Machek
  5 siblings, 1 reply; 352+ messages in thread
From: Sebastian Andrzej Siewior @ 2010-02-02  9:11 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Greg KH

* Catalin Marinas | 2010-02-01 17:29:14 [+0000]:

>> So, let's put this in the HCD drivers and be done with it.
That is the correct place. MMC -hcd drivers for instance are doing this
way.

>The patch below is what fixes the I-D cache incoherency issues on ARM. I
>don't particularly like the solution but it seems to be the only one
>available.
The PIO-MMC drivers walk through a scatter list via sg_miter_start() and
friends. Those helpers take care of this automaticly.

>IMHO, Linux should have functions similar to the DMA API but for PIO
>drivers (e.g. pio_map_single/pio_unmap_single) that non-coherent
>architectures can define, otherwise being no-ops. Any thoughts?
What is wrong with flush_dcache_page() ? And I think linux-arch is the
appropriate place.

>isp1760: Flush the D-cache for the pipe-in transfer buffers
>
>From: Catalin Marinas <catalin.marinas@arm.com>
>
>When the HDC driver writes the data to the transfer buffers it pollutes
>the D-cache (unlike DMA drivers where the device writes the data). If
>the corresponding pages get mapped into user space, there are no
>additional cache flushing operations performed and this causes random
>user space faults on architectures with separate I and D caches
>(Harvard) or those with aliasing D-cache.
>
>Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>Cc: Matthew Dharm <mdharm-kernel@one-eyed-alien.net>
>Cc: Greg KH <greg@kroah.com>
>Cc: Sebastian Siewior <bigeasy@linutronix.de>
>---
> drivers/usb/host/isp1760-hcd.c |   10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
>
>diff --git a/drivers/usb/host/isp1760-hcd.c b/drivers/usb/host/isp1760-hcd.c
>index 27b8f7c..4d3eeee 100644
>--- a/drivers/usb/host/isp1760-hcd.c
>+++ b/drivers/usb/host/isp1760-hcd.c
>@@ -18,6 +18,8 @@
> #include <linux/uaccess.h>
> #include <linux/io.h>
> #include <asm/unaligned.h>
>+#include <asm/cacheflush.h>
>+#include <asm/memory.h>

I'm fine with the patch generally but I don't like the asm headers.
cacheflush.h is available on most architectures as far as I can see it but
memory.h is only available on arm. So you break the build on !arm and
therefore I NAK this.

> #include "../core/hcd.h"
> #include "isp1760-hcd.h"
>@@ -904,6 +906,14 @@ __acquires(priv->lock)
> 			status = 0;
> 	}
> 
>+	if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
>+		void *ptr;
>+		for (ptr = urb->transfer_buffer;
>+		     ptr < urb->transfer_buffer + urb->transfer_buffer_length;
>+		     ptr += PAGE_SIZE)
>+			flush_dcache_page(virt_to_page(ptr));
>+	}
>+
> 	/* complete() can reenter this HCD */
> 	usb_hcd_unlink_urb_from_ep(priv_to_hcd(priv), urb);
> 	spin_unlock(&priv->lock);
>

Sebastian

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02  6:58             ` Oliver Neukum
@ 2010-02-02  9:31               ` Florian Fainelli
  0 siblings, 0 replies; 352+ messages in thread
From: Florian Fainelli @ 2010-02-02  9:31 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Andreas Mohr, Catalin Marinas, Matthew Dharm, Sergei Shtylyov,
	Ming Lei, linux-usb, linux-kernel, Sebastian Siewior, Greg KH,
	Takashi Iwai

On Tuesday 02 February 2010 07:58:42 Oliver Neukum wrote:
> Am Montag, 1. Februar 2010 23:30:01 schrieb Andreas Mohr:
> > I took some time to add your patch to ehci-q.c / ohci-q.c
> > (for my *hci-ssb.c ASUS WL-500gP v2), on my now heavily patched-up
> > 2.6.31.9, but UNFORTUNATELY it kept locking up the same way as always
> > when stopping playback despite being damn sure this time that this patch
> > could have the potential to finally fix it ;)
> > (I had to replace memory.h with page.h on my arch though, to fix the
> > build)
> 
> A moment please. You are using ehci and ohci. Both are using dma.
> Why does this issue arise?

Because the BCM4710 CPU core is know to have cache problems and we have been 
trying to workaround this, your problem Andreas is imho a different one.
--
Regards, Florian

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02  4:24             ` Paul Mundt
@ 2010-02-02  9:58               ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02  9:58 UTC (permalink / raw)
  To: Paul Mundt
  Cc: Alan Stern, Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2010-02-02 at 04:24 +0000, Paul Mundt wrote:
> On Mon, Feb 01, 2010 at 03:14:04PM -0500, Alan Stern wrote:
> > On Mon, 1 Feb 2010, Catalin Marinas wrote:
> >
> > > On Fri, 2010-01-29 at 18:54 +0000, Matthew Dharm wrote:
> > > > HOWEVER, I firmly believe that the cache-management functions belong with
> > > > the driver that actually talks to the low-level hardware, as that's the
> > > > only place where you can be 100% certain of what cache operations are
> > > > needed.  After all, I think someone is working on a USB-over-IP transport,
> > > > and trying to manage cache at the usb-storage level in that scenario is
> > > > just silly.
> > > >
> > > > So, let's put this in the HCD drivers and be done with it.
> > >
> > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > don't particularly like the solution but it seems to be the only one
> > > available.
> > >
> > > IMHO, Linux should have functions similar to the DMA API but for PIO
> > > drivers (e.g. pio_map_single/pio_unmap_single) that non-coherent
> > > architectures can define, otherwise being no-ops. Any thoughts?
> >
> > You should bring this up on the linux-arm-kernel mailing list and CC: 
> > the ARM maintainer.  They are the ones most directly affected.
> 
> No, this belongs on linux-arch, as it's something that impacts a lot of
> people besides ARM.

I agree. I'll try to come up with a proposal and post it there.

BTW, this was already raised on the ARM Linux lists and people there are
aware of these problems. Their suggestion was to take it to LKML.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02  6:39           ` Paul Mundt
@ 2010-02-02 11:05             ` Catalin Marinas
  2010-02-02 11:15               ` Paul Mundt
  0 siblings, 1 reply; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 11:05 UTC (permalink / raw)
  To: Paul Mundt
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2010-02-02 at 06:39 +0000, Paul Mundt wrote:
> On Mon, Feb 01, 2010 at 05:29:14PM +0000, Catalin Marinas wrote:
> > diff --git a/drivers/usb/host/isp1760-hcd.c b/drivers/usb/host/isp1760-hcd.c
> > index 27b8f7c..4d3eeee 100644
> > --- a/drivers/usb/host/isp1760-hcd.c
> > +++ b/drivers/usb/host/isp1760-hcd.c
> > @@ -18,6 +18,8 @@
> >  #include <linux/uaccess.h>
> >  #include <linux/io.h>
> >  #include <asm/unaligned.h>
> > +#include <asm/cacheflush.h>
> > +#include <asm/memory.h>
> 
> asm/memory.h isn't a portable header. If you are including it for
> virt_to_page(), linux/io.h should already bring that in via asm/io.h.
> If arm doesn't bring in virt_to_page() through its asm/io.h, then fix the
> headers there please.

In the ARM case, yes, it brings virt_to_page() but I'm not sure that's
the case for the other architectures. I think a better header is
linux/mm.h which already uses this function in virt_to_head_page().

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02  9:11           ` Sebastian Andrzej Siewior
@ 2010-02-02 11:09             ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 11:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Greg KH

On Tue, 2010-02-02 at 09:11 +0000, Sebastian Andrzej Siewior wrote:
> * Catalin Marinas | 2010-02-01 17:29:14 [+0000]:
> >> So, let's put this in the HCD drivers and be done with it.
> 
> That is the correct place. MMC -hcd drivers for instance are doing this
> way.
> 
> >The patch below is what fixes the I-D cache incoherency issues on ARM. I
> >don't particularly like the solution but it seems to be the only one
> >available.
> 
> The PIO-MMC drivers walk through a scatter list via sg_miter_start() and
> friends. Those helpers take care of this automaticly.
> 
> >IMHO, Linux should have functions similar to the DMA API but for PIO
> >drivers (e.g. pio_map_single/pio_unmap_single) that non-coherent
> >architectures can define, otherwise being no-ops. Any thoughts?
> 
> What is wrong with flush_dcache_page() ? 

In this particular case, it's too many lines to do the virt_to_page for
the transfer buffer since the HCD driver doesn't have access to the
individual pages (via something like urb->sg). A better solution would
be to move such loop in a flush_dcache_range() function to make it
easier for drivers.

Apart from that, flush_dcache_page() doesn't have any data flow
information. Optimisations could be done on ARM if we know that the
kernel only intends to read from a page (no flushing necessary with a
non-aliasing D-cache).

> And I think linux-arch is the appropriate place.

For changes to the cache flushing API, yes, that's the right place. I'll
get there with a patch.
> 
> >diff --git a/drivers/usb/host/isp1760-hcd.c b/drivers/usb/host/isp1760-hcd.c
> >index 27b8f7c..4d3eeee 100644
> >--- a/drivers/usb/host/isp1760-hcd.c
> >+++ b/drivers/usb/host/isp1760-hcd.c
> >@@ -18,6 +18,8 @@
> > #include <linux/uaccess.h>
> > #include <linux/io.h>
> > #include <asm/unaligned.h>
> >+#include <asm/cacheflush.h>
> >+#include <asm/memory.h>
> 
> I'm fine with the patch generally but I don't like the asm headers.
> cacheflush.h is available on most architectures as far as I can see it but
> memory.h is only available on arm. So you break the build on !arm and
> therefore I NAK this.

Yes, that was already pointed out. I'll post a revised patch (until we
maybe get a better API for such things but that's for linux-arch).

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 11:05             ` Catalin Marinas
@ 2010-02-02 11:15               ` Paul Mundt
  0 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-02-02 11:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Tue, Feb 02, 2010 at 11:05:39AM +0000, Catalin Marinas wrote:
> On Tue, 2010-02-02 at 06:39 +0000, Paul Mundt wrote:
> > On Mon, Feb 01, 2010 at 05:29:14PM +0000, Catalin Marinas wrote:
> > > diff --git a/drivers/usb/host/isp1760-hcd.c b/drivers/usb/host/isp1760-hcd.c
> > > index 27b8f7c..4d3eeee 100644
> > > --- a/drivers/usb/host/isp1760-hcd.c
> > > +++ b/drivers/usb/host/isp1760-hcd.c
> > > @@ -18,6 +18,8 @@
> > >  #include <linux/uaccess.h>
> > >  #include <linux/io.h>
> > >  #include <asm/unaligned.h>
> > > +#include <asm/cacheflush.h>
> > > +#include <asm/memory.h>
> > 
> > asm/memory.h isn't a portable header. If you are including it for
> > virt_to_page(), linux/io.h should already bring that in via asm/io.h.
> > If arm doesn't bring in virt_to_page() through its asm/io.h, then fix the
> > headers there please.
> 
> In the ARM case, yes, it brings virt_to_page() but I'm not sure that's
> the case for the other architectures. I think a better header is
> linux/mm.h which already uses this function in virt_to_head_page().
> 
For some reason I was thinking virt_to_phys() instead of virt_to_page()
when I wrote that, so just ignore me. linux/mm.h is obviously fine.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 17:29         ` Catalin Marinas
                             ` (3 preceding siblings ...)
  2010-02-02  9:11           ` Sebastian Andrzej Siewior
@ 2010-02-02 11:48           ` Oliver Neukum
  2010-02-02 12:01             ` Catalin Marinas
  2010-02-08  6:55             ` Pavel Machek
  5 siblings, 1 reply; 352+ messages in thread
From: Oliver Neukum @ 2010-02-02 11:48 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

Am Montag, 1. Februar 2010 18:29:14 schrieb Catalin Marinas:
> +       if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
> +               void *ptr;
> +               for (ptr = urb->transfer_buffer;
> +                    ptr < urb->transfer_buffer + urb->transfer_buffer_length;
> +                    ptr += PAGE_SIZE)
> +                       flush_dcache_page(virt_to_page(ptr));

Is it correct to limit this to BULK pipes?

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 11:48           ` Oliver Neukum
@ 2010-02-02 12:01             ` Catalin Marinas
  2010-02-02 12:07               ` Oliver Neukum
  0 siblings, 1 reply; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 12:01 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2010-02-02 at 11:48 +0000, Oliver Neukum wrote:
> Am Montag, 1. Februar 2010 18:29:14 schrieb Catalin Marinas:
> > +       if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
> > +               void *ptr;
> > +               for (ptr = urb->transfer_buffer;
> > +                    ptr < urb->transfer_buffer + urb->transfer_buffer_length;
> > +                    ptr += PAGE_SIZE)
> > +                       flush_dcache_page(virt_to_page(ptr));
> 
> Is it correct to limit this to BULK pipes?

I'm not entirely sure. The flush_dcache_page() should only be called for
pages that may be mapped into user space (page cache pages). We don't
need this for control buffers. It was my impression that what's coming
from the mass storage layer intended for page cache pages has the
PIPE_BULK type (I may be wrong though).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 12:01             ` Catalin Marinas
@ 2010-02-02 12:07               ` Oliver Neukum
  2010-02-02 12:11                 ` Andreas Mohr
  2010-02-02 12:39                 ` Catalin Marinas
  0 siblings, 2 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-02 12:07 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

Am Dienstag, 2. Februar 2010 13:01:12 schrieb Catalin Marinas:
> On Tue, 2010-02-02 at 11:48 +0000, Oliver Neukum wrote:
> > Am Montag, 1. Februar 2010 18:29:14 schrieb Catalin Marinas:
> > > +       if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
> > > +               void *ptr;
> > > +               for (ptr = urb->transfer_buffer;
> > > +                    ptr < urb->transfer_buffer + urb->transfer_buffer_length;
> > > +                    ptr += PAGE_SIZE)
> > > +                       flush_dcache_page(virt_to_page(ptr));
> > 
> > Is it correct to limit this to BULK pipes?
> 
> I'm not entirely sure. The flush_dcache_page() should only be called for
> pages that may be mapped into user space (page cache pages). We don't
> need this for control buffers. It was my impression that what's coming
> from the mass storage layer intended for page cache pages has the
> PIPE_BULK type (I may be wrong though).

For storage that is correct. But what about other sources of pages,
for example iSCSI?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 12:07               ` Oliver Neukum
@ 2010-02-02 12:11                 ` Andreas Mohr
  2010-02-02 14:42                   ` Clemens Ladisch
  2010-02-02 12:39                 ` Catalin Marinas
  1 sibling, 1 reply; 352+ messages in thread
From: Andreas Mohr @ 2010-02-02 12:11 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Catalin Marinas, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	linux-usb, linux-kernel, Sebastian Siewior, Greg KH

Hi,

On Tue, Feb 02, 2010 at 01:07:56PM +0100, Oliver Neukum wrote:
> Am Dienstag, 2. Februar 2010 13:01:12 schrieb Catalin Marinas:
> > On Tue, 2010-02-02 at 11:48 +0000, Oliver Neukum wrote:
> > > Am Montag, 1. Februar 2010 18:29:14 schrieb Catalin Marinas:
> > > Is it correct to limit this to BULK pipes?
> > 
> > I'm not entirely sure. The flush_dcache_page() should only be called for
> > pages that may be mapped into user space (page cache pages). We don't
> > need this for control buffers. It was my impression that what's coming
> > from the mass storage layer intended for page cache pages has the
> > PIPE_BULK type (I may be wrong though).
> 
> For storage that is correct. But what about other sources of pages,
> for example iSCSI?

Or... usb-audio? I should have verified that it is using bulk endpoints
(and thus the patch applies to my case).
usb-audio probably uses isochronous transfers, thus that would be
an obvious reason why the patch didn't work for me.
(with some other reason possibly being BCM4710 issues, of course)

Andreas Mohr

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 12:07               ` Oliver Neukum
  2010-02-02 12:11                 ` Andreas Mohr
@ 2010-02-02 12:39                 ` Catalin Marinas
  2010-02-02 13:08                   ` Oliver Neukum
  2010-02-02 13:36                   ` Ming Lei
  1 sibling, 2 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 12:39 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2010-02-02 at 12:07 +0000, Oliver Neukum wrote:
> Am Dienstag, 2. Februar 2010 13:01:12 schrieb Catalin Marinas:
> > On Tue, 2010-02-02 at 11:48 +0000, Oliver Neukum wrote:
> > > Am Montag, 1. Februar 2010 18:29:14 schrieb Catalin Marinas:
> > > > +       if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
> > > > +               void *ptr;
> > > > +               for (ptr = urb->transfer_buffer;
> > > > +                    ptr < urb->transfer_buffer + urb->transfer_buffer_length;
> > > > +                    ptr += PAGE_SIZE)
> > > > +                       flush_dcache_page(virt_to_page(ptr));
> > >
> > > Is it correct to limit this to BULK pipes?
> >
> > I'm not entirely sure. The flush_dcache_page() should only be called for
> > pages that may be mapped into user space (page cache pages). We don't
> > need this for control buffers. It was my impression that what's coming
> > from the mass storage layer intended for page cache pages has the
> > PIPE_BULK type (I may be wrong though).
> 
> For storage that is correct. But what about other sources of pages,
> for example iSCSI?

In the iSCSI case, does the HCD driver write directly to a page cache
page? Or it just fills in network packets that are copied to page cache
pages by the iSCSI code (sorry, I'm not familiar with this part of the
kernel). If the latter, the cache flushing in the HCD driver would not
help and it needs to be done in the iSCSI code.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 12:39                 ` Catalin Marinas
@ 2010-02-02 13:08                   ` Oliver Neukum
  2010-02-02 14:34                     ` Catalin Marinas
  2010-02-02 17:11                     ` Alan Stern
  2010-02-02 13:36                   ` Ming Lei
  1 sibling, 2 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-02 13:08 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

Am Dienstag, 2. Februar 2010 13:39:35 schrieb Catalin Marinas:
> > For storage that is correct. But what about other sources of pages,
> > for example iSCSI?
> 
> In the iSCSI case, does the HCD driver write directly to a page cache
> page? Or it just fills in network packets that are copied to page cache
> pages by the iSCSI code (sorry, I'm not familiar with this part of the
> kernel). If the latter, the cache flushing in the HCD driver would not
> help and it needs to be done in the iSCSI code.

As far as I can tell iSCSI does a private copy. But I don't know how
many methods to transfer code pages over USB exist. I'd say the
conservative solution is to flush for everything but control transfers.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 12:39                 ` Catalin Marinas
  2010-02-02 13:08                   ` Oliver Neukum
@ 2010-02-02 13:36                   ` Ming Lei
  2010-02-02 14:35                     ` Catalin Marinas
  1 sibling, 1 reply; 352+ messages in thread
From: Ming Lei @ 2010-02-02 13:36 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Oliver Neukum, Matthew Dharm, Sergei Shtylyov, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

2010/2/2 Catalin Marinas <catalin.marinas@arm.com>:

> In the iSCSI case, does the HCD driver write directly to a page cache
> page? Or it just fills in network packets that are copied to page cache
> pages by the iSCSI code (sorry, I'm not familiar with this part of the
> kernel). If the latter, the cache flushing in the HCD driver would not
> help and it needs to be done in the iSCSI code.

So we should flush dcache page in the place where the user mapped
page is copied to, instead of low level driver which does not do such
thing always.

-- 
Lei Ming

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 13:08                   ` Oliver Neukum
@ 2010-02-02 14:34                     ` Catalin Marinas
  2010-02-02 17:11                     ` Alan Stern
  1 sibling, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 14:34 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2010-02-02 at 13:08 +0000, Oliver Neukum wrote:
> Am Dienstag, 2. Februar 2010 13:39:35 schrieb Catalin Marinas:
> > > For storage that is correct. But what about other sources of pages,
> > > for example iSCSI?
> >
> > In the iSCSI case, does the HCD driver write directly to a page cache
> > page? Or it just fills in network packets that are copied to page cache
> > pages by the iSCSI code (sorry, I'm not familiar with this part of the
> > kernel). If the latter, the cache flushing in the HCD driver would not
> > help and it needs to be done in the iSCSI code.
> 
> As far as I can tell iSCSI does a private copy. But I don't know how
> many methods to transfer code pages over USB exist. I'd say the
> conservative solution is to flush for everything but control transfers.

flush_dcache_page() is on many architectures implemented lazily so that
if the page isn't mapped in user space no flushing takes place. It's
mainly the cost of virt_to_page() which I suspect is slightly higher
with sparsemem enabled.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 13:36                   ` Ming Lei
@ 2010-02-02 14:35                     ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 14:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Oliver Neukum, Matthew Dharm, Sergei Shtylyov, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2010-02-02 at 13:36 +0000, Ming Lei wrote:
> 2010/2/2 Catalin Marinas <catalin.marinas@arm.com>:
> 
> > In the iSCSI case, does the HCD driver write directly to a page cache
> > page? Or it just fills in network packets that are copied to page cache
> > pages by the iSCSI code (sorry, I'm not familiar with this part of the
> > kernel). If the latter, the cache flushing in the HCD driver would not
> > help and it needs to be done in the iSCSI code.
> 
> So we should flush dcache page in the place where the user mapped
> page is copied to, instead of low level driver which does not do such
> thing always.

Or both if you can't be sure whether the driver copies directly to a
page cache page.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 12:11                 ` Andreas Mohr
@ 2010-02-02 14:42                   ` Clemens Ladisch
  2010-02-02 14:52                     ` Oliver Neukum
  2010-02-02 20:38                     ` Andreas Mohr
  0 siblings, 2 replies; 352+ messages in thread
From: Clemens Ladisch @ 2010-02-02 14:42 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Oliver Neukum, Catalin Marinas, Matthew Dharm, Sergei Shtylyov,
	Ming Lei, linux-usb, linux-kernel, Sebastian Siewior, Greg KH

Andreas Mohr wrote:
> On Tue, Feb 02, 2010 at 01:07:56PM +0100, Oliver Neukum wrote:
> > Am Dienstag, 2. Februar 2010 13:01:12 schrieb Catalin Marinas:
> > > On Tue, 2010-02-02 at 11:48 +0000, Oliver Neukum wrote:
> > > > Am Montag, 1. Februar 2010 18:29:14 schrieb Catalin Marinas:
> > > > Is it correct to limit this to BULK pipes?
> > > 
> > > I'm not entirely sure. The flush_dcache_page() should only be called for
> > > pages that may be mapped into user space (page cache pages). We don't
> > > need this for control buffers. It was my impression that what's coming
> > > from the mass storage layer intended for page cache pages has the
> > > PIPE_BULK type (I may be wrong though).
> > 
> > For storage that is correct. But what about other sources of pages,
> > for example iSCSI?
> 
> Or... usb-audio? I should have verified that it is using bulk endpoints
> (and thus the patch applies to my case).
> usb-audio probably uses isochronous transfers, thus that would be
> an obvious reason why the patch didn't work for me.

snd-usb-audio indeed uses isochronous transfers, but those buffers are
never mapped into user space.  The intermediate vmalloc()ed buffer is,
however, and there was a bugfix for this recently.  Do you have these
patches in your tree?
http://git.kernel.org/?p=linux/kernel/git/tiwai/sound-2.6.git;a=commit;h=3e879d7bac705be4813a0ec9560cbe31db4b269f
http://git.kernel.org/?p=linux/kernel/git/tiwai/sound-2.6.git;a=commit;h=c32d977b8157bf67cdf47729ce7dd054a26eb534


Best regards,
Clemens

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 14:42                   ` Clemens Ladisch
@ 2010-02-02 14:52                     ` Oliver Neukum
  2010-02-02 15:10                       ` Andreas Mohr
  2010-02-02 20:38                     ` Andreas Mohr
  1 sibling, 1 reply; 352+ messages in thread
From: Oliver Neukum @ 2010-02-02 14:52 UTC (permalink / raw)
  To: Clemens Ladisch
  Cc: Andreas Mohr, Catalin Marinas, Matthew Dharm, Sergei Shtylyov,
	Ming Lei, linux-usb, linux-kernel, Sebastian Siewior, Greg KH

Am Dienstag, 2. Februar 2010 15:42:49 schrieb Clemens Ladisch:
> > Or... usb-audio? I should have verified that it is using bulk endpoints
> > (and thus the patch applies to my case).
> > usb-audio probably uses isochronous transfers, thus that would be
> > an obvious reason why the patch didn't work for me.
> 
> snd-usb-audio indeed uses isochronous transfers, but those buffers are
> never mapped into user space.  The intermediate vmalloc()ed buffer is,
> however, and there was a bugfix for this recently.  Do you have these
> patches in your tree?

Now that I think about it, several video drivers do map it to user space.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 14:52                     ` Oliver Neukum
@ 2010-02-02 15:10                       ` Andreas Mohr
  2010-02-02 15:34                         ` Catalin Marinas
  0 siblings, 1 reply; 352+ messages in thread
From: Andreas Mohr @ 2010-02-02 15:10 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Clemens Ladisch, Andreas Mohr, Catalin Marinas, Matthew Dharm,
	Sergei Shtylyov, Ming Lei, linux-usb, linux-kernel,
	Sebastian Siewior, Greg KH, Luke -Jr

[added another __bzero coherency crash victim, see
http://lkml.org/lkml/2008/6/9/14 ]

On Tue, Feb 02, 2010 at 03:52:19PM +0100, Oliver Neukum wrote:
> Am Dienstag, 2. Februar 2010 15:42:49 schrieb Clemens Ladisch:
> > > Or... usb-audio? I should have verified that it is using bulk endpoints
> > > (and thus the patch applies to my case).
> > > usb-audio probably uses isochronous transfers, thus that would be
> > > an obvious reason why the patch didn't work for me.
> > 
> > snd-usb-audio indeed uses isochronous transfers, but those buffers are
> > never mapped into user space.  The intermediate vmalloc()ed buffer is,
> > however, and there was a bugfix for this recently.  Do you have these
> > patches in your tree?
> 
> Now that I think about it, several video drivers do map it to user space.

OK, then the urb loop needs to also handle isochronous pipes,
and IMHO we should have a generic helper for this instead of open-coding
it, since it probably needs to be done in a couple affected HCDs
(and, most importantly, only on affected architectures - which the helper
could handle transparently).

Clemens: no, both of these patches haven't been applied (yet!!),
many thanks for the notification!

Will apply both patches and the isochronous addition, hopefully that
improves things (will be painful to check which of these things managed to
fix it - in case it does! -, though). Nope, will apply step by step,
both patches, then isochronous as a last resort.

Andreas Mohr

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 15:10                       ` Andreas Mohr
@ 2010-02-02 15:34                         ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 15:34 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Oliver Neukum, Clemens Ladisch, Matthew Dharm, Sergei Shtylyov,
	Ming Lei, linux-usb, linux-kernel, Sebastian Siewior, Greg KH,
	Luke -Jr

On Tue, 2010-02-02 at 15:10 +0000, Andreas Mohr wrote:
> [added another __bzero coherency crash victim, see
> http://lkml.org/lkml/2008/6/9/14 ]
> 
> On Tue, Feb 02, 2010 at 03:52:19PM +0100, Oliver Neukum wrote:
> > Am Dienstag, 2. Februar 2010 15:42:49 schrieb Clemens Ladisch:
> > > > Or... usb-audio? I should have verified that it is using bulk endpoints
> > > > (and thus the patch applies to my case).
> > > > usb-audio probably uses isochronous transfers, thus that would be
> > > > an obvious reason why the patch didn't work for me.
> > >
> > > snd-usb-audio indeed uses isochronous transfers, but those buffers are
> > > never mapped into user space.  The intermediate vmalloc()ed buffer is,
> > > however, and there was a bugfix for this recently.  Do you have these
> > > patches in your tree?
> >
> > Now that I think about it, several video drivers do map it to user space.
> 
> OK, then the urb loop needs to also handle isochronous pipes,
> and IMHO we should have a generic helper for this instead of open-coding
> it, since it probably needs to be done in a couple affected HCDs
> (and, most importantly, only on affected architectures - which the helper
> could handle transparently).

I'm planning to send a proposal to linux-arch for a flush_dcache_range()
function.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 13:08                   ` Oliver Neukum
  2010-02-02 14:34                     ` Catalin Marinas
@ 2010-02-02 17:11                     ` Alan Stern
  2010-02-02 17:20                       ` Catalin Marinas
  2010-02-08  6:55                       ` Pavel Machek
  1 sibling, 2 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-02 17:11 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Catalin Marinas, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	linux-usb, linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2 Feb 2010, Oliver Neukum wrote:

> Am Dienstag, 2. Februar 2010 13:39:35 schrieb Catalin Marinas:
> > > For storage that is correct. But what about other sources of pages,
> > > for example iSCSI?
> > 
> > In the iSCSI case, does the HCD driver write directly to a page cache
> > page? Or it just fills in network packets that are copied to page cache
> > pages by the iSCSI code (sorry, I'm not familiar with this part of the
> > kernel). If the latter, the cache flushing in the HCD driver would not
> > help and it needs to be done in the iSCSI code.
> 
> As far as I can tell iSCSI does a private copy. But I don't know how
> many methods to transfer code pages over USB exist. I'd say the
> conservative solution is to flush for everything but control transfers.

This doesn't make any sense.  Nobody would ever use isochronous 
transfers to store data into a code page because isochronous is 
unreliable.  (Audio isn't a counterexample -- audio data may be mapped 
to userspace, but only to data pages, not code pages.  And the problem 
here is to maintain consistency between the D and I caches.)

In principle interrupt transfers could be used, but it is most
unlikely.  They are intended for bounded-latency transfers, not
transfers of potentially large amounts of data.

The only transfer type that makes sense to worry about is bulk.

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 17:11                     ` Alan Stern
@ 2010-02-02 17:20                       ` Catalin Marinas
  2010-02-02 21:52                         ` Andreas Mohr
  2010-02-08  6:55                       ` Pavel Machek
  1 sibling, 1 reply; 352+ messages in thread
From: Catalin Marinas @ 2010-02-02 17:20 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	linux-usb, linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2010-02-02 at 17:11 +0000, Alan Stern wrote:
> On Tue, 2 Feb 2010, Oliver Neukum wrote:
> 
> > Am Dienstag, 2. Februar 2010 13:39:35 schrieb Catalin Marinas:
> > > > For storage that is correct. But what about other sources of pages,
> > > > for example iSCSI?
> > >
> > > In the iSCSI case, does the HCD driver write directly to a page cache
> > > page? Or it just fills in network packets that are copied to page cache
> > > pages by the iSCSI code (sorry, I'm not familiar with this part of the
> > > kernel). If the latter, the cache flushing in the HCD driver would not
> > > help and it needs to be done in the iSCSI code.
> >
> > As far as I can tell iSCSI does a private copy. But I don't know how
> > many methods to transfer code pages over USB exist. I'd say the
> > conservative solution is to flush for everything but control transfers.
> 
> This doesn't make any sense.  Nobody would ever use isochronous
> transfers to store data into a code page because isochronous is
> unreliable.  (Audio isn't a counterexample -- audio data may be mapped
> to userspace, but only to data pages, not code pages.  And the problem
> here is to maintain consistency between the D and I caches.)

My issues is with both I-D coherency and D-cache aliasing caused by
pages mapped in both user and kernel space (with different colours). The
flush_dcache_page() call should target both cases.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 14:42                   ` Clemens Ladisch
  2010-02-02 14:52                     ` Oliver Neukum
@ 2010-02-02 20:38                     ` Andreas Mohr
  1 sibling, 0 replies; 352+ messages in thread
From: Andreas Mohr @ 2010-02-02 20:38 UTC (permalink / raw)
  To: Clemens Ladisch
  Cc: Andreas Mohr, Oliver Neukum, Catalin Marinas, Matthew Dharm,
	Sergei Shtylyov, Ming Lei, linux-usb, linux-kernel,
	Sebastian Siewior, Greg KH, Luke -Jr

On Tue, Feb 02, 2010 at 03:42:49PM +0100, Clemens Ladisch wrote:
> Andreas Mohr wrote:
> > Or... usb-audio? I should have verified that it is using bulk endpoints
> > (and thus the patch applies to my case).
> > usb-audio probably uses isochronous transfers, thus that would be
> > an obvious reason why the patch didn't work for me.
> 
> snd-usb-audio indeed uses isochronous transfers, but those buffers are
> never mapped into user space.  The intermediate vmalloc()ed buffer is,
> however, and there was a bugfix for this recently.  Do you have these
> patches in your tree?
> http://git.kernel.org/?p=linux/kernel/git/tiwai/sound-2.6.git;a=commit;h=3e879d7bac705be4813a0ec9560cbe31db4b269f
> http://git.kernel.org/?p=linux/kernel/git/tiwai/sound-2.6.git;a=commit;h=c32d977b8157bf67cdf47729ce7dd054a26eb534

OK, I've now added both patches to my quilt series (and pushed everything),
rebuilt, reflashed image and copied modules, and it still bombs
just the very same way.
And this also with Catalins latest patch version (the one using != PIPE_CONTROL
to hit iso transfers etc. as well).
So it seems I still haven't got to the core of the issue despite all these
rather different patch attempts.

I'm afraid if it turns out that keeping open the sound device manually
via another process manages to workaround it, then I'll simply
give it all up completely and live with the current semi-satisfying situation
on my custom 2.6.31.9 build.

Any further ideas or patches that I could try?
(I might investigate the issue myself in a serious way sometime later,
but don't count on it)

Thanks!

Andreas Mohr

netconsole log (some previous crashes were at __bzero, now it was two times
at __copy_user - maybe the patches changed something for real?):

Instruction bus error, epc == 80004dd8, ra == 80000018
Oops[#1]:                                             
Cpu 0                                                 
$ 0   : 00000000 1000d000 00000000 00000000           
$ 4   : 7f9e6be8 81ee7ec4 00000004 00000000           
$ 8   : 00000000 00000000 00000000 81fac000           
$12   : 4b688a80 80340000 81d6e868 00000400           
$16   : 81ee7f00 00000000 7f9e6be8 00000001           
$20   : 81ee7eb8 00000000 7f9e6c9c 7f9eb320           
$24   : 00000000 2b565ed0                             
$28   : 81ee6000 81ee7ea8 7f9f6c98 80000018           
Hi    : 00000000                                      
Lo    : 00000000                                      
epc   : 80004dd8 __copy_user+0xd4/0x2bc               
    Not tainted                                       
ra    : 80000018 0x80000018                           
Status: 1000d003    KERNEL EXL IE                     
Cause : 00800018                                      
PrId  : 00029029 (Broadcom BCM3302)                   
Modules linked in: snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm evdev snd_timer snd_page_alloc snd_usb_lib snd_rawmidi usbhid snd_seq_device snd_hwdep hid snd input_core soundcore ipv6 arc4 ecb cryptomgr aead pcompress crypto_blkcipher crypto_hash crypto_algapi b43 mac80211 cfg80211                                                                    
Process mpd (pid: 1310, threadinfo=81ee6000, task=81d92838, tls=00000000)                                             
Stack : 00000000 800afd4c 81ee56f8 00000219 00000000 00000000 00000000 00000000                                       
        ffffffff 3b043616 81ee7f00 7f9e6be8 00000000 00000000 00000000 800b1370                                       
        7f9e6ca0 81ee5680 000182fc 00493a83 81ee7f00 8009f6b0 00000219 03b1daad                                       
        00000000 00002710 00000000 00000000 7f9f6ca0 7f9e6ca0 7f9ec528 00000127                                       
        00000000 800031f0 00000000 2ae49060 7f9f75a8 2ae49060 7f9e6be8 7f9f5bd8                                       
        ...                                                                                                           
Call Trace:                                                                                                           
[<80004dd8>] __copy_user+0xd4/0x2bc                                                                                   


Code: 8ca80000  24a50004  24c6fffc <ac880000> 1706fffb  24840004  10c00040  00864821  240a0020 
Disabling lock debugging due to kernel taint                                                   
Instruction bus error, epc == 80096fa0, ra == 80000018                                         
Oops[#2]:                                                                                      
Cpu 0                                                                                          
$ 0   : 00000000 1000d000 c0156064 00000064                                                    
$ 4   : 00000032 803b5514 00000032 81fa8000                                                    
$ 8   : 8037f840 00080000 81040000 00000003                                                    
$12   : 00000010 8037f840 00000004 00000000                                                    
$16   : 00000032 81fa8d54 2ab55000 00398f45                                                    
$20   : 2ab56000 0064d613 00000000 00000000                                                    
$24   : 00000000 80018ff0                                                                      
$28   : 81ee6000 81ee7c58 00000000 80000018                                                    
Hi    : 00000000                                                                               
Lo    : 00000000                                                                               
epc   : 80096fa0 swap_info_get+0x74/0xfc                                                       
    Tainted: G      D                                                                          
ra    : 80000018 0x80000018                                                                    
Status: 1000d003    KERNEL EXL IE                                                              
Cause : 00800018                                                                               
PrId  : 00029029 (Broadcom BCM3302)                                                            
Modules linked in: snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm evdev snd_timer snd_page_alloc snd_usb_lib snd_rawmidi usbhid snd_seq_device snd_hwdep hid snd input_core soundcore ipv6 arc4 ecb cryptomgr aead pcompress crypto_blkcipher crypto_hash crypto_algapi b43 mac80211 cfg80211                                                                    
Process mpd (pid: 1310, threadinfo=81ee6000, task=81d92838, tls=00000000)                                             
Stack : ffffffff 800736a0 00000004 81d6e870 80350fd0 80099498 81d929cc 81036e40                                       
        81fa8bfc 2aaff000 00064000 81fa8d54 2ab55000 80088a30 8033dab8 80029d78                                       
        803402d4 00000006 2ab55fff 8033d9c0 81eaae60 81f0c2a8 81f0c2a8 2ab56000                                       
        00000000 00000001 80c4c0bc 81eaae60 8037f840 81eaae94 81d92838 00000000                                       
        00000001 7f9eb320 7f9f6c98 8008db1c 81ee7d00 80c4ce90 00000000 ffffffff                                       
        ...                                                                                                           
Call Trace:                                                                                                           
[<80096fa0>] swap_info_get+0x74/0xfc                                                                                  
[<80099498>] free_swap_and_cache+0x1c/0x218                                                                           
[<80088a30>] unmap_vmas+0x418/0x63c                                                                                   
[<8008db1c>] exit_mmap+0xb8/0x148                                                                                     
[<8002e3c4>] mmput+0xc0/0x1d8                                                                                         
[<800333e8>] exit_mm+0x260/0x298
[<800357cc>] do_exit+0x1cc/0x688
[<80014658>] nmi_exception_handler+0x0/0x34


Code: 00041840  8ca20020  00431021 <94440000> 1480001d  8fbf0014  3c048030  3c05802c  24a5f280
Fixing recursive fault but reboot is needed!
Instruction bus error, epc == 80004dd8, ra == 80000018
Oops[#3]:
Cpu 0
$ 0   : 00000000 1000d000 00000000 00000000
$ 4   : 7fcd81b0 81c0ddd0 00000000 1000d001
$ 8   : 00000000 00000000 00000000 806f8000
$12   : 4b688a84 7f9f7f18 81d6e868 00000000
$16   : 00000004 00000000 81c0ddc0 81c0ddc0
$20   : 7fcd81b0 00000000 81c0ddcc 00000000
$24   : 00000000 2b565ed0
$28   : 81c0c000 81c0dd98 00000001 80000018
Hi    : 0000007d
Lo    : eb254400
epc   : 80004dd8 __copy_user+0xd4/0x2bc
    Tainted: G      D
ra    : 80000018 0x80000018
Status: 1000d003    KERNEL EXL IE
Cause : 00800018
PrId  : 00029029 (Broadcom BCM3302)
Modules linked in: snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm evdev snd_timer snd_page_alloc snd_usb_lib snd_rawmidi usbhid snd_seq_device snd_hwdep hid snd input_core soundcore ipv6 arc4 ecb cryptomgr aead pcompress crypto_blkcipher crypto_hash crypto_algapi b43 mac80211 cfg80211
Process init (pid: 1, threadinfo=81c0c000, task=81c08480, tls=00000000)
Stack : 00000000 81c0dda8 81c0df00 80350fb0 81c0ddc0 81c0ddc4 81c0ddc8 81c0ddcc
        81c0ddd0 81c0ddd4 00000400 00000000 00000000 00000000 00000000 00000000
        00000000 00000000 81d98000 ffffff9c 81c0dea8 0044a234 7fcd8598 8009b864
        00000001 81c0dea8 00000001 81d98000 ffffff9c 800a2d18 00000003 00000002
        00000003 00000003 0000000d 00000000 00000000 00000000 000000c9 00001180
        ...
Call Trace:
[<80004dd8>] __copy_user+0xd4/0x2bc


Code: 8ca80000  24a50004  24c6fffc <ac880000> 1706fffb  24840004  10c00040  00864821  240a0020
kobject: 'ep_01' (81e52f10): kobject_uevent_env
kobject: 'ep_01' (81e52f10): kobject_uevent_env: filter function caused the event to drop!
Kernel panic - not syncing: Attempted to kill init!


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 17:20                       ` Catalin Marinas
@ 2010-02-02 21:52                         ` Andreas Mohr
  2010-02-03 15:15                           ` Alan Stern
  0 siblings, 1 reply; 352+ messages in thread
From: Andreas Mohr @ 2010-02-02 21:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Alan Stern, Oliver Neukum, Matthew Dharm, Sergei Shtylyov,
	Ming Lei, linux-usb, linux-kernel, Sebastian Siewior, Greg KH

Hi,

On Tue, Feb 02, 2010 at 05:20:11PM +0000, Catalin Marinas wrote:
> My issues is with both I-D coherency and D-cache aliasing caused by
> pages mapped in both user and kernel space (with different colours). The
> flush_dcache_page() call should target both cases.

Yup, it does, and quite successfully at that (aka "at that point in time we
having nothing any more to worry about, everything dealt with" ;-)


usbcore: registered new interface driver ums-datafab                                                                  
hub 2-1:1.0: state 7 ports 2 chg 0002 evt 0000                                                                        
kobject: 'ums-freecom' (81de0a80): kobject_add_internal: parent: 'drivers', set: 'drivers'                            
hub 2-1:1.0: port 1, status 0101, change 0000, 12 Mb/s                                                                
kobject: 'ums-freecom' (81de0a80): kobject_uevent_env                                                                 
kobject: 'ums-freecom' (81de0a80): fill_kobj_path: path = '/bus/usb/drivers/ums-freecom'                              
usbcore: registered new interface driver ums-freecom                                                                  
kobject: 'ums-jumpshot' (81de0c80): kobject_add_internal: parent: 'drivers', set: 'drivers'                           
CPU 0 Unable to handle kernel paging request at virtual address 0000041c, epc == 800171e8, ra == 801da5dc             
Oops[#1]:                                                                                                             
Cpu 0                                                                                                                 
$ 0   : 00000000 10008000 803b0000 00010000                                                                           
$ 4   : 00000408 8143bc60 0043bc60 00000001                                                                           
$ 8   : 81dd7124 81dd7190 00000004 00000000                                                                           
$12   : 0000003b 80380000 00000002 f2d9b780                                                                           
$16   : a1de4020 803b0000 8037f840 81de7f00                                                                           
$20   : 00000000 81dd7080 80000000 00000000                                                                           
$24   : 00000000 80016bb8                                                                                             
$28   : 81c0c000 81c0da98 a1dd414c 801da5dc                                                                           
Hi    : 00000008                                                                                                      
Lo    : 00000000                                                                                                      
epc   : 800171e8 __flush_dcache_page+0x38/0x120
    Not tainted
ra    : 801da5dc ehci_urb_done+0x178/0x1dc
Status: 10008002    KERNEL EXL
Cause : 00805008
BadVA : 0000041c
PrId  : 00029029 (Broadcom BCM3302)
Modules linked in:
Process swapper (pid: 1, threadinfo=81c0c000, task=81c08480, tls=00000000)
Stack : 81dd7080 00000001 10009000 8033dab8 a1dd8120 a1dd4114 ffffff6a ffffff6a
        81de7f00 a1dd414c a1dd4100 801db39c 05b8d800 00000000 00000018 803a0000
        803a0000 0000054c 00000001 00000000 a1dd8180 81dd7080 00000000 a1dd4100
        00000000 81c0dbb8 00000000 80318d24 81dd7158 81dd7080 81dda004 801deb38
        81dd7158 8004f984 01f63104 0000003c 81c0dc78 8033feb8 00000008 00000042
        ...
Call Trace:
[<800171e8>] __flush_dcache_page+0x38/0x120
[<801da5dc>] ehci_urb_done+0x178/0x1dc
[<801db39c>] qh_completions+0x484/0x554
[<801deb38>] ehci_work+0x438/0xb68
[<801df2bc>] ehci_watchdog+0x54/0x94
[<8003d3ec>] run_timer_softirq+0x1b0/0x268
[<80037fbc>] __do_softirq+0xb8/0x174
[<800380d4>] do_softirq+0x5c/0x98
[<80038244>] irq_exit+0x40/0x88
[<8000e12c>] plat_irq_dispatch+0x60/0x178
[<80001444>] ret_from_irq+0x0/0x4
[<80031de8>] vprintk+0x36c/0x3bc
[<8000a48c>] printk+0x24/0x30
[<80151918>] kobject_add_internal+0x124/0x254
[<80151f80>] kobject_init_and_add+0x40/0x58
[<8018e854>] bus_add_driver+0xdc/0x2b4
[<801902c8>] driver_register+0xe0/0x19c
[<801ce000>] usb_register_driver+0x84/0x118
[<8000d640>] do_one_initcall+0x70/0x1f4
[<80354334>] kernel_init+0xd0/0x140
[<8000fb4c>] kernel_thread_helper+0x10/0x18


Code: 00000000  10800029  3c02803b <8c820014> 14400026  3c02803b  8c83001c  2482001c  14620021
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt



Any ideas? To my uncaring mind this would look like __flush_dcache_page()
not being quite so happy with a NULL pointer that it is being served
(although I haven't managed to precisely investigate yet where the
dereferencing offset 0000041c is coming from).

Yes, crash is reproducible (three times on boot already, although some bootup
does make it successfully).

My ehci-q.c has:

       if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) != PIPE_CONTROL) {
               void *ptr;
               for (ptr = urb->transfer_buffer;
                    ptr < urb->transfer_buffer + urb->transfer_buffer_length;
                    ptr += PAGE_SIZE)
                       flush_dcache_page(virt_to_page(ptr));
       }

Hmm, OTOH this code seems to postulate that urb->transfer_buffer_length
is that 0x41c from above...
(IOW the code is simply missing an urb->transfer_buffer NULL check)
OTOH there would also be the question whether flush_dcache_page() should
have caught the NULL pointer input...
And then there's the question whether urb->transfer_buffer is allowed to end
up as NULL anyway...



BTW, trying to keep open /dev/dsp by another app when closing the playback app
does not prevent the audio OOPS.


Been seeing a nano-tiny wee bit too many crashes these days,

Andreas Mohr

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 21:52                         ` Andreas Mohr
@ 2010-02-03 15:15                           ` Alan Stern
  0 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-03 15:15 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Catalin Marinas, Oliver Neukum, Matthew Dharm, Sergei Shtylyov,
	Ming Lei, linux-usb, linux-kernel, Sebastian Siewior, Greg KH

On Tue, 2 Feb 2010, Andreas Mohr wrote:

> Any ideas? To my uncaring mind this would look like __flush_dcache_page()
> not being quite so happy with a NULL pointer that it is being served
> (although I haven't managed to precisely investigate yet where the
> dereferencing offset 0000041c is coming from).
> 
> Yes, crash is reproducible (three times on boot already, although some bootup
> does make it successfully).
> 
> My ehci-q.c has:
> 
>        if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) != PIPE_CONTROL) {
>                void *ptr;
>                for (ptr = urb->transfer_buffer;
>                     ptr < urb->transfer_buffer + urb->transfer_buffer_length;
>                     ptr += PAGE_SIZE)
>                        flush_dcache_page(virt_to_page(ptr));
>        }
> 
> Hmm, OTOH this code seems to postulate that urb->transfer_buffer_length
> is that 0x41c from above...
> (IOW the code is simply missing an urb->transfer_buffer NULL check)
> OTOH there would also be the question whether flush_dcache_page() should
> have caught the NULL pointer input...
> And then there's the question whether urb->transfer_buffer is allowed to end
> up as NULL anyway...

Have you looked at the code in qh_urb_transaction() in ehci-q.c
involving this_sg_len and buf?  It's quite possible that
urb->transfer_buffer is a NULL pointer and that the actual buffer is
not a contiguous set of pages -- but only if DMA is used.

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-01 17:29         ` Catalin Marinas
@ 2010-02-08  6:55             ` Pavel Machek
  2010-02-01 22:30           ` Andreas Mohr
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-02-08  6:55 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH, linux-arm-kernel

Hi!

> > So, let's put this in the HCD drivers and be done with it.
> 
> The patch below is what fixes the I-D cache incoherency issues on ARM. I
> don't particularly like the solution but it seems to be the only one
> available.

Really? It looks like arm should just flush the caches when mapping
executable page to the userspace.... you can't expect all the drivers
to be modified like that...

Plus it does unneccessary flushes on x86, etc...

> @@ -904,6 +906,14 @@ __acquires(priv->lock)
>  			status = 0;
>  	}
>  
> +	if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
> +		void *ptr;
> +		for (ptr = urb->transfer_buffer;
> +		     ptr < urb->transfer_buffer + urb->transfer_buffer_length;
> +		     ptr += PAGE_SIZE)
> +			flush_dcache_page(virt_to_page(ptr));
> +	}
> +
>  	/* complete() can reenter this HCD */
>  	usb_hcd_unlink_urb_from_ep(priv_to_hcd(priv), urb);
>  	spin_unlock(&priv->lock);
> 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-08  6:55             ` Pavel Machek
  0 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-02-08  6:55 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > So, let's put this in the HCD drivers and be done with it.
> 
> The patch below is what fixes the I-D cache incoherency issues on ARM. I
> don't particularly like the solution but it seems to be the only one
> available.

Really? It looks like arm should just flush the caches when mapping
executable page to the userspace.... you can't expect all the drivers
to be modified like that...

Plus it does unneccessary flushes on x86, etc...

> @@ -904,6 +906,14 @@ __acquires(priv->lock)
>  			status = 0;
>  	}
>  
> +	if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
> +		void *ptr;
> +		for (ptr = urb->transfer_buffer;
> +		     ptr < urb->transfer_buffer + urb->transfer_buffer_length;
> +		     ptr += PAGE_SIZE)
> +			flush_dcache_page(virt_to_page(ptr));
> +	}
> +
>  	/* complete() can reenter this HCD */
>  	usb_hcd_unlink_urb_from_ep(priv_to_hcd(priv), urb);
>  	spin_unlock(&priv->lock);
> 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-02 17:11                     ` Alan Stern
  2010-02-02 17:20                       ` Catalin Marinas
@ 2010-02-08  6:55                       ` Pavel Machek
  1 sibling, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-02-08  6:55 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Catalin Marinas, Matthew Dharm, Sergei Shtylyov,
	Ming Lei, linux-usb, linux-kernel, Sebastian Siewior, Greg KH

On Tue 2010-02-02 12:11:25, Alan Stern wrote:
> On Tue, 2 Feb 2010, Oliver Neukum wrote:
> 
> > Am Dienstag, 2. Februar 2010 13:39:35 schrieb Catalin Marinas:
> > > > For storage that is correct. But what about other sources of pages,
> > > > for example iSCSI?
> > > 
> > > In the iSCSI case, does the HCD driver write directly to a page cache
> > > page? Or it just fills in network packets that are copied to page cache
> > > pages by the iSCSI code (sorry, I'm not familiar with this part of the
> > > kernel). If the latter, the cache flushing in the HCD driver would not
> > > help and it needs to be done in the iSCSI code.
> > 
> > As far as I can tell iSCSI does a private copy. But I don't know how
> > many methods to transfer code pages over USB exist. I'd say the
> > conservative solution is to flush for everything but control transfers.
> 
> This doesn't make any sense.  Nobody would ever use isochronous 
> transfers to store data into a code page because isochronous is 
> unreliable.  (Audio isn't a counterexample -- audio data may be

Why not?

Use isochronous transfer to load data, verify it is okay, exec it.

Or maybe someone is doing crashme testing with usb audio as random
generator :-).

Sure, unlikely, but...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-08  6:55             ` Pavel Machek
@ 2010-02-08  7:33               ` Andreas Mohr
  -1 siblings, 0 replies; 352+ messages in thread
From: Andreas Mohr @ 2010-02-08  7:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Catalin Marinas, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	linux-usb, linux-kernel, Sebastian Siewior, Greg KH,
	linux-arm-kernel

Hi,

On Mon, Feb 08, 2010 at 07:55:19AM +0100, Pavel Machek wrote:
> Plus it does unneccessary flushes on x86, etc...

Noticed that as well, there should be an arch-obeying helper for this.


On my MIPSEL, I had urb->transfer_buffer NULL ptr crashes
(I think that was expected in case of a certain DMA setup, Alan said).

However, even with NULL check added I still had:

hub 2-1.1:1.0: state 7 ports 7 chg 0000 evt 0010
Unhandled kernel unaligned access[#1]:
Cpu 0
$ 0   : 00000000 fffffffd 803b0000 00010000
$ 4   : 08002042 8143bfe0 0043bfe0 0000000d
$ 8   : 00000001 3b9aca00 c4653600 00000000
$12   : 00000049 3b9aca00 81dbc868 00000000
$16   : a1e00000 803b0000 8037f840 81dfaa80
$20   : 00000000 81dd5080 80000000 00000000
$24   : 00000000 80015a64
$28   : 8033a000 8033bc10 a1dd83cc 801da5e4
Hi    : 00000000
Lo    : 00000000
epc   : 800171e8 __flush_dcache_page+0x38/0x120
    Not tainted
ra    : 801da5e4 ehci_urb_done+0x180/0x1e4
Status: 10009002    KERNEL EXL
Cause : 00800010
BadVA : 08002056
PrId  : 00029029 (Broadcom BCM3302)
Modules linked in:
Process swapper (pid: 0, threadinfo=8033a000, task=8033c000, tls=00000000)
Stack : 00000000 00000000 81e04980 801c80ac a1dd9060 a1dd8394 ffffff6a ffffff6a
        81dfaa80 a1dd83cc a1dd8380 801db3a4 803a6a28 80068e9c 000003f8 00003fc0
        a1dd81cc 801dea58 00000001 00000000 a1dd9360 81dd5080 a1dd8380 10009001
        a1dd83cc 81dd5158 00000000 80318d44 81dd5158 00000001 00010031 801de8f4
        81dd5158 8033bce0 803a76a0 803a0000 8033d860 8004f924 00000219 00000043
        ...
Call Trace:
[<800171e8>] __flush_dcache_page+0x38/0x120
[<801da5e4>] ehci_urb_done+0x180/0x1e4
[<801db3a4>] qh_completions+0x484/0x554
[<801de8f4>] ehci_work+0x1ec/0xb68
[<801e2598>] ehci_irq+0x360/0x3a4
[<801c7cf8>] usb_hcd_irq+0x64/0x15c
[<80066d58>] handle_IRQ_event+0x90/0x280
[<80068e80>] handle_percpu_irq+0x48/0x9c
[<8000e228>] plat_irq_dispatch+0x15c/0x178
[<80001444>] ret_from_irq+0x0/0x4
[<80001680>] r4k_wait+0x20/0x40
[<8000fe34>] cpu_idle+0x30/0x60
[<80354a34>] start_kernel+0x338/0x350


Code: 00000000  10800029  3c02803b <8c820014> 14400026  3c02803b  8c83001c  2482001c  14620021
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt



Seems like BadVA : 08002056 really isn't as aligned (offset 0x6) as it should be.

I've given up on this now BTW, I'll wait until the dust has settled (i.e. some nice improvements
have found their way to the kernel) and retry in some months with a much newer kernel version
(currently patched-up 2.6.31.9) whether something remains to be fixed.
I'll work on more productive things such as submitting some waiting patches.

Andreas Mohr

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-08  7:33               ` Andreas Mohr
  0 siblings, 0 replies; 352+ messages in thread
From: Andreas Mohr @ 2010-02-08  7:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Mon, Feb 08, 2010 at 07:55:19AM +0100, Pavel Machek wrote:
> Plus it does unneccessary flushes on x86, etc...

Noticed that as well, there should be an arch-obeying helper for this.


On my MIPSEL, I had urb->transfer_buffer NULL ptr crashes
(I think that was expected in case of a certain DMA setup, Alan said).

However, even with NULL check added I still had:

hub 2-1.1:1.0: state 7 ports 7 chg 0000 evt 0010
Unhandled kernel unaligned access[#1]:
Cpu 0
$ 0   : 00000000 fffffffd 803b0000 00010000
$ 4   : 08002042 8143bfe0 0043bfe0 0000000d
$ 8   : 00000001 3b9aca00 c4653600 00000000
$12   : 00000049 3b9aca00 81dbc868 00000000
$16   : a1e00000 803b0000 8037f840 81dfaa80
$20   : 00000000 81dd5080 80000000 00000000
$24   : 00000000 80015a64
$28   : 8033a000 8033bc10 a1dd83cc 801da5e4
Hi    : 00000000
Lo    : 00000000
epc   : 800171e8 __flush_dcache_page+0x38/0x120
    Not tainted
ra    : 801da5e4 ehci_urb_done+0x180/0x1e4
Status: 10009002    KERNEL EXL
Cause : 00800010
BadVA : 08002056
PrId  : 00029029 (Broadcom BCM3302)
Modules linked in:
Process swapper (pid: 0, threadinfo=8033a000, task=8033c000, tls=00000000)
Stack : 00000000 00000000 81e04980 801c80ac a1dd9060 a1dd8394 ffffff6a ffffff6a
        81dfaa80 a1dd83cc a1dd8380 801db3a4 803a6a28 80068e9c 000003f8 00003fc0
        a1dd81cc 801dea58 00000001 00000000 a1dd9360 81dd5080 a1dd8380 10009001
        a1dd83cc 81dd5158 00000000 80318d44 81dd5158 00000001 00010031 801de8f4
        81dd5158 8033bce0 803a76a0 803a0000 8033d860 8004f924 00000219 00000043
        ...
Call Trace:
[<800171e8>] __flush_dcache_page+0x38/0x120
[<801da5e4>] ehci_urb_done+0x180/0x1e4
[<801db3a4>] qh_completions+0x484/0x554
[<801de8f4>] ehci_work+0x1ec/0xb68
[<801e2598>] ehci_irq+0x360/0x3a4
[<801c7cf8>] usb_hcd_irq+0x64/0x15c
[<80066d58>] handle_IRQ_event+0x90/0x280
[<80068e80>] handle_percpu_irq+0x48/0x9c
[<8000e228>] plat_irq_dispatch+0x15c/0x178
[<80001444>] ret_from_irq+0x0/0x4
[<80001680>] r4k_wait+0x20/0x40
[<8000fe34>] cpu_idle+0x30/0x60
[<80354a34>] start_kernel+0x338/0x350


Code: 00000000  10800029  3c02803b <8c820014> 14400026  3c02803b  8c83001c  2482001c  14620021
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt



Seems like BadVA : 08002056 really isn't as aligned (offset 0x6) as it should be.

I've given up on this now BTW, I'll wait until the dust has settled (i.e. some nice improvements
have found their way to the kernel) and retry in some months with a much newer kernel version
(currently patched-up 2.6.31.9) whether something remains to be fixed.
I'll work on more productive things such as submitting some waiting patches.

Andreas Mohr

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-08  6:55             ` Pavel Machek
@ 2010-02-08  9:51               ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-08  9:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH, linux-arm-kernel

Hi,

On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > So, let's put this in the HCD drivers and be done with it.
> >
> > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > don't particularly like the solution but it seems to be the only one
> > available.
> 
> Really? It looks like arm should just flush the caches when mapping
> executable page to the userspace.... you can't expect all the drivers
> to be modified like that...

We could of course flush the caches every time we get a page fault but
that's far from optimal, especially since DMA-capable drivers to do not
pollute the D-cache and don't need this extra flushing. Note that the
recent ARM processors have PIPT caches but separate for I and D and it's
the PIO drivers that pollute the D-cache.

The kernel API provides flush_dcache_page() to be called every time the
kernel writes to a page cache page. This is further optimised for
working in pair with update_mmu_cache() to delay the flushing until the
actual page is mapped into user space and this latter function is called
(which in general is not a cache maintenance function).

The problem with some PIO drivers and a filesystems like ext2 is that
there is no call to flush_dcache_page() when getting data into a page
cache page. Since the page isn't marked as dirty (PG_arch_1), a
subsequent call to update_mmu_cache() as a result of a page fault
doesn't flush the caches.

There is a flush_icache_page() function called from __do_fault(),
however, Documentation/cachetlb.txt states that all the functionality of
this function can be implemented in flush_dcache_page() and
update_mmu_cache(), hence this function is a no-op.

Please suggest a better solution that does not involve modifying generic
Linux code.

> Plus it does unneccessary flushes on x86, etc...

On x86, it should indeed be conditionally compiled based on
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.

Regards.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-08  9:51               ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-08  9:51 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > So, let's put this in the HCD drivers and be done with it.
> >
> > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > don't particularly like the solution but it seems to be the only one
> > available.
> 
> Really? It looks like arm should just flush the caches when mapping
> executable page to the userspace.... you can't expect all the drivers
> to be modified like that...

We could of course flush the caches every time we get a page fault but
that's far from optimal, especially since DMA-capable drivers to do not
pollute the D-cache and don't need this extra flushing. Note that the
recent ARM processors have PIPT caches but separate for I and D and it's
the PIO drivers that pollute the D-cache.

The kernel API provides flush_dcache_page() to be called every time the
kernel writes to a page cache page. This is further optimised for
working in pair with update_mmu_cache() to delay the flushing until the
actual page is mapped into user space and this latter function is called
(which in general is not a cache maintenance function).

The problem with some PIO drivers and a filesystems like ext2 is that
there is no call to flush_dcache_page() when getting data into a page
cache page. Since the page isn't marked as dirty (PG_arch_1), a
subsequent call to update_mmu_cache() as a result of a page fault
doesn't flush the caches.

There is a flush_icache_page() function called from __do_fault(),
however, Documentation/cachetlb.txt states that all the functionality of
this function can be implemented in flush_dcache_page() and
update_mmu_cache(), hence this function is a no-op.

Please suggest a better solution that does not involve modifying generic
Linux code.

> Plus it does unneccessary flushes on x86, etc...

On x86, it should indeed be conditionally compiled based on
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.

Regards.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-08  9:51               ` Catalin Marinas
@ 2010-02-08 10:03                 ` Andy Green
  -1 siblings, 0 replies; 352+ messages in thread
From: Andy Green @ 2010-02-08 10:03 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Pavel Machek, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, Greg KH,
	linux-arm-kernel

On 02/08/10 10:51, Somebody in the thread at some point said:

> We could of course flush the caches every time we get a page fault but
> that's far from optimal, especially since DMA-capable drivers to do not
> pollute the D-cache and don't need this extra flushing. Note that the
> recent ARM processors have PIPT caches but separate for I and D and it's
> the PIO drivers that pollute the D-cache.

Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the 
moment, for lack of any platform DMA support of its unusual DMA engine.

-Andy

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-08 10:03                 ` Andy Green
  0 siblings, 0 replies; 352+ messages in thread
From: Andy Green @ 2010-02-08 10:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/08/10 10:51, Somebody in the thread at some point said:

> We could of course flush the caches every time we get a page fault but
> that's far from optimal, especially since DMA-capable drivers to do not
> pollute the D-cache and don't need this extra flushing. Note that the
> recent ARM processors have PIPT caches but separate for I and D and it's
> the PIO drivers that pollute the D-cache.

Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the 
moment, for lack of any platform DMA support of its unusual DMA engine.

-Andy

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-08  7:33               ` Andreas Mohr
@ 2010-02-08 10:19                 ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-08 10:19 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Pavel Machek, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	linux-usb, linux-kernel, Sebastian Siewior, Greg KH,
	linux-arm-kernel

On Mon, 2010-02-08 at 07:33 +0000, Andreas Mohr wrote:
> On Mon, Feb 08, 2010 at 07:55:19AM +0100, Pavel Machek wrote:
> > Plus it does unneccessary flushes on x86, etc...
> 
> Noticed that as well, there should be an arch-obeying helper for this.
> 
> 
> On my MIPSEL, I had urb->transfer_buffer NULL ptr crashes
> (I think that was expected in case of a certain DMA setup, Alan said).
> 
> However, even with NULL check added I still had:
> 
> hub 2-1.1:1.0: state 7 ports 7 chg 0000 evt 0010
> Unhandled kernel unaligned access[#1]:

Just to avoid confusion - that's a similar patch applied to a different
driver. The ISP1760 HCD driver works fine with my patch (transfer_buffer
never seems to be NULL with latest mainline). I can't comment on the
ehci-q.c driver (it looks like it has some support for DMA while my
patch only applies to PIO drivers where transfer_buffer should be set).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-08 10:19                 ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-08 10:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2010-02-08 at 07:33 +0000, Andreas Mohr wrote:
> On Mon, Feb 08, 2010 at 07:55:19AM +0100, Pavel Machek wrote:
> > Plus it does unneccessary flushes on x86, etc...
> 
> Noticed that as well, there should be an arch-obeying helper for this.
> 
> 
> On my MIPSEL, I had urb->transfer_buffer NULL ptr crashes
> (I think that was expected in case of a certain DMA setup, Alan said).
> 
> However, even with NULL check added I still had:
> 
> hub 2-1.1:1.0: state 7 ports 7 chg 0000 evt 0010
> Unhandled kernel unaligned access[#1]:

Just to avoid confusion - that's a similar patch applied to a different
driver. The ISP1760 HCD driver works fine with my patch (transfer_buffer
never seems to be NULL with latest mainline). I can't comment on the
ehci-q.c driver (it looks like it has some support for DMA while my
patch only applies to PIO drivers where transfer_buffer should be set).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-08  9:51               ` Catalin Marinas
@ 2010-02-08 10:52                 ` Pavel Machek
  -1 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-02-08 10:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH, linux-arm-kernel

> Hi,
> 
> On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > So, let's put this in the HCD drivers and be done with it.
> > >
> > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > don't particularly like the solution but it seems to be the only one
> > > available.
> > 
> > Really? It looks like arm should just flush the caches when mapping
> > executable page to the userspace.... you can't expect all the drivers
> > to be modified like that...
> 
> We could of course flush the caches every time we get a page fault but
> that's far from optimal, especially since DMA-capable drivers to do
> not

Maybe far for optimal, but it is something that should be done,
_first_. Correctness is more important than performance, and you can't
expect all drivers to behave like you want them.

Then you can add optimalizations not to do the flushes on drivers you
audited and where you care...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-08 10:52                 ` Pavel Machek
  0 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-02-08 10:52 UTC (permalink / raw)
  To: linux-arm-kernel

> Hi,
> 
> On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > So, let's put this in the HCD drivers and be done with it.
> > >
> > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > don't particularly like the solution but it seems to be the only one
> > > available.
> > 
> > Really? It looks like arm should just flush the caches when mapping
> > executable page to the userspace.... you can't expect all the drivers
> > to be modified like that...
> 
> We could of course flush the caches every time we get a page fault but
> that's far from optimal, especially since DMA-capable drivers to do
> not

Maybe far for optimal, but it is something that should be done,
_first_. Correctness is more important than performance, and you can't
expect all drivers to behave like you want them.

Then you can add optimalizations not to do the flushes on drivers you
audited and where you care...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-08 10:52                 ` Pavel Machek
@ 2010-02-08 11:28                   ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-08 11:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, linux-usb,
	linux-kernel, Sebastian Siewior, Greg KH, linux-arm-kernel

On Mon, 2010-02-08 at 10:52 +0000, Pavel Machek wrote:
> > On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > > So, let's put this in the HCD drivers and be done with it.
> > > >
> > > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > > don't particularly like the solution but it seems to be the only one
> > > > available.
> > >
> > > Really? It looks like arm should just flush the caches when mapping
> > > executable page to the userspace.... you can't expect all the drivers
> > > to be modified like that...
> >
> > We could of course flush the caches every time we get a page fault but
> > that's far from optimal, especially since DMA-capable drivers to do
> > not
> 
> Maybe far for optimal, but it is something that should be done,
> _first_. Correctness is more important than performance, and you can't
> expect all drivers to behave like you want them.

I wouldn't call heavy cache flushing "correctness". We could as well
disable the caches so that it is fully coherent.

The arch code follows an API defined in cachetlb.txt but the PIO drivers
don't (some do, like mmci.c). It may be inconvenient to call
flush_dcache_page() in the driver, hence I started a discussion on
linux-arch on a PIO mapping API that x86 or other fully coherent
architectures can leave it as no-ops.

> Then you can add optimalizations not to do the flushes on drivers you
> audited and where you care...

Sorry but that's not really feasible (unless I don't fully understand
what you mean) - if we do the cache flushing on the fault handling path
in the arch code, there is no way for the arch code to know what driver
is doing, unless we make this conditionally compiled with something like
CONFIG_ARCH_NEEDS_HEAVY_FLUSHING.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-08 11:28                   ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-08 11:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2010-02-08 at 10:52 +0000, Pavel Machek wrote:
> > On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > > So, let's put this in the HCD drivers and be done with it.
> > > >
> > > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > > don't particularly like the solution but it seems to be the only one
> > > > available.
> > >
> > > Really? It looks like arm should just flush the caches when mapping
> > > executable page to the userspace.... you can't expect all the drivers
> > > to be modified like that...
> >
> > We could of course flush the caches every time we get a page fault but
> > that's far from optimal, especially since DMA-capable drivers to do
> > not
> 
> Maybe far for optimal, but it is something that should be done,
> _first_. Correctness is more important than performance, and you can't
> expect all drivers to behave like you want them.

I wouldn't call heavy cache flushing "correctness". We could as well
disable the caches so that it is fully coherent.

The arch code follows an API defined in cachetlb.txt but the PIO drivers
don't (some do, like mmci.c). It may be inconvenient to call
flush_dcache_page() in the driver, hence I started a discussion on
linux-arch on a PIO mapping API that x86 or other fully coherent
architectures can leave it as no-ops.

> Then you can add optimalizations not to do the flushes on drivers you
> audited and where you care...

Sorry but that's not really feasible (unless I don't fully understand
what you mean) - if we do the cache flushing on the fault handling path
in the arch code, there is no way for the arch code to know what driver
is doing, unless we make this conditionally compiled with something like
CONFIG_ARCH_NEEDS_HEAVY_FLUSHING.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-08 11:28                   ` Catalin Marinas
@ 2010-02-16  7:57                     ` Shilimkar, Santosh
  -1 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16  7:57 UTC (permalink / raw)
  To: Catalin Marinas, Pavel Machek, Greg KH, Russell King - ARM Linux
  Cc: Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

> -----Original Message-----
> From: linux-arm-kernel-bounces@lists.infradead.org [mailto:linux-arm-kernel-
> bounces@lists.infradead.org] On Behalf Of Catalin Marinas
> Sent: Monday, February 08, 2010 4:58 PM
> To: Pavel Machek
> Cc: Matthew Dharm; Sergei Shtylyov; Ming Lei; Sebastian Siewior; linux-usb@vger.kernel.org; linux-
> kernel; Greg KH; linux-arm-kernel
> Subject: Re: USB mass storage and ARM cache coherency
> 
> On Mon, 2010-02-08 at 10:52 +0000, Pavel Machek wrote:
> > > On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > > > So, let's put this in the HCD drivers and be done with it.
> > > > >
> > > > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > > > don't particularly like the solution but it seems to be the only one
> > > > > available.
> > > >
> > > > Really? It looks like arm should just flush the caches when mapping
> > > > executable page to the userspace.... you can't expect all the drivers
> > > > to be modified like that...
> > >
> > > We could of course flush the caches every time we get a page fault but
> > > that's far from optimal, especially since DMA-capable drivers to do
> > > not
> >
> > Maybe far for optimal, but it is something that should be done,
> > _first_. Correctness is more important than performance, and you can't
> > expect all drivers to behave like you want them.
> 
> I wouldn't call heavy cache flushing "correctness". We could as well
> disable the caches so that it is fully coherent.
> 
> The arch code follows an API defined in cachetlb.txt but the PIO drivers
> don't (some do, like mmci.c). It may be inconvenient to call
> flush_dcache_page() in the driver, hence I started a discussion on
> linux-arch on a PIO mapping API that x86 or other fully coherent
> architectures can leave it as no-ops.
> 
> > Then you can add optimalizations not to do the flushes on drivers you
> > audited and where you care...
> 
> Sorry but that's not really feasible (unless I don't fully understand
> what you mean) - if we do the cache flushing on the fault handling path
> in the arch code, there is no way for the arch code to know what driver
> is doing, unless we make this conditionally compiled with something like
> CONFIG_ARCH_NEEDS_HEAVY_FLUSHING.


Continuing on the USB issue w.r.t cache coherency, the usb host
code is violating the buffer ownership rules of streaming APIs from
dma and non-dma transfers point if view.

We have a below temporary patch to get around the issue and probably it
needs to be fixed in the right way in the stack because some controllers
may not have PIO option even for control transfers. (e.g. Synopsis EHCI
controller)

From: Maulik Mankad <x0082077@ti.com>

USB: Avoid DMA map/unmap of control transfer buffers.

This patch avoids the DMA mapping of buffers for control
transfers.

Signed-off-by: Maulik Mankad <x0082077@ti.com>
---
Index: omap4_integration/drivers/usb/core/hcd.c
===================================================================
--- omap4_integration.orig/drivers/usb/core/hcd.c
+++ omap4_integration/drivers/usb/core/hcd.c
@@ -1274,6 +1274,10 @@ static int map_urb_for_dma(struct usb_hc
 	if (is_root_hub(urb->dev))
 		return 0;
 
+	if (usb_endpoint_xfer_control(&urb->ep->desc))
+		urb->transfer_flags = URB_NO_SETUP_DMA_MAP |
+					URB_NO_TRANSFER_DMA_MAP;
+
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
 		if (hcd->self.uses_dma) {

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16  7:57                     ` Shilimkar, Santosh
  0 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16  7:57 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-kernel-
> bounces at lists.infradead.org] On Behalf Of Catalin Marinas
> Sent: Monday, February 08, 2010 4:58 PM
> To: Pavel Machek
> Cc: Matthew Dharm; Sergei Shtylyov; Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-
> kernel; Greg KH; linux-arm-kernel
> Subject: Re: USB mass storage and ARM cache coherency
> 
> On Mon, 2010-02-08 at 10:52 +0000, Pavel Machek wrote:
> > > On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > > > So, let's put this in the HCD drivers and be done with it.
> > > > >
> > > > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > > > don't particularly like the solution but it seems to be the only one
> > > > > available.
> > > >
> > > > Really? It looks like arm should just flush the caches when mapping
> > > > executable page to the userspace.... you can't expect all the drivers
> > > > to be modified like that...
> > >
> > > We could of course flush the caches every time we get a page fault but
> > > that's far from optimal, especially since DMA-capable drivers to do
> > > not
> >
> > Maybe far for optimal, but it is something that should be done,
> > _first_. Correctness is more important than performance, and you can't
> > expect all drivers to behave like you want them.
> 
> I wouldn't call heavy cache flushing "correctness". We could as well
> disable the caches so that it is fully coherent.
> 
> The arch code follows an API defined in cachetlb.txt but the PIO drivers
> don't (some do, like mmci.c). It may be inconvenient to call
> flush_dcache_page() in the driver, hence I started a discussion on
> linux-arch on a PIO mapping API that x86 or other fully coherent
> architectures can leave it as no-ops.
> 
> > Then you can add optimalizations not to do the flushes on drivers you
> > audited and where you care...
> 
> Sorry but that's not really feasible (unless I don't fully understand
> what you mean) - if we do the cache flushing on the fault handling path
> in the arch code, there is no way for the arch code to know what driver
> is doing, unless we make this conditionally compiled with something like
> CONFIG_ARCH_NEEDS_HEAVY_FLUSHING.


Continuing on the USB issue w.r.t cache coherency, the usb host
code is violating the buffer ownership rules of streaming APIs from
dma and non-dma transfers point if view.

We have a below temporary patch to get around the issue and probably it
needs to be fixed in the right way in the stack because some controllers
may not have PIO option even for control transfers. (e.g. Synopsis EHCI
controller)

From: Maulik Mankad <x0082077@ti.com>

USB: Avoid DMA map/unmap of control transfer buffers.

This patch avoids the DMA mapping of buffers for control
transfers.

Signed-off-by: Maulik Mankad <x0082077@ti.com>
---
Index: omap4_integration/drivers/usb/core/hcd.c
===================================================================
--- omap4_integration.orig/drivers/usb/core/hcd.c
+++ omap4_integration/drivers/usb/core/hcd.c
@@ -1274,6 +1274,10 @@ static int map_urb_for_dma(struct usb_hc
 	if (is_root_hub(urb->dev))
 		return 0;
 
+	if (usb_endpoint_xfer_control(&urb->ep->desc))
+		urb->transfer_flags = URB_NO_SETUP_DMA_MAP |
+					URB_NO_TRANSFER_DMA_MAP;
+
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
 		if (hcd->self.uses_dma) {

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  7:57                     ` Shilimkar, Santosh
@ 2010-02-16  8:22                       ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16  8:22 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Catalin Marinas, Pavel Machek, Greg KH, Russell King - ARM Linux,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Am Dienstag, 16. Februar 2010 08:57:53 schrieb Shilimkar, Santosh:
> Continuing on the USB issue w.r.t cache coherency, the usb host
> code is violating the buffer ownership rules of streaming APIs from
> dma and non-dma transfers point if view.
> 
> We have a below temporary patch to get around the issue and probably it
> needs to be fixed in the right way in the stack because some controllers
> may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> controller)

This seems wrong to me. Buffers for control transfers may be transfered
by DMA, so the caches must be flushed on architectures whose caches
are not coherent with respect to DMA.

Would you care to elaborate on the exact nature of the bug you are fixing?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16  8:22                       ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16  8:22 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 08:57:53 schrieb Shilimkar, Santosh:
> Continuing on the USB issue w.r.t cache coherency, the usb host
> code is violating the buffer ownership rules of streaming APIs from
> dma and non-dma transfers point if view.
> 
> We have a below temporary patch to get around the issue and probably it
> needs to be fixed in the right way in the stack because some controllers
> may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> controller)

This seems wrong to me. Buffers for control transfers may be transfered
by DMA, so the caches must be flushed on architectures whose caches
are not coherent with respect to DMA.

Would you care to elaborate on the exact nature of the bug you are fixing?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  7:57                     ` Shilimkar, Santosh
@ 2010-02-16  8:44                       ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-16  8:44 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Catalin Marinas, Pavel Machek, Greg KH, Matthew Dharm,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, linux-usb,
	linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

On Tue, Feb 16, 2010 at 01:27:53PM +0530, Shilimkar, Santosh wrote:
> Continuing on the USB issue w.r.t cache coherency, the usb host
> code is violating the buffer ownership rules of streaming APIs from
> dma and non-dma transfers point if view.
> 
> We have a below temporary patch to get around the issue and probably it
> needs to be fixed in the right way in the stack because some controllers
> may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> controller)

        if (usb_endpoint_xfer_control(&urb->ep->desc)
            && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
                if (hcd->self.uses_dma) {		<=================
                        urb->setup_dma = dma_map_single(
                                        hcd->self.controller,
                                        urb->setup_packet,
                                        sizeof(struct usb_ctrlrequest),
                                        DMA_TO_DEVICE);

struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
                struct device *dev, const char *bus_name)
{
...
        hcd->self.uses_dma = (dev->dma_mask != NULL);

Is it easier to make sure that PIO devices don't have dev->dma_mask set?

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16  8:44                       ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-16  8:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 16, 2010 at 01:27:53PM +0530, Shilimkar, Santosh wrote:
> Continuing on the USB issue w.r.t cache coherency, the usb host
> code is violating the buffer ownership rules of streaming APIs from
> dma and non-dma transfers point if view.
> 
> We have a below temporary patch to get around the issue and probably it
> needs to be fixed in the right way in the stack because some controllers
> may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> controller)

        if (usb_endpoint_xfer_control(&urb->ep->desc)
            && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
                if (hcd->self.uses_dma) {		<=================
                        urb->setup_dma = dma_map_single(
                                        hcd->self.controller,
                                        urb->setup_packet,
                                        sizeof(struct usb_ctrlrequest),
                                        DMA_TO_DEVICE);

struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
                struct device *dev, const char *bus_name)
{
...
        hcd->self.uses_dma = (dev->dma_mask != NULL);

Is it easier to make sure that PIO devices don't have dev->dma_mask set?

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-16  8:44                       ` Russell King - ARM Linux
@ 2010-02-16  8:51                         ` Gadiyar, Anand
  -1 siblings, 0 replies; 352+ messages in thread
From: Gadiyar, Anand @ 2010-02-16  8:51 UTC (permalink / raw)
  To: Russell King - ARM Linux, Shilimkar, Santosh
  Cc: Catalin Marinas, Pavel Machek, Greg KH, Matthew Dharm,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, linux-usb,
	linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Russell King - ARM Linux wrote:
> On Tue, Feb 16, 2010 at 01:27:53PM +0530, Shilimkar, Santosh wrote:
> > Continuing on the USB issue w.r.t cache coherency, the usb host
> > code is violating the buffer ownership rules of streaming APIs from
> > dma and non-dma transfers point if view.
> > 
> > We have a below temporary patch to get around the issue and probably it
> > needs to be fixed in the right way in the stack because some controllers
> > may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> > controller)
> 
>         if (usb_endpoint_xfer_control(&urb->ep->desc)
>             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
>                 if (hcd->self.uses_dma) {		<=================
>                         urb->setup_dma = dma_map_single(
>                                         hcd->self.controller,
>                                         urb->setup_packet,
>                                         sizeof(struct usb_ctrlrequest),
>                                         DMA_TO_DEVICE);
> 
> struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
>                 struct device *dev, const char *bus_name)
> {
> ...
>         hcd->self.uses_dma = (dev->dma_mask != NULL);
> 
> Is it easier to make sure that PIO devices don't have dev->dma_mask set?

Not really. For instance, in the case of the DMA engine in the MUSB
controller in OMAP3, we can only use DMA with endpoints other than
EP0, and EP0 is what is used for control transfers.

It's not PIO for all the endpoints or DMA for all of them.

- Anand

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16  8:51                         ` Gadiyar, Anand
  0 siblings, 0 replies; 352+ messages in thread
From: Gadiyar, Anand @ 2010-02-16  8:51 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Tue, Feb 16, 2010 at 01:27:53PM +0530, Shilimkar, Santosh wrote:
> > Continuing on the USB issue w.r.t cache coherency, the usb host
> > code is violating the buffer ownership rules of streaming APIs from
> > dma and non-dma transfers point if view.
> > 
> > We have a below temporary patch to get around the issue and probably it
> > needs to be fixed in the right way in the stack because some controllers
> > may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> > controller)
> 
>         if (usb_endpoint_xfer_control(&urb->ep->desc)
>             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
>                 if (hcd->self.uses_dma) {		<=================
>                         urb->setup_dma = dma_map_single(
>                                         hcd->self.controller,
>                                         urb->setup_packet,
>                                         sizeof(struct usb_ctrlrequest),
>                                         DMA_TO_DEVICE);
> 
> struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
>                 struct device *dev, const char *bus_name)
> {
> ...
>         hcd->self.uses_dma = (dev->dma_mask != NULL);
> 
> Is it easier to make sure that PIO devices don't have dev->dma_mask set?

Not really. For instance, in the case of the DMA engine in the MUSB
controller in OMAP3, we can only use DMA with endpoints other than
EP0, and EP0 is what is used for control transfers.

It's not PIO for all the endpoints or DMA for all of them.

- Anand

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-16  8:22                       ` Oliver Neukum
@ 2010-02-16  8:55                         ` Shilimkar, Santosh
  -1 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16  8:55 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Catalin Marinas, Pavel Machek, Greg KH, Russell King - ARM Linux,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver@neukum.org]
> Sent: Tuesday, February 16, 2010 1:53 PM
> To: Shilimkar, Santosh
> Cc: Catalin Marinas; Pavel Machek; Greg KH; Russell King - ARM Linux; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb@vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 08:57:53 schrieb Shilimkar, Santosh:
> > Continuing on the USB issue w.r.t cache coherency, the usb host
> > code is violating the buffer ownership rules of streaming APIs from
> > dma and non-dma transfers point if view.
> >
> > We have a below temporary patch to get around the issue and probably it
> > needs to be fixed in the right way in the stack because some controllers
> > may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> > controller)
> 
> This seems wrong to me. Buffers for control transfers may be transfered
> by DMA, so the caches must be flushed on architectures whose caches
> are not coherent with respect to DMA.
Indeed and that's what I mentioned in the comment. But we shouldn't have dma 
cache maintenance operations done for the buffers which would use pio based transfer. 
> Would you care to elaborate on the exact nature of the bug you are fixing?
On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
transfer buffers are corrupted. On our platform, we use PIO mode for control 
transfers and DMA for bulk transfers.

The current stack performs dma cache maintenance even for the PIO transfers
which leads to the corruption issue. The control buffers are handled by CPU 
and they already coherent from CPU point of view.


Regards,
Santosh


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16  8:55                         ` Shilimkar, Santosh
  0 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16  8:55 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 1:53 PM
> To: Shilimkar, Santosh
> Cc: Catalin Marinas; Pavel Machek; Greg KH; Russell King - ARM Linux; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 08:57:53 schrieb Shilimkar, Santosh:
> > Continuing on the USB issue w.r.t cache coherency, the usb host
> > code is violating the buffer ownership rules of streaming APIs from
> > dma and non-dma transfers point if view.
> >
> > We have a below temporary patch to get around the issue and probably it
> > needs to be fixed in the right way in the stack because some controllers
> > may not have PIO option even for control transfers. (e.g. Synopsis EHCI
> > controller)
> 
> This seems wrong to me. Buffers for control transfers may be transfered
> by DMA, so the caches must be flushed on architectures whose caches
> are not coherent with respect to DMA.
Indeed and that's what I mentioned in the comment. But we shouldn't have dma 
cache maintenance operations done for the buffers which would use pio based transfer. 
> Would you care to elaborate on the exact nature of the bug you are fixing?
On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
transfer buffers are corrupted. On our platform, we use PIO mode for control 
transfers and DMA for bulk transfers.

The current stack performs dma cache maintenance even for the PIO transfers
which leads to the corruption issue. The control buffers are handled by CPU 
and they already coherent from CPU point of view.


Regards,
Santosh

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  8:55                         ` Shilimkar, Santosh
@ 2010-02-16  9:07                           ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16  9:07 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Catalin Marinas, Pavel Machek, Greg KH, Russell King - ARM Linux,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > This seems wrong to me. Buffers for control transfers may be transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> Indeed and that's what I mentioned in the comment. But we shouldn't have dma 
> cache maintenance operations done for the buffers which would use pio based transfer.

Given that the generic layer can't know which buffers will be used for DMA
that would require a callback into the hcd driver.

> > Would you care to elaborate on the exact nature of the bug you are fixing?
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control 
> transfers and DMA for bulk transfers.
> 
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU 
> and they already coherent from CPU point of view.

How does the mapping corrupt buffers? It might impact performance, but why
do you see corruption?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16  9:07                           ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16  9:07 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > This seems wrong to me. Buffers for control transfers may be transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> Indeed and that's what I mentioned in the comment. But we shouldn't have dma 
> cache maintenance operations done for the buffers which would use pio based transfer.

Given that the generic layer can't know which buffers will be used for DMA
that would require a callback into the hcd driver.

> > Would you care to elaborate on the exact nature of the bug you are fixing?
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control 
> transfers and DMA for bulk transfers.
> 
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU 
> and they already coherent from CPU point of view.

How does the mapping corrupt buffers? It might impact performance, but why
do you see corruption?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  9:07                           ` Oliver Neukum
@ 2010-02-16  9:39                             ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-16  9:39 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Shilimkar, Santosh, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

On Tue, Feb 16, 2010 at 10:07:20AM +0100, Oliver Neukum wrote:
> Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > > Would you care to elaborate on the exact nature of the bug you are fixing?
> > On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> > transfer buffers are corrupted. On our platform, we use PIO mode for control 
> > transfers and DMA for bulk transfers.
> > 
> > The current stack performs dma cache maintenance even for the PIO transfers
> > which leads to the corruption issue. The control buffers are handled by CPU 
> > and they already coherent from CPU point of view.
> 
> How does the mapping corrupt buffers? It might impact performance, but why
> do you see corruption?

On map, buffers are cleaned if they're being used for DMA_TO_DEVICE and
DMA_BIDIRECTIONAL, or invalidated in the case of DMA_FROM_DEVICE.

However, because ARM CPUs can now speculatively prefetch, just leaving it
at that results in corruption of buffers used for DMA.  So we have to
invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
ensure coherency with DMA operations.

If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
writes can sit in the cache, and on unmap, they will be discarded.

Cleaning the cache on unmap is not an option; that too can lead to DMA
buffer corruption in the DMA case.

USB and associated host driver must abide by the DMA API buffer
ownership rules otherwise the result will be data corruption; either
that or USB/host driver people need to have a discussion with the
DMA API authors to remove this sensible "restriction".

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16  9:39                             ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-16  9:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 16, 2010 at 10:07:20AM +0100, Oliver Neukum wrote:
> Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > > Would you care to elaborate on the exact nature of the bug you are fixing?
> > On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> > transfer buffers are corrupted. On our platform, we use PIO mode for control 
> > transfers and DMA for bulk transfers.
> > 
> > The current stack performs dma cache maintenance even for the PIO transfers
> > which leads to the corruption issue. The control buffers are handled by CPU 
> > and they already coherent from CPU point of view.
> 
> How does the mapping corrupt buffers? It might impact performance, but why
> do you see corruption?

On map, buffers are cleaned if they're being used for DMA_TO_DEVICE and
DMA_BIDIRECTIONAL, or invalidated in the case of DMA_FROM_DEVICE.

However, because ARM CPUs can now speculatively prefetch, just leaving it
at that results in corruption of buffers used for DMA.  So we have to
invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
ensure coherency with DMA operations.

If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
writes can sit in the cache, and on unmap, they will be discarded.

Cleaning the cache on unmap is not an option; that too can lead to DMA
buffer corruption in the DMA case.

USB and associated host driver must abide by the DMA API buffer
ownership rules otherwise the result will be data corruption; either
that or USB/host driver people need to have a discussion with the
DMA API authors to remove this sensible "restriction".

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  9:39                             ` Russell King - ARM Linux
@ 2010-02-16 13:32                               ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16 13:32 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Shilimkar, Santosh, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Am Dienstag, 16. Februar 2010 10:39:46 schrieb Russell King - ARM Linux:
> However, because ARM CPUs can now speculatively prefetch, just leaving it
> at that results in corruption of buffers used for DMA.  So we have to
> invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> ensure coherency with DMA operations.
> 
> If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> writes can sit in the cache, and on unmap, they will be discarded.
> 
> Cleaning the cache on unmap is not an option; that too can lead to DMA
> buffer corruption in the DMA case.

I am afraid for these controllers the controller driver must be responsible
for all DMA and cache issues. Indicating the exact requirements to the
upper layer would be a battle already lost.
so the safe choice is not to set has_dma and the generic layer will leave
the issue to the lower level.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16 13:32                               ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16 13:32 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 10:39:46 schrieb Russell King - ARM Linux:
> However, because ARM CPUs can now speculatively prefetch, just leaving it
> at that results in corruption of buffers used for DMA.  So we have to
> invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> ensure coherency with DMA operations.
> 
> If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> writes can sit in the cache, and on unmap, they will be discarded.
> 
> Cleaning the cache on unmap is not an option; that too can lead to DMA
> buffer corruption in the DMA case.

I am afraid for these controllers the controller driver must be responsible
for all DMA and cache issues. Indicating the exact requirements to the
upper layer would be a battle already lost.
so the safe choice is not to set has_dma and the generic layer will leave
the issue to the lower level.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-16 13:32                               ` Oliver Neukum
@ 2010-02-16 13:40                                 ` Shilimkar, Santosh
  -1 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 13:40 UTC (permalink / raw)
  To: Oliver Neukum, Russell King - ARM Linux
  Cc: Catalin Marinas, Pavel Machek, Greg KH, Matthew Dharm,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, linux-usb,
	linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver@neukum.org]
> Sent: Tuesday, February 16, 2010 7:03 PM
> To: Russell King - ARM Linux
> Cc: Shilimkar, Santosh; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov; Ming
> Lei; Sebastian Siewior; linux-usb@vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad, Maulik
> Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 10:39:46 schrieb Russell King - ARM Linux:
> > However, because ARM CPUs can now speculatively prefetch, just leaving it
> > at that results in corruption of buffers used for DMA.  So we have to
> > invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> > ensure coherency with DMA operations.
> >
> > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > writes can sit in the cache, and on unmap, they will be discarded.
> >
> > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > buffer corruption in the DMA case.
> 
> I am afraid for these controllers the controller driver must be responsible
> for all DMA and cache issues. Indicating the exact requirements to the
> upper layer would be a battle already lost.
> so the safe choice is not to set has_dma and the generic layer will leave
> the issue to the lower level.
This means don't use dma at all which will almost kill the performance.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16 13:40                                 ` Shilimkar, Santosh
  0 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 13:40 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:03 PM
> To: Russell King - ARM Linux
> Cc: Shilimkar, Santosh; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov; Ming
> Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad, Maulik
> Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 10:39:46 schrieb Russell King - ARM Linux:
> > However, because ARM CPUs can now speculatively prefetch, just leaving it
> > at that results in corruption of buffers used for DMA.  So we have to
> > invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> > ensure coherency with DMA operations.
> >
> > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > writes can sit in the cache, and on unmap, they will be discarded.
> >
> > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > buffer corruption in the DMA case.
> 
> I am afraid for these controllers the controller driver must be responsible
> for all DMA and cache issues. Indicating the exact requirements to the
> upper layer would be a battle already lost.
> so the safe choice is not to set has_dma and the generic layer will leave
> the issue to the lower level.
This means don't use dma at all which will almost kill the performance.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16 13:40                                 ` Shilimkar, Santosh
@ 2010-02-16 13:46                                   ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16 13:46 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Russell King - ARM Linux, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Am Dienstag, 16. Februar 2010 14:40:45 schrieb Shilimkar, Santosh:
> > > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > > writes can sit in the cache, and on unmap, they will be discarded.
> > >
> > > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > > buffer corruption in the DMA case.
> > 
> > I am afraid for these controllers the controller driver must be responsible
> > for all DMA and cache issues. Indicating the exact requirements to the
> > upper layer would be a battle already lost.
> > so the safe choice is not to set has_dma and the generic layer will leave
> > the issue to the lower level.
> This means don't use dma at all which will almost kill the performance.

Why would you be unable to map a buffer in the hcd driver when you know
that you'll use DMA?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16 13:46                                   ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16 13:46 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 14:40:45 schrieb Shilimkar, Santosh:
> > > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > > writes can sit in the cache, and on unmap, they will be discarded.
> > >
> > > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > > buffer corruption in the DMA case.
> > 
> > I am afraid for these controllers the controller driver must be responsible
> > for all DMA and cache issues. Indicating the exact requirements to the
> > upper layer would be a battle already lost.
> > so the safe choice is not to set has_dma and the generic layer will leave
> > the issue to the lower level.
> This means don't use dma at all which will almost kill the performance.

Why would you be unable to map a buffer in the hcd driver when you know
that you'll use DMA?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-16 13:46                                   ` Oliver Neukum
@ 2010-02-16 14:12                                     ` Shilimkar, Santosh
  -1 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 14:12 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Russell King - ARM Linux, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver@neukum.org]
> Sent: Tuesday, February 16, 2010 7:17 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb@vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 14:40:45 schrieb Shilimkar, Santosh:
> > > > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > > > writes can sit in the cache, and on unmap, they will be discarded.
> > > >
> > > > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > > > buffer corruption in the DMA case.
> > >
> > > I am afraid for these controllers the controller driver must be responsible
> > > for all DMA and cache issues. Indicating the exact requirements to the
> > > upper layer would be a battle already lost.
> > > so the safe choice is not to set has_dma and the generic layer will leave
> > > the issue to the lower level.
> > This means don't use dma at all which will almost kill the performance.
> 
> Why would you be unable to map a buffer in the hcd driver when you know
> that you'll use DMA?
Probably it can be. The USB stack has the dma maintenance code at common 
place for all controllers and hence we were just trying to see if there is 
way to handle that way.

We shall check this possibility

Regards,
Santosh

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16 14:12                                     ` Shilimkar, Santosh
  0 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 14:12 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:17 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 14:40:45 schrieb Shilimkar, Santosh:
> > > > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > > > writes can sit in the cache, and on unmap, they will be discarded.
> > > >
> > > > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > > > buffer corruption in the DMA case.
> > >
> > > I am afraid for these controllers the controller driver must be responsible
> > > for all DMA and cache issues. Indicating the exact requirements to the
> > > upper layer would be a battle already lost.
> > > so the safe choice is not to set has_dma and the generic layer will leave
> > > the issue to the lower level.
> > This means don't use dma at all which will almost kill the performance.
> 
> Why would you be unable to map a buffer in the hcd driver when you know
> that you'll use DMA?
Probably it can be. The USB stack has the dma maintenance code at common 
place for all controllers and hence we were just trying to see if there is 
way to handle that way.

We shall check this possibility

Regards,
Santosh

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16 14:12                                     ` Shilimkar, Santosh
@ 2010-02-16 14:22                                       ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16 14:22 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Russell King - ARM Linux, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > I am afraid for these controllers the controller driver must be responsible
> > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > upper layer would be a battle already lost.
> > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > the issue to the lower level.
> > > This means don't use dma at all which will almost kill the performance.
> > 
> > Why would you be unable to map a buffer in the hcd driver when you know
> > that you'll use DMA?
> Probably it can be. The USB stack has the dma maintenance code at common 
> place for all controllers and hence we were just trying to see if there is 
> way to handle that way.

This is true. If you can find a clean way to describe your requirements
to the generic layer, that would be better. The problem is that we must
not end up with a dozen flags.

Your original patch however kills ehci, ohci and uhci on some architectures.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16 14:22                                       ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-16 14:22 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > I am afraid for these controllers the controller driver must be responsible
> > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > upper layer would be a battle already lost.
> > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > the issue to the lower level.
> > > This means don't use dma at all which will almost kill the performance.
> > 
> > Why would you be unable to map a buffer in the hcd driver when you know
> > that you'll use DMA?
> Probably it can be. The USB stack has the dma maintenance code at common 
> place for all controllers and hence we were just trying to see if there is 
> way to handle that way.

This is true. If you can find a clean way to describe your requirements
to the generic layer, that would be better. The problem is that we must
not end up with a dozen flags.

Your original patch however kills ehci, ohci and uhci on some architectures.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-16 14:22                                       ` Oliver Neukum
@ 2010-02-16 14:45                                         ` Shilimkar, Santosh
  -1 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 14:45 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Russell King - ARM Linux, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver@neukum.org]
> Sent: Tuesday, February 16, 2010 7:53 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb@vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > > I am afraid for these controllers the controller driver must be responsible
> > > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > > upper layer would be a battle already lost.
> > > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > > the issue to the lower level.
> > > > This means don't use dma at all which will almost kill the performance.
> > >
> > > Why would you be unable to map a buffer in the hcd driver when you know
> > > that you'll use DMA?
> > Probably it can be. The USB stack has the dma maintenance code at common
> > place for all controllers and hence we were just trying to see if there is
> > way to handle that way.
> 
> This is true. If you can find a clean way to describe your requirements
> to the generic layer, that would be better. The problem is that we must
> not end up with a dozen flags.
Agree 
> Your original patch however kills ehci, ohci and uhci on some architectures.
Well the patch was making _ONLY_ control transfers use PIO and rest of
the transfer would still use dma. So not sure how much performance impact would
be because of that.
Another issue with that patch is there are few controllers which can't do PIO
at all and hence the patch would broke those controllers.

So we need a clean way to handle it as you described.

Regards,
Santosh



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16 14:45                                         ` Shilimkar, Santosh
  0 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 14:45 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:53 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > > I am afraid for these controllers the controller driver must be responsible
> > > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > > upper layer would be a battle already lost.
> > > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > > the issue to the lower level.
> > > > This means don't use dma at all which will almost kill the performance.
> > >
> > > Why would you be unable to map a buffer in the hcd driver when you know
> > > that you'll use DMA?
> > Probably it can be. The USB stack has the dma maintenance code at common
> > place for all controllers and hence we were just trying to see if there is
> > way to handle that way.
> 
> This is true. If you can find a clean way to describe your requirements
> to the generic layer, that would be better. The problem is that we must
> not end up with a dozen flags.
Agree 
> Your original patch however kills ehci, ohci and uhci on some architectures.
Well the patch was making _ONLY_ control transfers use PIO and rest of
the transfer would still use dma. So not sure how much performance impact would
be because of that.
Another issue with that patch is there are few controllers which can't do PIO
at all and hence the patch would broke those controllers.

So we need a clean way to handle it as you described.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-16 14:45                                         ` Shilimkar, Santosh
@ 2010-02-16 15:44                                           ` Alan Stern
  -1 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-16 15:44 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Oliver Neukum, Russell King - ARM Linux, Catalin Marinas,
	Pavel Machek, Greg KH, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, linux-arm-kernel,
	Mankad, Maulik Ojas

On Tue, 16 Feb 2010, Shilimkar, Santosh wrote:

> > Your original patch however kills ehci, ohci and uhci on some architectures.
> Well the patch was making _ONLY_ control transfers use PIO and rest of
> the transfer would still use dma. So not sure how much performance impact would
> be because of that.
> Another issue with that patch is there are few controllers which can't do PIO
> at all and hence the patch would broke those controllers.

More than "a few"!  None of the EHCI, OHCI, or UHCI controllers used in
Intel-compatible desktop and laptop systems can do PIO.  That's what 
Oliver meant.

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-16 15:44                                           ` Alan Stern
  0 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-16 15:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 16 Feb 2010, Shilimkar, Santosh wrote:

> > Your original patch however kills ehci, ohci and uhci on some architectures.
> Well the patch was making _ONLY_ control transfers use PIO and rest of
> the transfer would still use dma. So not sure how much performance impact would
> be because of that.
> Another issue with that patch is there are few controllers which can't do PIO
> at all and hence the patch would broke those controllers.

More than "a few"!  None of the EHCI, OHCI, or UHCI controllers used in
Intel-compatible desktop and laptop systems can do PIO.  That's what 
Oliver meant.

Alan Stern

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  8:55                         ` Shilimkar, Santosh
@ 2010-02-17  3:21                           ` Ming Lei
  -1 siblings, 0 replies; 352+ messages in thread
From: Ming Lei @ 2010-02-17  3:21 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Oliver Neukum, Catalin Marinas, Pavel Machek, Greg KH,
	Russell King - ARM Linux, Matthew Dharm, Sergei Shtylyov,
	Sebastian Siewior, linux-usb, linux-kernel, linux-arm-kernel,
	Mankad, Maulik Ojas

2010/2/16 Shilimkar, Santosh <santosh.shilimkar@ti.com>:

>> > We have a below temporary patch to get around the issue and probably it
>> > needs to be fixed in the right way in the stack because some controllers
>> > may not have PIO option even for control transfers. (e.g. Synopsis EHCI
>> > controller)

Your temporary patch only removes dma map and umap for setup buffer in
control transfer.

>>
>> This seems wrong to me. Buffers for control transfers may be transfered
>> by DMA, so the caches must be flushed on architectures whose caches
>> are not coherent with respect to DMA.
> Indeed and that's what I mentioned in the comment. But we shouldn't have dma
> cache maintenance operations done for the buffers which would use pio based transfer.
>> Would you care to elaborate on the exact nature of the bug you are fixing?
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control
> transfers and DMA for bulk transfers.

I don't know you mean you use PIO mode for seup buffer only or whole control
transfer(setup sent, data in or data out).  If you mean do not use DMA
for setup sent, data in or data out in a control transfer, your
temporary patch maybe is not enough, right?

>
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU
> and they already coherent from CPU point of view.

-- 
Lei Ming

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  3:21                           ` Ming Lei
  0 siblings, 0 replies; 352+ messages in thread
From: Ming Lei @ 2010-02-17  3:21 UTC (permalink / raw)
  To: linux-arm-kernel

2010/2/16 Shilimkar, Santosh <santosh.shilimkar@ti.com>:

>> > We have a below temporary patch to get around the issue and probably it
>> > needs to be fixed in the right way in the stack because some controllers
>> > may not have PIO option even for control transfers. (e.g. Synopsis EHCI
>> > controller)

Your temporary patch only removes dma map and umap for setup buffer in
control transfer.

>>
>> This seems wrong to me. Buffers for control transfers may be transfered
>> by DMA, so the caches must be flushed on architectures whose caches
>> are not coherent with respect to DMA.
> Indeed and that's what I mentioned in the comment. But we shouldn't have dma
> cache maintenance operations done for the buffers which would use pio based transfer.
>> Would you care to elaborate on the exact nature of the bug you are fixing?
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control
> transfers and DMA for bulk transfers.

I don't know you mean you use PIO mode for seup buffer only or whole control
transfer(setup sent, data in or data out).  If you mean do not use DMA
for setup sent, data in or data out in a control transfer, your
temporary patch maybe is not enough, right?

>
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU
> and they already coherent from CPU point of view.

-- 
Lei Ming

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-16 14:22                                       ` Oliver Neukum
@ 2010-02-17  8:55                                         ` Shilimkar, Santosh
  -1 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-17  8:55 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Russell King - ARM Linux, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas,
	Gadiyar, Anand

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver@neukum.org]
> Sent: Tuesday, February 16, 2010 7:53 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb@vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > > I am afraid for these controllers the controller driver must be responsible
> > > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > > upper layer would be a battle already lost.
> > > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > > the issue to the lower level.
> > > > This means don't use dma at all which will almost kill the performance.
> > >
> > > Why would you be unable to map a buffer in the hcd driver when you know
> > > that you'll use DMA?
> > Probably it can be. The USB stack has the dma maintenance code at common
> > place for all controllers and hence we were just trying to see if there is
> > way to handle that way.
> 
> This is true. If you can find a clean way to describe your requirements
> to the generic layer, that would be better. The problem is that we must
> not end up with a dozen flags.
> 
> Your original patch however kills ehci, ohci and uhci on some architectures.

How about below approach? Controller driver can set 
"uses_pio_for_control" if it can't do dma for control transfer.

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 80995ef..e3eae02 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
 
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma) {
+		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
 			urb->setup_dma = dma_map_single(
 					hcd->self.controller,
 					urb->setup_packet,
@@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
 
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma)
+		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
 			dma_unmap_single(hcd->self.controller, urb->setup_dma,
 					sizeof(struct usb_ctrlrequest),
 					DMA_TO_DEVICE);
diff --git a/include/linux/usb.h b/include/linux/usb.h
index d7ace1b..ba5b0a2 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -329,6 +329,9 @@ struct usb_bus {
 	int busnum;			/* Bus number (in order of reg) */
 	const char *bus_name;		/* stable id (PCI slot_name etc) */
 	u8 uses_dma;			/* Does the host controller use DMA? */
+	u8 uses_pio_for_control;	/* Does the host controller use PIO
+					 * for control tansfers? 
+					 */
 	u8 otg_port;			/* 0, or number of OTG/HNP port */
 	unsigned is_b_host:1;		/* true during some HNP roleswitches */
 	unsigned b_hnp_enable:1;	/* OTG: did A-Host enable HNP? */

Regards,
Santosh

^ permalink raw reply related	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  8:55                                         ` Shilimkar, Santosh
  0 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-17  8:55 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:53 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > > I am afraid for these controllers the controller driver must be responsible
> > > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > > upper layer would be a battle already lost.
> > > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > > the issue to the lower level.
> > > > This means don't use dma at all which will almost kill the performance.
> > >
> > > Why would you be unable to map a buffer in the hcd driver when you know
> > > that you'll use DMA?
> > Probably it can be. The USB stack has the dma maintenance code at common
> > place for all controllers and hence we were just trying to see if there is
> > way to handle that way.
> 
> This is true. If you can find a clean way to describe your requirements
> to the generic layer, that would be better. The problem is that we must
> not end up with a dozen flags.
> 
> Your original patch however kills ehci, ohci and uhci on some architectures.

How about below approach? Controller driver can set 
"uses_pio_for_control" if it can't do dma for control transfer.

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 80995ef..e3eae02 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
 
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma) {
+		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
 			urb->setup_dma = dma_map_single(
 					hcd->self.controller,
 					urb->setup_packet,
@@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
 
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma)
+		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
 			dma_unmap_single(hcd->self.controller, urb->setup_dma,
 					sizeof(struct usb_ctrlrequest),
 					DMA_TO_DEVICE);
diff --git a/include/linux/usb.h b/include/linux/usb.h
index d7ace1b..ba5b0a2 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -329,6 +329,9 @@ struct usb_bus {
 	int busnum;			/* Bus number (in order of reg) */
 	const char *bus_name;		/* stable id (PCI slot_name etc) */
 	u8 uses_dma;			/* Does the host controller use DMA? */
+	u8 uses_pio_for_control;	/* Does the host controller use PIO
+					 * for control tansfers? 
+					 */
 	u8 otg_port;			/* 0, or number of OTG/HNP port */
 	unsigned is_b_host:1;		/* true during some HNP roleswitches */
 	unsigned b_hnp_enable:1;	/* OTG: did A-Host enable HNP? */

Regards,
Santosh

^ permalink raw reply related	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  8:22                       ` Oliver Neukum
@ 2010-02-17  9:05                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17  9:05 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Shilimkar, Santosh, Matthew Dharm, Russell King - ARM Linux,
	Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> This seems wrong to me. Buffers for control transfers may be
> transfered
> by DMA, so the caches must be flushed on architectures whose caches
> are not coherent with respect to DMA.
> 
> Would you care to elaborate on the exact nature of the bug you are
> fixing?

I missed part of this thread, so forgive me if I'm a bit off here, but
if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
this is a long solved issue on other archs such as ppc (and I _think_
sparc).

The way we do it, at least on powerpc which is PIPT, is to keep track on
a per-page basis, whether a given page is clean for execution using
PG_arch1 bit. This bit is cleared when a new page is popped into the
page cache, and we clear it from flush_dcache_page() iirc (you may want
to dbl check I don't have the code at hand right now, or rather, I do
but I'm to lazy to look right now :-)

Any page with that not set is mapped into userspace with execute
permission disabled. We do the flush and set PG_arch1 on the first exec
fault to that page.

Cheers,
Ben.
 


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:05                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17  9:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> This seems wrong to me. Buffers for control transfers may be
> transfered
> by DMA, so the caches must be flushed on architectures whose caches
> are not coherent with respect to DMA.
> 
> Would you care to elaborate on the exact nature of the bug you are
> fixing?

I missed part of this thread, so forgive me if I'm a bit off here, but
if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
this is a long solved issue on other archs such as ppc (and I _think_
sparc).

The way we do it, at least on powerpc which is PIPT, is to keep track on
a per-page basis, whether a given page is clean for execution using
PG_arch1 bit. This bit is cleared when a new page is popped into the
page cache, and we clear it from flush_dcache_page() iirc (you may want
to dbl check I don't have the code at hand right now, or rather, I do
but I'm to lazy to look right now :-)

Any page with that not set is mapped into userspace with execute
permission disabled. We do the flush and set PG_arch1 on the first exec
fault to that page.

Cheers,
Ben.
 

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  8:55                                         ` Shilimkar, Santosh
@ 2010-02-17  9:10                                           ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17  9:10 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Russell King - ARM Linux, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas,
	Gadiyar, Anand

Am Mittwoch, 17. Februar 2010 09:55:08 schrieb Shilimkar, Santosh:
> > Your original patch however kills ehci, ohci and uhci on some architectures.
> 
> How about below approach? Controller driver can set 
> "uses_pio_for_control" if it can't do dma for control transfer.
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 80995ef..e3eae02 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  
>         if (usb_endpoint_xfer_control(&urb->ep->desc)
>             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -               if (hcd->self.uses_dma) {
> +               if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {

It is not elegant to describe exceptions. It would be better, if you split up
the flag into two flags, called uses_dma_for_ordinary_transfers and
uses_dma_for control_transfers. Doing so also makes sure you look at
all hcd drivers ;-)

And the tests become straightforward. And please add a detailed comment
to explain why this differentiation is needed on ARM.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:10                                           ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17  9:10 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 09:55:08 schrieb Shilimkar, Santosh:
> > Your original patch however kills ehci, ohci and uhci on some architectures.
> 
> How about below approach? Controller driver can set 
> "uses_pio_for_control" if it can't do dma for control transfer.
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 80995ef..e3eae02 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  
>         if (usb_endpoint_xfer_control(&urb->ep->desc)
>             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -               if (hcd->self.uses_dma) {
> +               if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {

It is not elegant to describe exceptions. It would be better, if you split up
the flag into two flags, called uses_dma_for_ordinary_transfers and
uses_dma_for control_transfers. Doing so also makes sure you look at
all hcd drivers ;-)

And the tests become straightforward. And please add a detailed comment
to explain why this differentiation is needed on ARM.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  9:05                         ` Benjamin Herrenschmidt
@ 2010-02-17  9:15                           ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17  9:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Shilimkar, Santosh, Matthew Dharm, Russell King - ARM Linux,
	Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

Am Mittwoch, 17. Februar 2010 10:05:43 schrieb Benjamin Herrenschmidt:
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I think
> sparc).

We should have changed the subject line.

There's a second problem. It turns out that on ARM
mapping for DMA must not be done if PIO will be used. Some HCDs
use PIO for some transfers but DMA for others. The generic layer
must learn about this.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:15                           ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17  9:15 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 10:05:43 schrieb Benjamin Herrenschmidt:
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I think
> sparc).

We should have changed the subject line.

There's a second problem. It turns out that on ARM
mapping for DMA must not be done if PIO will be used. Some HCDs
use PIO for some transfers but DMA for others. The generic layer
must learn about this.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-17  9:10                                           ` Oliver Neukum
@ 2010-02-17  9:17                                             ` Shilimkar, Santosh
  -1 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-17  9:17 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Russell King - ARM Linux, Catalin Marinas, Pavel Machek, Greg KH,
	Matthew Dharm, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas,
	Gadiyar, Anand

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver@neukum.org]
> Sent: Wednesday, February 17, 2010 2:41 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb@vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas; Gadiyar, Anand
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Mittwoch, 17. Februar 2010 09:55:08 schrieb Shilimkar, Santosh:
> > > Your original patch however kills ehci, ohci and uhci on some architectures.
> >
> > How about below approach? Controller driver can set
> > "uses_pio_for_control" if it can't do dma for control transfer.
> >
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 80995ef..e3eae02 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
> >
> >         if (usb_endpoint_xfer_control(&urb->ep->desc)
> >             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -               if (hcd->self.uses_dma) {
> > +               if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
> 
> It is not elegant to describe exceptions. It would be better, if you split up
> the flag into two flags, called uses_dma_for_ordinary_transfers and
> uses_dma_for control_transfers. Doing so also makes sure you look at
> all hcd drivers ;-)
> 
Good point. Negative checks are any way not elegant
> And the tests become straightforward. And please add a detailed comment
> to explain why this differentiation is needed on ARM.
OK. I shall create a patch with description about the problem.

Thanks for feedback!!

Regards,
Santosh


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:17                                             ` Shilimkar, Santosh
  0 siblings, 0 replies; 352+ messages in thread
From: Shilimkar, Santosh @ 2010-02-17  9:17 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Wednesday, February 17, 2010 2:41 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas; Gadiyar, Anand
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Mittwoch, 17. Februar 2010 09:55:08 schrieb Shilimkar, Santosh:
> > > Your original patch however kills ehci, ohci and uhci on some architectures.
> >
> > How about below approach? Controller driver can set
> > "uses_pio_for_control" if it can't do dma for control transfer.
> >
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 80995ef..e3eae02 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
> >
> >         if (usb_endpoint_xfer_control(&urb->ep->desc)
> >             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -               if (hcd->self.uses_dma) {
> > +               if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
> 
> It is not elegant to describe exceptions. It would be better, if you split up
> the flag into two flags, called uses_dma_for_ordinary_transfers and
> uses_dma_for control_transfers. Doing so also makes sure you look at
> all hcd drivers ;-)
> 
Good point. Negative checks are any way not elegant
> And the tests become straightforward. And please add a detailed comment
> to explain why this differentiation is needed on ARM.
OK. I shall create a patch with description about the problem.

Thanks for feedback!!

Regards,
Santosh

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  9:15                           ` Oliver Neukum
@ 2010-02-17  9:40                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17  9:40 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Shilimkar, Santosh, Matthew Dharm, Russell King - ARM Linux,
	Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 10:15 +0100, Oliver Neukum wrote:
> We should have changed the subject line.
> 
> There's a second problem. It turns out that on ARM
> mapping for DMA must not be done if PIO will be used. Some HCDs
> use PIO for some transfers but DMA for others. The generic layer
> must learn about this. 

Ah, that makes a lot of sense and the same problem would happen on
any non-DMA coherent architecture, including some embedded ppc's.

I can see why the dma unmap would invalidate the dcache and blow
away the PIO.

What bugs me here is that the dma_map_* operation should always
be done at the lowest level, ie, the actual HCD driver, and thus
it should be up to the HCD to decide whether to dma_map or not
depending on whether it's going to do DMA or not. I haven't
scrutinized USB lately but if that isn't the case and the dma_map_*
operations are done behind your back by the USB core then that needs to
be changed in a way or another, or hooked at least.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:40                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 10:15 +0100, Oliver Neukum wrote:
> We should have changed the subject line.
> 
> There's a second problem. It turns out that on ARM
> mapping for DMA must not be done if PIO will be used. Some HCDs
> use PIO for some transfers but DMA for others. The generic layer
> must learn about this. 

Ah, that makes a lot of sense and the same problem would happen on
any non-DMA coherent architecture, including some embedded ppc's.

I can see why the dma unmap would invalidate the dcache and blow
away the PIO.

What bugs me here is that the dma_map_* operation should always
be done at the lowest level, ie, the actual HCD driver, and thus
it should be up to the HCD to decide whether to dma_map or not
depending on whether it's going to do DMA or not. I haven't
scrutinized USB lately but if that isn't the case and the dma_map_*
operations are done behind your back by the USB core then that needs to
be changed in a way or another, or hooked at least.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-08 10:03                 ` Andy Green
@ 2010-02-17  9:50                   ` Sascha Hauer
  -1 siblings, 0 replies; 352+ messages in thread
From: Sascha Hauer @ 2010-02-17  9:50 UTC (permalink / raw)
  To: Andy Green
  Cc: Catalin Marinas, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Mon, Feb 08, 2010 at 11:03:14AM +0100, Andy Green wrote:
> On 02/08/10 10:51, Somebody in the thread at some point said:
>
>> We could of course flush the caches every time we get a page fault but
>> that's far from optimal, especially since DMA-capable drivers to do not
>> pollute the D-cache and don't need this extra flushing. Note that the
>> recent ARM processors have PIPT caches but separate for I and D and it's
>> the PIO drivers that pollute the D-cache.
>
> Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the  
> moment, for lack of any platform DMA support of its unusual DMA engine.

The EHCI module has its own DMA engine and has nothing to do with the
SDMA engine.

Sascha


-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:50                   ` Sascha Hauer
  0 siblings, 0 replies; 352+ messages in thread
From: Sascha Hauer @ 2010-02-17  9:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 08, 2010 at 11:03:14AM +0100, Andy Green wrote:
> On 02/08/10 10:51, Somebody in the thread at some point said:
>
>> We could of course flush the caches every time we get a page fault but
>> that's far from optimal, especially since DMA-capable drivers to do not
>> pollute the D-cache and don't need this extra flushing. Note that the
>> recent ARM processors have PIPT caches but separate for I and D and it's
>> the PIO drivers that pollute the D-cache.
>
> Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the  
> moment, for lack of any platform DMA support of its unusual DMA engine.

The EHCI module has its own DMA engine and has nothing to do with the
SDMA engine.

Sascha


-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  9:05                         ` Benjamin Herrenschmidt
@ 2010-02-17  9:55                           ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17  9:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Oliver Neukum, Shilimkar, Santosh, Matthew Dharm, Ming Lei,
	Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, Feb 17, 2010 at 08:05:43PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> > 
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

Nope.  It's to do with mapping a buffer for DMA, and then doing PIO
reads/writes to it.

With speculative prefetches, you have to deal with cache coherency with
hardware DMA on DMA unmap.  If you've written to the buffer in violation
of the DMA API buffer ownership rules, then your writes get thrown away
resulting in immediate data corruption.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:55                           ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 17, 2010 at 08:05:43PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> > 
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

Nope.  It's to do with mapping a buffer for DMA, and then doing PIO
reads/writes to it.

With speculative prefetches, you have to deal with cache coherency with
hardware DMA on DMA unmap.  If you've written to the buffer in violation
of the DMA API buffer ownership rules, then your writes get thrown away
resulting in immediate data corruption.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  9:50                   ` Sascha Hauer
@ 2010-02-17  9:57                     ` Andy Green
  -1 siblings, 0 replies; 352+ messages in thread
From: Andy Green @ 2010-02-17  9:57 UTC (permalink / raw)
  To: Sascha Hauer
  Cc: Catalin Marinas, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On 02/17/10 10:50, Somebody in the thread at some point said:
> On Mon, Feb 08, 2010 at 11:03:14AM +0100, Andy Green wrote:
>> On 02/08/10 10:51, Somebody in the thread at some point said:
>>
>>> We could of course flush the caches every time we get a page fault but
>>> that's far from optimal, especially since DMA-capable drivers to do not
>>> pollute the D-cache and don't need this extra flushing. Note that the
>>> recent ARM processors have PIPT caches but separate for I and D and it's
>>> the PIO drivers that pollute the D-cache.
>>
>> Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the
>> moment, for lack of any platform DMA support of its unusual DMA engine.
>
> The EHCI module has its own DMA engine and has nothing to do with the
> SDMA engine.

You're right, my mistake.  iMX31 MMC is PIO due to no SDMA support though.

-Andy

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17  9:57                     ` Andy Green
  0 siblings, 0 replies; 352+ messages in thread
From: Andy Green @ 2010-02-17  9:57 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/17/10 10:50, Somebody in the thread at some point said:
> On Mon, Feb 08, 2010 at 11:03:14AM +0100, Andy Green wrote:
>> On 02/08/10 10:51, Somebody in the thread at some point said:
>>
>>> We could of course flush the caches every time we get a page fault but
>>> that's far from optimal, especially since DMA-capable drivers to do not
>>> pollute the D-cache and don't need this extra flushing. Note that the
>>> recent ARM processors have PIPT caches but separate for I and D and it's
>>> the PIO drivers that pollute the D-cache.
>>
>> Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the
>> moment, for lack of any platform DMA support of its unusual DMA engine.
>
> The EHCI module has its own DMA engine and has nothing to do with the
> SDMA engine.

You're right, my mistake.  iMX31 MMC is PIO due to no SDMA support though.

-Andy

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  9:55                           ` Russell King - ARM Linux
@ 2010-02-17 10:05                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 10:05 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Oliver Neukum, Shilimkar, Santosh, Matthew Dharm, Ming Lei,
	Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 09:55 +0000, Russell King - ARM Linux wrote:
> Nope.  It's to do with mapping a buffer for DMA, and then doing PIO
> reads/writes to it.
> 
> With speculative prefetches, you have to deal with cache coherency with
> hardware DMA on DMA unmap.  If you've written to the buffer in violation
> of the DMA API buffer ownership rules, then your writes get thrown away
> resulting in immediate data corruption. 

Right, and this exact same problem will bite some embedded powerpc
too I suppose :-)

Hrm... actually not :-) We don't do the invalidate at unmap time
today because we know 44x have such a broken prefetcher that we disable
it ... interesting considering that there are machines around that
do non-coherent DMA with 750's style chips who -do- have a prefetcher...
damn, we have a bug :-)

In any case, same problem here.

See my reply to Oliver. Basically, the problem boils down to the
dma_map/unmap being done at the wrong layer. The driver should
simply not do these if it's going to do PIO over that range.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 10:05                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 10:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 09:55 +0000, Russell King - ARM Linux wrote:
> Nope.  It's to do with mapping a buffer for DMA, and then doing PIO
> reads/writes to it.
> 
> With speculative prefetches, you have to deal with cache coherency with
> hardware DMA on DMA unmap.  If you've written to the buffer in violation
> of the DMA API buffer ownership rules, then your writes get thrown away
> resulting in immediate data corruption. 

Right, and this exact same problem will bite some embedded powerpc
too I suppose :-)

Hrm... actually not :-) We don't do the invalidate at unmap time
today because we know 44x have such a broken prefetcher that we disable
it ... interesting considering that there are machines around that
do non-coherent DMA with 750's style chips who -do- have a prefetcher...
damn, we have a bug :-)

In any case, same problem here.

See my reply to Oliver. Basically, the problem boils down to the
dma_map/unmap being done at the wrong layer. The driver should
simply not do these if it's going to do PIO over that range.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  9:40                             ` Benjamin Herrenschmidt
@ 2010-02-17 10:09                               ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17 10:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Shilimkar, Santosh, Matthew Dharm, Russell King - ARM Linux,
	Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

Am Mittwoch, 17. Februar 2010 10:40:09 schrieb Benjamin Herrenschmidt:
> What bugs me here is that the dma_map_* operation should always
> be done at the lowest level, ie, the actual HCD driver, and thus
> it should be up to the HCD to decide whether to dma_map or not
> depending on whether it's going to do DMA or not. I haven't
> scrutinized USB lately but if that isn't the case and the dma_map_*
> operations are done behind your back by the USB core then that needs to
> be changed in a way or another, or hooked at least.

No problem here. USB core does the mapping only if the low-level driver
so requests. The only exception is in usb_buffer_alloc(), but that boils
down to dma_alloc_coherent()

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 10:09                               ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17 10:09 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 10:40:09 schrieb Benjamin Herrenschmidt:
> What bugs me here is that the dma_map_* operation should always
> be done at the lowest level, ie, the actual HCD driver, and thus
> it should be up to the HCD to decide whether to dma_map or not
> depending on whether it's going to do DMA or not. I haven't
> scrutinized USB lately but if that isn't the case and the dma_map_*
> operations are done behind your back by the USB core then that needs to
> be changed in a way or another, or hooked at least.

No problem here. USB core does the mapping only if the low-level driver
so requests. The only exception is in usb_buffer_alloc(), but that boils
down to dma_alloc_coherent()

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 10:09                               ` Oliver Neukum
@ 2010-02-17 10:18                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 10:18 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Shilimkar, Santosh, Matthew Dharm, Russell King - ARM Linux,
	Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 11:09 +0100, Oliver Neukum wrote:
> 
> No problem here. USB core does the mapping only if the low-level driver
> so requests. The only exception is in usb_buffer_alloc(), but that boils
> down to dma_alloc_coherent() 

Allright, so why do we need to "fix" anything ? Or is the whole thread
moot ? :-)

It's pretty clear that between dma_map* and subsequent unmap, the memory
is owned by the device and must not be touched by the CPU. If that is
violated, then we have a driver bug.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 10:18                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 10:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 11:09 +0100, Oliver Neukum wrote:
> 
> No problem here. USB core does the mapping only if the low-level driver
> so requests. The only exception is in usb_buffer_alloc(), but that boils
> down to dma_alloc_coherent() 

Allright, so why do we need to "fix" anything ? Or is the whole thread
moot ? :-)

It's pretty clear that between dma_map* and subsequent unmap, the memory
is owned by the device and must not be touched by the CPU. If that is
violated, then we have a driver bug.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 10:18                                 ` Benjamin Herrenschmidt
@ 2010-02-17 10:23                                   ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17 10:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Shilimkar, Santosh, Matthew Dharm, Russell King - ARM Linux,
	Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

Am Mittwoch, 17. Februar 2010 11:18:01 schrieb Benjamin Herrenschmidt:
> > No problem here. USB core does the mapping only if the low-level driver
> > so requests. The only exception is in usb_buffer_alloc(), but that boils
> > down to dma_alloc_coherent() 
> 
> Allright, so why do we need to "fix" anything ? Or is the whole thread
> moot ? :-)

The request a low-level driver does is all or nothing. Either DMA
issues have to be handled by that driver alone, or a finer-grained
description of the DMA requirements is needed. A fix using the latter
approach is being worked on.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 10:23                                   ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-17 10:23 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 11:18:01 schrieb Benjamin Herrenschmidt:
> > No problem here. USB core does the mapping only if the low-level driver
> > so requests. The only exception is in usb_buffer_alloc(), but that boils
> > down to dma_alloc_coherent() 
> 
> Allright, so why do we need to "fix" anything ? Or is the whole thread
> moot ? :-)

The request a low-level driver does is all or nothing. Either DMA
issues have to be handled by that driver alone, or a finer-grained
description of the DMA requirements is needed. A fix using the latter
approach is being worked on.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 10:23                                   ` Oliver Neukum
@ 2010-02-17 12:15                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 12:15 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Shilimkar, Santosh, Matthew Dharm, Russell King - ARM Linux,
	Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov, Catalin Marinas,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 11:23 +0100, Oliver Neukum wrote:
> 
> The request a low-level driver does is all or nothing. Either DMA
> issues have to be handled by that driver alone, or a finer-grained
> description of the DMA requirements is needed. A fix using the latter
> approach is being worked on. 

Well, that's what I'm trying to understand.

IE. It's a pretty strong rule ... don't do CPU accesses between dma_map
and unmap. So it's all in driver land at that stage. I'm not sure how
the DMA requirements get into the picture here. IE. That rule is
globally true. It's not going to hurt just non-coherent archs, it's
going to hurt anybody using swiotlb too... So I don't see you need more
info about the DMA requirements, but maybe I did miss something :-)

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 12:15                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 12:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 11:23 +0100, Oliver Neukum wrote:
> 
> The request a low-level driver does is all or nothing. Either DMA
> issues have to be handled by that driver alone, or a finer-grained
> description of the DMA requirements is needed. A fix using the latter
> approach is being worked on. 

Well, that's what I'm trying to understand.

IE. It's a pretty strong rule ... don't do CPU accesses between dma_map
and unmap. So it's all in driver land at that stage. I'm not sure how
the DMA requirements get into the picture here. IE. That rule is
globally true. It's not going to hurt just non-coherent archs, it's
going to hurt anybody using swiotlb too... So I don't see you need more
info about the DMA requirements, but maybe I did miss something :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  9:39                             ` Russell King - ARM Linux
@ 2010-02-17 12:29                               ` Jamie Lokier
  -1 siblings, 0 replies; 352+ messages in thread
From: Jamie Lokier @ 2010-02-17 12:29 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Oliver Neukum, Matthew Dharm, Ming Lei, Mankad, Maulik Ojas,
	Sergei Shtylyov, Catalin Marinas, Sebastian Siewior, linux-usb,
	linux-kernel, Shilimkar, Santosh, Pavel Machek, Greg KH,
	linux-arm-kernel

Russell King - ARM Linux wrote:
> On Tue, Feb 16, 2010 at 10:07:20AM +0100, Oliver Neukum wrote:
> > Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > > > Would you care to elaborate on the exact nature of the bug you are fixing?
> > > On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> > > transfer buffers are corrupted. On our platform, we use PIO mode for control 
> > > transfers and DMA for bulk transfers.
> > > 
> > > The current stack performs dma cache maintenance even for the PIO transfers
> > > which leads to the corruption issue. The control buffers are handled by CPU 
> > > and they already coherent from CPU point of view.
> > 
> > How does the mapping corrupt buffers? It might impact performance, but why
> > do you see corruption?
> 
> On map, buffers are cleaned if they're being used for DMA_TO_DEVICE and
> DMA_BIDIRECTIONAL, or invalidated in the case of DMA_FROM_DEVICE.
> 
> However, because ARM CPUs can now speculatively prefetch, just leaving it
> at that results in corruption of buffers used for DMA.  So we have to
> invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> ensure coherency with DMA operations.
> 
> If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> writes can sit in the cache, and on unmap, they will be discarded.
> 
> Cleaning the cache on unmap is not an option; that too can lead to DMA
> buffer corruption in the DMA case.

Provided the buffers are cleaned on map for
DMA_TO_DEVICE/DMA_BIDIRECTIONAL, I don't see how cleaning on unmap for
DMA_FROM_DEVICE/DMA_BIDIRECTIONAL can cause corruption.  The only way
to get dirty cache lines while mapped is if the CPU did PIO to them.
If it was real DMA, the second clean should be a no-op.  (Assume it's
all one or the other).

Can you explain why cleanining the cache on unmap (as well as map, in
DMA_BIDIRECTIONAL case) is not an option?  Just curious, because I
don't see what would go wrong.

> USB and associated host driver must abide by the DMA API buffer
> ownership rules otherwise the result will be data corruption; either
> that or USB/host driver people need to have a discussion with the
> DMA API authors to remove this sensible "restriction".

Just in case my question gives the wrong impression, I agree that the
DMA API must be followed. Additional flushes/cleans are not good for
performance either.

-- Jamie

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 12:29                               ` Jamie Lokier
  0 siblings, 0 replies; 352+ messages in thread
From: Jamie Lokier @ 2010-02-17 12:29 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Tue, Feb 16, 2010 at 10:07:20AM +0100, Oliver Neukum wrote:
> > Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > > > Would you care to elaborate on the exact nature of the bug you are fixing?
> > > On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> > > transfer buffers are corrupted. On our platform, we use PIO mode for control 
> > > transfers and DMA for bulk transfers.
> > > 
> > > The current stack performs dma cache maintenance even for the PIO transfers
> > > which leads to the corruption issue. The control buffers are handled by CPU 
> > > and they already coherent from CPU point of view.
> > 
> > How does the mapping corrupt buffers? It might impact performance, but why
> > do you see corruption?
> 
> On map, buffers are cleaned if they're being used for DMA_TO_DEVICE and
> DMA_BIDIRECTIONAL, or invalidated in the case of DMA_FROM_DEVICE.
> 
> However, because ARM CPUs can now speculatively prefetch, just leaving it
> at that results in corruption of buffers used for DMA.  So we have to
> invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> ensure coherency with DMA operations.
> 
> If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> writes can sit in the cache, and on unmap, they will be discarded.
> 
> Cleaning the cache on unmap is not an option; that too can lead to DMA
> buffer corruption in the DMA case.

Provided the buffers are cleaned on map for
DMA_TO_DEVICE/DMA_BIDIRECTIONAL, I don't see how cleaning on unmap for
DMA_FROM_DEVICE/DMA_BIDIRECTIONAL can cause corruption.  The only way
to get dirty cache lines while mapped is if the CPU did PIO to them.
If it was real DMA, the second clean should be a no-op.  (Assume it's
all one or the other).

Can you explain why cleanining the cache on unmap (as well as map, in
DMA_BIDIRECTIONAL case) is not an option?  Just curious, because I
don't see what would go wrong.

> USB and associated host driver must abide by the DMA API buffer
> ownership rules otherwise the result will be data corruption; either
> that or USB/host driver people need to have a discussion with the
> DMA API authors to remove this sensible "restriction".

Just in case my question gives the wrong impression, I agree that the
DMA API must be followed. Additional flushes/cleans are not good for
performance either.

-- Jamie

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17  9:05                         ` Benjamin Herrenschmidt
@ 2010-02-17 15:27                           ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Oliver Neukum, Shilimkar, Santosh, Matthew Dharm,
	Russell King - ARM Linux, Ming Lei, Mankad, Maulik Ojas,
	Sergei Shtylyov, Sebastian Siewior, linux-usb, linux-kernel,
	Pavel Machek, Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> > 
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 15:27                           ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> > 
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: Re: USB mass storage and ARM cache coherency
  2010-02-17  9:05                         ` Benjamin Herrenschmidt
@ 2010-02-17 15:27                           ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:27 UTC (permalink / raw)
  To: benh
  Cc: oliver, santosh.shilimkar, mdharm-kernel, linux, tom.leiming,
	x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel, pavel,
	greg, linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 15:27                           ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: Re: USB mass storage and ARM cache coherency
  2010-02-17  9:05                         ` Benjamin Herrenschmidt
@ 2010-02-17 15:39                           ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:39 UTC (permalink / raw)
  To: benh
  Cc: oliver, santosh.shilimkar, mdharm-kernel, linux, tom.leiming,
	x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel, pavel,
	greg, linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 15:39                           ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: Re: USB mass storage and ARM cache coherency
  2010-02-17  9:05                         ` Benjamin Herrenschmidt
@ 2010-02-17 15:40                           ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:40 UTC (permalink / raw)
  To: benh
  Cc: oliver, santosh.shilimkar, mdharm-kernel, linux, tom.leiming,
	x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel, pavel,
	greg, linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 15:40                           ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: Re: USB mass storage and ARM cache coherency
  2010-02-17  9:05                         ` Benjamin Herrenschmidt
@ 2010-02-17 15:40                           ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:40 UTC (permalink / raw)
  To: benh
  Cc: oliver, santosh.shilimkar, mdharm-kernel, linux, tom.leiming,
	x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel, pavel,
	greg, linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 15:40                           ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> >=20
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
>=20
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track on
> a per-page basis, whether a given page is clean for execution using
> PG_arch1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to dbl check I don't have the code at hand right now, or rather, I do
> but I'm to lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in this driver to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch1 bit.

--=20
Catalin
-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: Re: USB mass storage and ARM cache coherency
  2010-02-17 15:40                           ` Catalin Marinas
@ 2010-02-17 16:19                             ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 16:19 UTC (permalink / raw)
  To: benh
  Cc: mdharm-kernel, linux-usb, linux, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	pavel, greg, linux-arm-kernel

SORRY - one more message to apologise for the multiple reposts (and
automatically appended legal disclaimer). I've been moved to Exchange
2007 and trying to use Evolution + Exchange-MAPI. It looks like it went
terribly wrong.

Catalin


On Wed, 2010-02-17 at 15:40 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > > This seems wrong to me. Buffers for control transfers may be
> > > transfered
> > > by DMA, so the caches must be flushed on architectures whose caches
> > > are not coherent with respect to DMA.
> > >=20
> > > Would you care to elaborate on the exact nature of the bug you are
> > > fixing?
> >=20
> > I missed part of this thread, so forgive me if I'm a bit off here, but
> > if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> > this is a long solved issue on other archs such as ppc (and I _think_
> > sparc).
> 
> The thread I started was indeed regarding I/D cache coherency and PIO.
> But it diverged into DMA issues a few days ago (should have been a new
> thread).
> 
> > The way we do it, at least on powerpc which is PIPT, is to keep track on
> > a per-page basis, whether a given page is clean for execution using
> > PG_arch1 bit. This bit is cleared when a new page is popped into the
> > page cache, and we clear it from flush_dcache_page() iirc (you may want
> > to dbl check I don't have the code at hand right now, or rather, I do
> > but I'm to lazy to look right now :-)
> 
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in this driver to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch1 bit.
> 
> --=20
> Catalin
> -- 
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose
> the contents to any other person, use it for any purpose, or store or
> copy the information in any medium.  Thank you.
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 16:19                             ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

SORRY - one more message to apologise for the multiple reposts (and
automatically appended legal disclaimer). I've been moved to Exchange
2007 and trying to use Evolution + Exchange-MAPI. It looks like it went
terribly wrong.

Catalin


On Wed, 2010-02-17 at 15:40 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > > This seems wrong to me. Buffers for control transfers may be
> > > transfered
> > > by DMA, so the caches must be flushed on architectures whose caches
> > > are not coherent with respect to DMA.
> > >=20
> > > Would you care to elaborate on the exact nature of the bug you are
> > > fixing?
> >=20
> > I missed part of this thread, so forgive me if I'm a bit off here, but
> > if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> > this is a long solved issue on other archs such as ppc (and I _think_
> > sparc).
> 
> The thread I started was indeed regarding I/D cache coherency and PIO.
> But it diverged into DMA issues a few days ago (should have been a new
> thread).
> 
> > The way we do it, at least on powerpc which is PIPT, is to keep track on
> > a per-page basis, whether a given page is clean for execution using
> > PG_arch1 bit. This bit is cleared when a new page is popped into the
> > page cache, and we clear it from flush_dcache_page() iirc (you may want
> > to dbl check I don't have the code at hand right now, or rather, I do
> > but I'm to lazy to look right now :-)
> 
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in this driver to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch1 bit.
> 
> --=20
> Catalin
> -- 
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose
> the contents to any other person, use it for any purpose, or store or
> copy the information in any medium.  Thank you.
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: Re: Re: USB mass storage and ARM cache coherency
  2010-02-17 15:40                           ` Catalin Marinas
@ 2010-02-17 16:19                             ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 16:19 UTC (permalink / raw)
  To: benh
  Cc: mdharm-kernel, linux-usb, linux, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	pavel, greg, linux-arm-kernel

SORRY - one more message to apologise for the multiple reposts (and
automatically appended legal disclaimer). I've been moved to Exchange
2007 and trying to use Evolution + Exchange-MAPI. It looks like it went
terribly wrong.

Catalin


On Wed, 2010-02-17 at 15:40 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > > This seems wrong to me. Buffers for control transfers may be
> > > transfered
> > > by DMA, so the caches must be flushed on architectures whose caches
> > > are not coherent with respect to DMA.
> > >=3D20
> > > Would you care to elaborate on the exact nature of the bug you are
> > > fixing?
> >=3D20
> > I missed part of this thread, so forgive me if I'm a bit off here, but
> > if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> > this is a long solved issue on other archs such as ppc (and I _think_
> > sparc).
>=20
> The thread I started was indeed regarding I/D cache coherency and PIO.
> But it diverged into DMA issues a few days ago (should have been a new
> thread).
>=20
> > The way we do it, at least on powerpc which is PIPT, is to keep track o=
n
> > a per-page basis, whether a given page is clean for execution using
> > PG_arch1 bit. This bit is cleared when a new page is popped into the
> > page cache, and we clear it from flush_dcache_page() iirc (you may want
> > to dbl check I don't have the code at hand right now, or rather, I do
> > but I'm to lazy to look right now :-)
>=20
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in this driver to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch1 bit.
>=20
> --=3D20
> Catalin
> --=20
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose
> the contents to any other person, use it for any purpose, or store or
> copy the information in any medium.  Thank you.
>=20
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 16:19                             ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-17 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

SORRY - one more message to apologise for the multiple reposts (and
automatically appended legal disclaimer). I've been moved to Exchange
2007 and trying to use Evolution + Exchange-MAPI. It looks like it went
terribly wrong.

Catalin


On Wed, 2010-02-17 at 15:40 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > > This seems wrong to me. Buffers for control transfers may be
> > > transfered
> > > by DMA, so the caches must be flushed on architectures whose caches
> > > are not coherent with respect to DMA.
> > >=3D20
> > > Would you care to elaborate on the exact nature of the bug you are
> > > fixing?
> >=3D20
> > I missed part of this thread, so forgive me if I'm a bit off here, but
> > if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> > this is a long solved issue on other archs such as ppc (and I _think_
> > sparc).
>=20
> The thread I started was indeed regarding I/D cache coherency and PIO.
> But it diverged into DMA issues a few days ago (should have been a new
> thread).
>=20
> > The way we do it, at least on powerpc which is PIPT, is to keep track o=
n
> > a per-page basis, whether a given page is clean for execution using
> > PG_arch1 bit. This bit is cleared when a new page is popped into the
> > page cache, and we clear it from flush_dcache_page() iirc (you may want
> > to dbl check I don't have the code at hand right now, or rather, I do
> > but I'm to lazy to look right now :-)
>=20
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in this driver to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch1 bit.
>=20
> --=3D20
> Catalin
> --=20
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose
> the contents to any other person, use it for any purpose, or store or
> copy the information in any medium.  Thank you.
>=20
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-17  8:55                                         ` Shilimkar, Santosh
@ 2010-02-17 17:02                                           ` Alan Stern
  -1 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-17 17:02 UTC (permalink / raw)
  To: Shilimkar, Santosh
  Cc: Oliver Neukum, Russell King - ARM Linux, Catalin Marinas,
	Pavel Machek, Greg KH, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, linux-arm-kernel,
	Mankad, Maulik Ojas, Gadiyar, Anand

On Wed, 17 Feb 2010, Shilimkar, Santosh wrote:

> How about below approach? Controller driver can set 
> "uses_pio_for_control" if it can't do dma for control transfer.
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 80995ef..e3eae02 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  
>  	if (usb_endpoint_xfer_control(&urb->ep->desc)
>  	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -		if (hcd->self.uses_dma) {
> +		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
>  			urb->setup_dma = dma_map_single(
>  					hcd->self.controller,
>  					urb->setup_packet,
> @@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
>  
>  	if (usb_endpoint_xfer_control(&urb->ep->desc)
>  	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -		if (hcd->self.uses_dma)
> +		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
>  			dma_unmap_single(hcd->self.controller, urb->setup_dma,
>  					sizeof(struct usb_ctrlrequest),
>  					DMA_TO_DEVICE);
> diff --git a/include/linux/usb.h b/include/linux/usb.h
> index d7ace1b..ba5b0a2 100644
> --- a/include/linux/usb.h
> +++ b/include/linux/usb.h
> @@ -329,6 +329,9 @@ struct usb_bus {
>  	int busnum;			/* Bus number (in order of reg) */
>  	const char *bus_name;		/* stable id (PCI slot_name etc) */
>  	u8 uses_dma;			/* Does the host controller use DMA? */
> +	u8 uses_pio_for_control;	/* Does the host controller use PIO
> +					 * for control tansfers? 
> +					 */
>  	u8 otg_port;			/* 0, or number of OTG/HNP port */
>  	unsigned is_b_host:1;		/* true during some HNP roleswitches */
>  	unsigned b_hnp_enable:1;	/* OTG: did A-Host enable HNP? */

Why do you skip mapping the setup packet but not the data packet?

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 17:02                                           ` Alan Stern
  0 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-17 17:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 17 Feb 2010, Shilimkar, Santosh wrote:

> How about below approach? Controller driver can set 
> "uses_pio_for_control" if it can't do dma for control transfer.
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 80995ef..e3eae02 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  
>  	if (usb_endpoint_xfer_control(&urb->ep->desc)
>  	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -		if (hcd->self.uses_dma) {
> +		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
>  			urb->setup_dma = dma_map_single(
>  					hcd->self.controller,
>  					urb->setup_packet,
> @@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
>  
>  	if (usb_endpoint_xfer_control(&urb->ep->desc)
>  	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -		if (hcd->self.uses_dma)
> +		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
>  			dma_unmap_single(hcd->self.controller, urb->setup_dma,
>  					sizeof(struct usb_ctrlrequest),
>  					DMA_TO_DEVICE);
> diff --git a/include/linux/usb.h b/include/linux/usb.h
> index d7ace1b..ba5b0a2 100644
> --- a/include/linux/usb.h
> +++ b/include/linux/usb.h
> @@ -329,6 +329,9 @@ struct usb_bus {
>  	int busnum;			/* Bus number (in order of reg) */
>  	const char *bus_name;		/* stable id (PCI slot_name etc) */
>  	u8 uses_dma;			/* Does the host controller use DMA? */
> +	u8 uses_pio_for_control;	/* Does the host controller use PIO
> +					 * for control tansfers? 
> +					 */
>  	u8 otg_port;			/* 0, or number of OTG/HNP port */
>  	unsigned is_b_host:1;		/* true during some HNP roleswitches */
>  	unsigned b_hnp_enable:1;	/* OTG: did A-Host enable HNP? */

Why do you skip mapping the setup packet but not the data packet?

Alan Stern

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 17:02                                           ` Alan Stern
@ 2010-02-17 20:26                                             ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17 20:26 UTC (permalink / raw)
  To: Alan Stern
  Cc: Shilimkar, Santosh, Oliver Neukum, Catalin Marinas, Pavel Machek,
	Greg KH, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, linux-arm-kernel,
	Mankad, Maulik Ojas, Gadiyar, Anand

On Wed, Feb 17, 2010 at 12:02:21PM -0500, Alan Stern wrote:
> Why do you skip mapping the setup packet but not the data packet?

This is something of a FAQ in this thread.  Here are the responses to
similar questions yesterday:

"Gadiyar, Anand" <gadiyar@ti.com> said:
> Not really. For instance, in the case of the DMA engine in the MUSB
> controller in OMAP3, we can only use DMA with endpoints other than
> EP0, and EP0 is what is used for control transfers.
>
> It's not PIO for all the endpoints or DMA for all of them.

"Shilimkar, Santosh" <santosh.shilimkar@ti.com> said:
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control
> transfers and DMA for bulk transfers.
>
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU
> and they already coherent from CPU point of view.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 20:26                                             ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17 20:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 17, 2010 at 12:02:21PM -0500, Alan Stern wrote:
> Why do you skip mapping the setup packet but not the data packet?

This is something of a FAQ in this thread.  Here are the responses to
similar questions yesterday:

"Gadiyar, Anand" <gadiyar@ti.com> said:
> Not really. For instance, in the case of the DMA engine in the MUSB
> controller in OMAP3, we can only use DMA with endpoints other than
> EP0, and EP0 is what is used for control transfers.
>
> It's not PIO for all the endpoints or DMA for all of them.

"Shilimkar, Santosh" <santosh.shilimkar@ti.com> said:
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control
> transfers and DMA for bulk transfers.
>
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU
> and they already coherent from CPU point of view.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-17 17:02                                           ` Alan Stern
@ 2010-02-17 20:30                                             ` Gadiyar, Anand
  -1 siblings, 0 replies; 352+ messages in thread
From: Gadiyar, Anand @ 2010-02-17 20:30 UTC (permalink / raw)
  To: Alan Stern, Shilimkar, Santosh
  Cc: Oliver Neukum, Russell King - ARM Linux, Catalin Marinas,
	Pavel Machek, Greg KH, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, linux-arm-kernel,
	Mankad, Maulik Ojas

Alan Stern wrote:
> On Wed, 17 Feb 2010, Shilimkar, Santosh wrote:
> 
> > How about below approach? Controller driver can set
> > "uses_pio_for_control" if it can't do dma for control transfer.
> >
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 80995ef..e3eae02 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
> >
> >       if (usb_endpoint_xfer_control(&urb->ep->desc)
> >           && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -             if (hcd->self.uses_dma) {
> > +             if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
> >                       urb->setup_dma = dma_map_single(
> >                                       hcd->self.controller,
> >                                       urb->setup_packet,
> > @@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
> >
> >       if (usb_endpoint_xfer_control(&urb->ep->desc)
> >           && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -             if (hcd->self.uses_dma)
> > +             if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
> >                       dma_unmap_single(hcd->self.controller, urb->setup_dma,
> >                                       sizeof(struct usb_ctrlrequest),
> >                                       DMA_TO_DEVICE);
> > diff --git a/include/linux/usb.h b/include/linux/usb.h
> > index d7ace1b..ba5b0a2 100644
> > --- a/include/linux/usb.h
> > +++ b/include/linux/usb.h
> > @@ -329,6 +329,9 @@ struct usb_bus {
> >       int busnum;                     /* Bus number (in order of reg) */
> >       const char *bus_name;           /* stable id (PCI slot_name etc) */
> >       u8 uses_dma;                    /* Does the host controller use DMA? */
> > +     u8 uses_pio_for_control;        /* Does the host controller use PIO
> > +                                      * for control tansfers?
> > +                                      */
> >       u8 otg_port;                    /* 0, or number of OTG/HNP port */
> >       unsigned is_b_host:1;           /* true during some HNP roleswitches */
> >       unsigned b_hnp_enable:1;        /* OTG: did A-Host enable HNP? */
> 
> Why do you skip mapping the setup packet but not the data packet?
> 

I think that's oversight. For this controller, we need to skip mapping
all buffers used to do transfers on EP0, which is all control transfers.

Will fix in the next version of the patch.

- Anand

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 20:30                                             ` Gadiyar, Anand
  0 siblings, 0 replies; 352+ messages in thread
From: Gadiyar, Anand @ 2010-02-17 20:30 UTC (permalink / raw)
  To: linux-arm-kernel

Alan Stern wrote:
> On Wed, 17 Feb 2010, Shilimkar, Santosh wrote:
> 
> > How about below approach? Controller driver can set
> > "uses_pio_for_control" if it can't do dma for control transfer.
> >
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 80995ef..e3eae02 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
> >
> >       if (usb_endpoint_xfer_control(&urb->ep->desc)
> >           && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -             if (hcd->self.uses_dma) {
> > +             if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
> >                       urb->setup_dma = dma_map_single(
> >                                       hcd->self.controller,
> >                                       urb->setup_packet,
> > @@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
> >
> >       if (usb_endpoint_xfer_control(&urb->ep->desc)
> >           && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -             if (hcd->self.uses_dma)
> > +             if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
> >                       dma_unmap_single(hcd->self.controller, urb->setup_dma,
> >                                       sizeof(struct usb_ctrlrequest),
> >                                       DMA_TO_DEVICE);
> > diff --git a/include/linux/usb.h b/include/linux/usb.h
> > index d7ace1b..ba5b0a2 100644
> > --- a/include/linux/usb.h
> > +++ b/include/linux/usb.h
> > @@ -329,6 +329,9 @@ struct usb_bus {
> >       int busnum;                     /* Bus number (in order of reg) */
> >       const char *bus_name;           /* stable id (PCI slot_name etc) */
> >       u8 uses_dma;                    /* Does the host controller use DMA? */
> > +     u8 uses_pio_for_control;        /* Does the host controller use PIO
> > +                                      * for control tansfers?
> > +                                      */
> >       u8 otg_port;                    /* 0, or number of OTG/HNP port */
> >       unsigned is_b_host:1;           /* true during some HNP roleswitches */
> >       unsigned b_hnp_enable:1;        /* OTG: did A-Host enable HNP? */
> 
> Why do you skip mapping the setup packet but not the data packet?
> 

I think that's oversight. For this controller, we need to skip mapping
all buffers used to do transfers on EP0, which is all control transfers.

Will fix in the next version of the patch.

- Anand

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 15:27                           ` Catalin Marinas
@ 2010-02-17 20:37                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 20:37 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Oliver Neukum, Shilimkar, Santosh, Matthew Dharm,
	Russell King - ARM Linux, Ming Lei, Mankad, Maulik Ojas,
	Sergei Shtylyov, Sebastian Siewior, linux-usb, linux-kernel,
	Pavel Machek, Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 15:27 +0000, Catalin Marinas wrote:
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in this driver to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch1 bit. 

Actually, clear it :-)

I suppose that's one thing that needs to be fixed in the drivers.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 20:37                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 20:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 15:27 +0000, Catalin Marinas wrote:
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in this driver to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch1 bit. 

Actually, clear it :-)

I suppose that's one thing that needs to be fixed in the drivers.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 20:37                             ` Benjamin Herrenschmidt
@ 2010-02-17 20:44                               ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17 20:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, Oliver Neukum, Shilimkar, Santosh,
	Matthew Dharm, Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Thu, Feb 18, 2010 at 07:37:00AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-17 at 15:27 +0000, Catalin Marinas wrote:
> > We do the same on ARM. The problem with most (all) HCD drivers that do
> > PIO is that they copy the data to the transfer buffer but there is no
> > call in this driver to flush_dcache_page(). The upper mass storage or
> > filesystem layers don't call this function either, so there isn't
> > anything that would set the PG_arch1 bit. 
> 
> Actually, clear it :-)
> 
> I suppose that's one thing that needs to be fixed in the drivers.

No, because that'd probably bugger up the Sparc64 method of delaying
flush_dcache_page.

This method works as follows:

- a page cache page is allocated - this has PG_arch_1 clear.

- IO happens on it and it's placed into the page cache.  PG_arch_1 is
  still clear.

- someone calls read()/write() which accesses the page.  The generic
  file IO layers call flush_dcache_page() in response to read()/write()
  fs calls.  flush_dcache_page() spots that the page is not yet mapped
  into userspace, and sets PG_arch_1 to mark the fact that the kernel
  mapping is dirty.

- when someone maps the page, we check PG_arch_1 in update_mmu_cache.
  If PG_arch_1 is set, we flush the kernel mapping.

Clearly, if we go around having drivers clearing PG_arch_1, this is going
to break horribly.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 20:44                               ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17 20:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Feb 18, 2010 at 07:37:00AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-17 at 15:27 +0000, Catalin Marinas wrote:
> > We do the same on ARM. The problem with most (all) HCD drivers that do
> > PIO is that they copy the data to the transfer buffer but there is no
> > call in this driver to flush_dcache_page(). The upper mass storage or
> > filesystem layers don't call this function either, so there isn't
> > anything that would set the PG_arch1 bit. 
> 
> Actually, clear it :-)
> 
> I suppose that's one thing that needs to be fixed in the drivers.

No, because that'd probably bugger up the Sparc64 method of delaying
flush_dcache_page.

This method works as follows:

- a page cache page is allocated - this has PG_arch_1 clear.

- IO happens on it and it's placed into the page cache.  PG_arch_1 is
  still clear.

- someone calls read()/write() which accesses the page.  The generic
  file IO layers call flush_dcache_page() in response to read()/write()
  fs calls.  flush_dcache_page() spots that the page is not yet mapped
  into userspace, and sets PG_arch_1 to mark the fact that the kernel
  mapping is dirty.

- when someone maps the page, we check PG_arch_1 in update_mmu_cache.
  If PG_arch_1 is set, we flush the kernel mapping.

Clearly, if we go around having drivers clearing PG_arch_1, this is going
to break horribly.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 20:44                               ` Russell King - ARM Linux
@ 2010-02-17 22:31                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 22:31 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Catalin Marinas, Oliver Neukum, Shilimkar, Santosh,
	Matthew Dharm, Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> No, because that'd probably bugger up the Sparc64 method of delaying
> flush_dcache_page.
> 
> This method works as follows:
> 
> - a page cache page is allocated - this has PG_arch_1 clear.
> 
> - IO happens on it and it's placed into the page cache.  PG_arch_1 is
>   still clear.
> 
> - someone calls read()/write() which accesses the page.  The generic
>   file IO layers call flush_dcache_page() in response to
> read()/write()
>   fs calls.  flush_dcache_page() spots that the page is not yet mapped
>   into userspace, and sets PG_arch_1 to mark the fact that the kernel
>   mapping is dirty.
> 
> - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
>   If PG_arch_1 is set, we flush the kernel mapping.
> 
> Clearly, if we go around having drivers clearing PG_arch_1, this is
> going to break horribly. 

Ok, you do things very differently than us on ppc then. We clear
PG_arch_1 in flush_dcache_page(), and we set it when the page has been
cache cleaned for execution.

We assume that anybody that dirties a page in the kernel will call
flush_dcache_page() which removes our PG_arch_1 bit thus marking the
page "dirty".

Note that from experience, doing the check & flushes in
update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
where processor one does set_pte followed by update_mmu_cache(). The
later isn't done yet but processor 2 sees the PTE now and starts using
it, cache hasn't been fully flushed yet. You may avoid that race in some
ways, but on ppc, I've stopped using that.

I now do things directly in set_pte_at(). In fact, that's why I want
your patch to change update_mmu_cache() to take a PTE pointer :-) Since
my set_pte_at() can now remove the _PAGE_EXEC bit, I need
update_mmu_cache() to re-read the PTE before it updates the hash table
or TLB.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-17 22:31                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 22:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> No, because that'd probably bugger up the Sparc64 method of delaying
> flush_dcache_page.
> 
> This method works as follows:
> 
> - a page cache page is allocated - this has PG_arch_1 clear.
> 
> - IO happens on it and it's placed into the page cache.  PG_arch_1 is
>   still clear.
> 
> - someone calls read()/write() which accesses the page.  The generic
>   file IO layers call flush_dcache_page() in response to
> read()/write()
>   fs calls.  flush_dcache_page() spots that the page is not yet mapped
>   into userspace, and sets PG_arch_1 to mark the fact that the kernel
>   mapping is dirty.
> 
> - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
>   If PG_arch_1 is set, we flush the kernel mapping.
> 
> Clearly, if we go around having drivers clearing PG_arch_1, this is
> going to break horribly. 

Ok, you do things very differently than us on ppc then. We clear
PG_arch_1 in flush_dcache_page(), and we set it when the page has been
cache cleaned for execution.

We assume that anybody that dirties a page in the kernel will call
flush_dcache_page() which removes our PG_arch_1 bit thus marking the
page "dirty".

Note that from experience, doing the check & flushes in
update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
where processor one does set_pte followed by update_mmu_cache(). The
later isn't done yet but processor 2 sees the PTE now and starts using
it, cache hasn't been fully flushed yet. You may avoid that race in some
ways, but on ppc, I've stopped using that.

I now do things directly in set_pte_at(). In fact, that's why I want
your patch to change update_mmu_cache() to take a PTE pointer :-) Since
my set_pte_at() can now remove the _PAGE_EXEC bit, I need
update_mmu_cache() to re-read the PTE before it updates the hash table
or TLB.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 20:30                                             ` Gadiyar, Anand
@ 2010-02-18  6:56                                               ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-18  6:56 UTC (permalink / raw)
  To: Gadiyar, Anand
  Cc: Alan Stern, Shilimkar, Santosh, Russell King - ARM Linux,
	Catalin Marinas, Pavel Machek, Greg KH, Matthew Dharm,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, linux-usb,
	linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Am Mittwoch, 17. Februar 2010 21:30:24 schrieb Gadiyar, Anand:
> > Why do you skip mapping the setup packet but not the data packet?
> > 
> 
> I think that's oversight. For this controller, we need to skip mapping
> all buffers used to do transfers on EP0, which is all control transfers.

One thing more. Do you have an issue with EP 0 only or all control
endpoints? EP 0 must be control, but devices are within spec if they
have multiple control endpoints provided EP 0 is control.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-18  6:56                                               ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-18  6:56 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 21:30:24 schrieb Gadiyar, Anand:
> > Why do you skip mapping the setup packet but not the data packet?
> > 
> 
> I think that's oversight. For this controller, we need to skip mapping
> all buffers used to do transfers on EP0, which is all control transfers.

One thing more. Do you have an issue with EP 0 only or all control
endpoints? EP 0 must be control, but devices are within spec if they
have multiple control endpoints provided EP 0 is control.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* RE: USB mass storage and ARM cache coherency
  2010-02-18  6:56                                               ` Oliver Neukum
@ 2010-02-18  7:14                                                 ` Gadiyar, Anand
  -1 siblings, 0 replies; 352+ messages in thread
From: Gadiyar, Anand @ 2010-02-18  7:14 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Alan Stern, Shilimkar, Santosh, Russell King - ARM Linux,
	Catalin Marinas, Pavel Machek, Greg KH, Matthew Dharm,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, linux-usb,
	linux-kernel, linux-arm-kernel, Mankad, Maulik Ojas

Oliver Neukum wrote:
> Am Mittwoch, 17. Februar 2010 21:30:24 schrieb Gadiyar, Anand:
> > > Why do you skip mapping the setup packet but not the data packet?
> > > 
> > 
> > I think that's oversight. For this controller, we need to skip mapping
> > all buffers used to do transfers on EP0, which is all control transfers.
> 
> One thing more. Do you have an issue with EP 0 only or all control
> endpoints? EP 0 must be control, but devices are within spec if they
> have multiple control endpoints provided EP 0 is control.

Sorry for the confusion. The issue is not with EP 0 of devices
connected to the controller; the problem is with EP 0 on the host
controller itself.

The controller in question is the MUSB OTG controller present in
OMAPs, Davinci chips, and some Blackfins. The MUSB HCD driver is
written such that it carries out all control transfers on EP 0 of
the controller. All bulk transfers are carried out on other hardware
endpoints.

(This is the same "hardware endpoint" that is used in when the MUSB
is used in gadget mode.)


I'm not really sure why EP0 was chosen for control transfers, or
if there is a restriction that we *need* to use it. Let me study
the docs some more.

The problem is that with the driver code as written today, we use
EP 0 for all control transfers, and the DMA engine cannot do DMA
to this endpoint's FIFO.

- Anand

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-18  7:14                                                 ` Gadiyar, Anand
  0 siblings, 0 replies; 352+ messages in thread
From: Gadiyar, Anand @ 2010-02-18  7:14 UTC (permalink / raw)
  To: linux-arm-kernel

Oliver Neukum wrote:
> Am Mittwoch, 17. Februar 2010 21:30:24 schrieb Gadiyar, Anand:
> > > Why do you skip mapping the setup packet but not the data packet?
> > > 
> > 
> > I think that's oversight. For this controller, we need to skip mapping
> > all buffers used to do transfers on EP0, which is all control transfers.
> 
> One thing more. Do you have an issue with EP 0 only or all control
> endpoints? EP 0 must be control, but devices are within spec if they
> have multiple control endpoints provided EP 0 is control.

Sorry for the confusion. The issue is not with EP 0 of devices
connected to the controller; the problem is with EP 0 on the host
controller itself.

The controller in question is the MUSB OTG controller present in
OMAPs, Davinci chips, and some Blackfins. The MUSB HCD driver is
written such that it carries out all control transfers on EP 0 of
the controller. All bulk transfers are carried out on other hardware
endpoints.

(This is the same "hardware endpoint" that is used in when the MUSB
is used in gadget mode.)


I'm not really sure why EP0 was chosen for control transfers, or
if there is a restriction that we *need* to use it. Let me study
the docs some more.

The problem is that with the driver code as written today, we use
EP 0 for all control transfers, and the DMA engine cannot do DMA
to this endpoint's FIFO.

- Anand

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-17 22:31                                 ` Benjamin Herrenschmidt
@ 2010-02-19 17:15                                   ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-19 17:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Oliver Neukum, Shilimkar, Santosh,
	Matthew Dharm, Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, 2010-02-17 at 22:31 +0000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> > No, because that'd probably bugger up the Sparc64 method of delaying
> > flush_dcache_page.
> >
> > This method works as follows:
> >
> > - a page cache page is allocated - this has PG_arch_1 clear.
> >
> > - IO happens on it and it's placed into the page cache.  PG_arch_1 is
> >   still clear.
> >
> > - someone calls read()/write() which accesses the page.  The generic
> >   file IO layers call flush_dcache_page() in response to
> > read()/write()
> >   fs calls.  flush_dcache_page() spots that the page is not yet mapped
> >   into userspace, and sets PG_arch_1 to mark the fact that the kernel
> >   mapping is dirty.
> >
> > - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
> >   If PG_arch_1 is set, we flush the kernel mapping.
> >
> > Clearly, if we go around having drivers clearing PG_arch_1, this is
> > going to break horribly.
> 
> Ok, you do things very differently than us on ppc then. We clear
> PG_arch_1 in flush_dcache_page(), and we set it when the page has been
> cache cleaned for execution.

For this perspective it's not that different, just that we use the
negated PG_arch_1.

> We assume that anybody that dirties a page in the kernel will call
> flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> page "dirty".

This assumption is not valid with some drivers like USB HCD doing PIO.
But, yes, that's how it should be done.

> Note that from experience, doing the check & flushes in
> update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> where processor one does set_pte followed by update_mmu_cache(). The
> later isn't done yet but processor 2 sees the PTE now and starts using
> it, cache hasn't been fully flushed yet. You may avoid that race in some
> ways, but on ppc, I've stopped using that.

I think that's possible on ARM too. Having two threads on different
CPUs, one thread triggers a prefetch abort (instruction page fault) on
CPU0 but the second thread on CPU1 may branch into this page after
set_pte() (hence not fault) but before update_mmu_cache() doing the
flush.

On ARM11MPCore we flush the caches in flush_dcache_page() because the
cache maintenance operations weren't visible to the other CPUs.
Cortex-A9 broadcasts the cache operations in hardware so we can use lazy
flushing but with the race you pointed out.

Using set_pte_at() for delayed flushing may be a better option for ARM
as well (and maybe Documentation/cachetlb.txt updated).

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-19 17:15                                   ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-19 17:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 22:31 +0000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> > No, because that'd probably bugger up the Sparc64 method of delaying
> > flush_dcache_page.
> >
> > This method works as follows:
> >
> > - a page cache page is allocated - this has PG_arch_1 clear.
> >
> > - IO happens on it and it's placed into the page cache.  PG_arch_1 is
> >   still clear.
> >
> > - someone calls read()/write() which accesses the page.  The generic
> >   file IO layers call flush_dcache_page() in response to
> > read()/write()
> >   fs calls.  flush_dcache_page() spots that the page is not yet mapped
> >   into userspace, and sets PG_arch_1 to mark the fact that the kernel
> >   mapping is dirty.
> >
> > - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
> >   If PG_arch_1 is set, we flush the kernel mapping.
> >
> > Clearly, if we go around having drivers clearing PG_arch_1, this is
> > going to break horribly.
> 
> Ok, you do things very differently than us on ppc then. We clear
> PG_arch_1 in flush_dcache_page(), and we set it when the page has been
> cache cleaned for execution.

For this perspective it's not that different, just that we use the
negated PG_arch_1.

> We assume that anybody that dirties a page in the kernel will call
> flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> page "dirty".

This assumption is not valid with some drivers like USB HCD doing PIO.
But, yes, that's how it should be done.

> Note that from experience, doing the check & flushes in
> update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> where processor one does set_pte followed by update_mmu_cache(). The
> later isn't done yet but processor 2 sees the PTE now and starts using
> it, cache hasn't been fully flushed yet. You may avoid that race in some
> ways, but on ppc, I've stopped using that.

I think that's possible on ARM too. Having two threads on different
CPUs, one thread triggers a prefetch abort (instruction page fault) on
CPU0 but the second thread on CPU1 may branch into this page after
set_pte() (hence not fault) but before update_mmu_cache() doing the
flush.

On ARM11MPCore we flush the caches in flush_dcache_page() because the
cache maintenance operations weren't visible to the other CPUs.
Cortex-A9 broadcasts the cache operations in hardware so we can use lazy
flushing but with the race you pointed out.

Using set_pte_at() for delayed flushing may be a better option for ARM
as well (and maybe Documentation/cachetlb.txt updated).

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-19 17:15                                   ` Catalin Marinas
@ 2010-02-19 17:36                                     ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-19 17:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Matthew Dharm, linux-usb, Russell King - ARM Linux,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 22:31 +0000, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> > > No, because that'd probably bugger up the Sparc64 method of delaying
> > > flush_dcache_page.
> > >
> > > This method works as follows:
> > >
> > > - a page cache page is allocated - this has PG_arch_1 clear.
> > >
> > > - IO happens on it and it's placed into the page cache.  PG_arch_1 is
> > >   still clear.
> > >
> > > - someone calls read()/write() which accesses the page.  The generic
> > >   file IO layers call flush_dcache_page() in response to
> > > read()/write()
> > >   fs calls.  flush_dcache_page() spots that the page is not yet mapped
> > >   into userspace, and sets PG_arch_1 to mark the fact that the kernel
> > >   mapping is dirty.
> > >
> > > - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
> > >   If PG_arch_1 is set, we flush the kernel mapping.
> > >
> > > Clearly, if we go around having drivers clearing PG_arch_1, this is
> > > going to break horribly.
> >
> > Ok, you do things very differently than us on ppc then. We clear
> > PG_arch_1 in flush_dcache_page(), and we set it when the page has been
> > cache cleaned for execution.
> 
> For this perspective it's not that different, just that we use the
> negated PG_arch_1.

I got your point now (after reading the replies on linux-arch :)).

So PPC assumes that if PG_arch_1 is clear (the default), the page wasn't
cleaned. If there is no call to flush_dcache_page() but the page gets
mapped to user space, update_mmu_cache() (or set_pte_at()) would simply
assume that the page was dirtied, flush the caches and set this bit.

We could easily do this on ARM as well and assume that the page is dirty
if !PG_arch_1. But it only partially solves the problem (only for
faulted-in pages).

If a page is already mapped in user space, flush_dcache_page() on ARM
does the flushing rather than deferring it to update_mmu_cache(). The
PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
that the HCD could transfer data into a page cache page already mapped
in user space? My understanding is that the scenario above is possible.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-19 17:36                                     ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-19 17:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 22:31 +0000, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> > > No, because that'd probably bugger up the Sparc64 method of delaying
> > > flush_dcache_page.
> > >
> > > This method works as follows:
> > >
> > > - a page cache page is allocated - this has PG_arch_1 clear.
> > >
> > > - IO happens on it and it's placed into the page cache.  PG_arch_1 is
> > >   still clear.
> > >
> > > - someone calls read()/write() which accesses the page.  The generic
> > >   file IO layers call flush_dcache_page() in response to
> > > read()/write()
> > >   fs calls.  flush_dcache_page() spots that the page is not yet mapped
> > >   into userspace, and sets PG_arch_1 to mark the fact that the kernel
> > >   mapping is dirty.
> > >
> > > - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
> > >   If PG_arch_1 is set, we flush the kernel mapping.
> > >
> > > Clearly, if we go around having drivers clearing PG_arch_1, this is
> > > going to break horribly.
> >
> > Ok, you do things very differently than us on ppc then. We clear
> > PG_arch_1 in flush_dcache_page(), and we set it when the page has been
> > cache cleaned for execution.
> 
> For this perspective it's not that different, just that we use the
> negated PG_arch_1.

I got your point now (after reading the replies on linux-arch :)).

So PPC assumes that if PG_arch_1 is clear (the default), the page wasn't
cleaned. If there is no call to flush_dcache_page() but the page gets
mapped to user space, update_mmu_cache() (or set_pte_at()) would simply
assume that the page was dirtied, flush the caches and set this bit.

We could easily do this on ARM as well and assume that the page is dirty
if !PG_arch_1. But it only partially solves the problem (only for
faulted-in pages).

If a page is already mapped in user space, flush_dcache_page() on ARM
does the flushing rather than deferring it to update_mmu_cache(). The
PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
that the HCD could transfer data into a page cache page already mapped
in user space? My understanding is that the scenario above is possible.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-19 17:36                                     ` Catalin Marinas
@ 2010-02-19 20:53                                       ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-19 20:53 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Benjamin Herrenschmidt, Matthew Dharm, linux-usb,
	Russell King - ARM Linux, Mankad,Maulik Ojas, Sergei Shtylyov,
	Ming Lei, Sebastian Siewior, linux-kernel, Shilimkar,Santosh,
	Pavel Machek, Greg KH, linux-arm-kernel, James Bottomley

Am Freitag, 19. Februar 2010 18:36:51 schrieb Catalin Marinas:
> If a page is already mapped in user space, flush_dcache_page() on ARM
> does the flushing rather than deferring it to update_mmu_cache(). The
> PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> that the HCD could transfer data into a page cache page already mapped
> in user space? My understanding is that the scenario above is possible.

Yes, video drivers do that.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-19 20:53                                       ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-19 20:53 UTC (permalink / raw)
  To: linux-arm-kernel

Am Freitag, 19. Februar 2010 18:36:51 schrieb Catalin Marinas:
> If a page is already mapped in user space, flush_dcache_page() on ARM
> does the flushing rather than deferring it to update_mmu_cache(). The
> PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> that the HCD could transfer data into a page cache page already mapped
> in user space? My understanding is that the scenario above is possible.

Yes, video drivers do that.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-16  8:51                         ` Gadiyar, Anand
@ 2010-02-20  7:21                           ` Pete Zaitcev
  -1 siblings, 0 replies; 352+ messages in thread
From: Pete Zaitcev @ 2010-02-20  7:21 UTC (permalink / raw)
  To: Gadiyar, Anand
  Cc: Russell King - ARM Linux, Shilimkar, Santosh, Catalin Marinas,
	Pavel Machek, Greg KH, Matthew Dharm, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, linux-usb, linux-kernel, linux-arm-kernel,
	Mankad, Maulik Ojas

On Tue, 16 Feb 2010 14:21:48 +0530
"Gadiyar, Anand" <gadiyar@ti.com> wrote:

> >         hcd->self.uses_dma = (dev->dma_mask != NULL);
> > 
> > Is it easier to make sure that PIO devices don't have dev->dma_mask set?
> 
> Not really. For instance, in the case of the DMA engine in the MUSB
> controller in OMAP3, we can only use DMA with endpoints other than
> EP0, and EP0 is what is used for control transfers.
> 
> It's not PIO for all the endpoints or DMA for all of them.

The HC driver does not have to be 100% truthful here. If the system
is not HIGHMEM, HCD can easily set uses_dma to false yet use DMA
by mapping buffers itself, without relying on the quoted code.

On a HIGHMEM system, block layer will bounce-buffer data in such case.
Hopefuly not a problem for ARM?

All network stack drivers work that way, BTW.

-- Pete

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-20  7:21                           ` Pete Zaitcev
  0 siblings, 0 replies; 352+ messages in thread
From: Pete Zaitcev @ 2010-02-20  7:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 16 Feb 2010 14:21:48 +0530
"Gadiyar, Anand" <gadiyar@ti.com> wrote:

> >         hcd->self.uses_dma = (dev->dma_mask != NULL);
> > 
> > Is it easier to make sure that PIO devices don't have dev->dma_mask set?
> 
> Not really. For instance, in the case of the DMA engine in the MUSB
> controller in OMAP3, we can only use DMA with endpoints other than
> EP0, and EP0 is what is used for control transfers.
> 
> It's not PIO for all the endpoints or DMA for all of them.

The HC driver does not have to be 100% truthful here. If the system
is not HIGHMEM, HCD can easily set uses_dma to false yet use DMA
by mapping buffers itself, without relying on the quoted code.

On a HIGHMEM system, block layer will bounce-buffer data in such case.
Hopefuly not a problem for ARM?

All network stack drivers work that way, BTW.

-- Pete

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-19 17:15                                   ` Catalin Marinas
@ 2010-02-24  2:39                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Russell King - ARM Linux, Oliver Neukum, Shilimkar, Santosh,
	Matthew Dharm, Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> > Ok, you do things very differently than us on ppc then. We clear
> > PG_arch_1 in flush_dcache_page(), and we set it when the page has
> been
> > cache cleaned for execution.
> 
> For this perspective it's not that different, just that we use the
> negated PG_arch_1.

Right, though you default as "clean" while we default as "dirty".

> > We assume that anybody that dirties a page in the kernel will call
> > flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> > page "dirty".
> 
> This assumption is not valid with some drivers like USB HCD doing PIO.
> But, yes, that's how it should be done.

So we go back to the fix should be done at the individual drivers level.
If it's going to write into the page cache, it needs to whack the bits.

Now there's of course the question as to whether you really only want to
do that for a PIO access and not for a DMA access, I think on power, we
don't really discriminate that much (since in any case our icache still
needs flushing). Maybe it would be useful to separate the I$ and D$ bits
but I'm not sure I can be bothered.
 
> > Note that from experience, doing the check & flushes in
> > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the
> case
> > where processor one does set_pte followed by update_mmu_cache(). The
> > later isn't done yet but processor 2 sees the PTE now and starts
> using
> > it, cache hasn't been fully flushed yet. You may avoid that race in
> some
> > ways, but on ppc, I've stopped using that.
> 
> I think that's possible on ARM too. Having two threads on different
> CPUs, one thread triggers a prefetch abort (instruction page fault) on
> CPU0 but the second thread on CPU1 may branch into this page after
> set_pte() (hence not fault) but before update_mmu_cache() doing the
> flush.
> 
> On ARM11MPCore we flush the caches in flush_dcache_page() because the
> cache maintenance operations weren't visible to the other CPUs.

I'm not even sure that's going to be 100% correct. Don't you also need
to flush the remote icaches when you are dealing with instructions (such
as swap) anyways ?

I've had some discussions in the past with Russell and others around the
problem of non-broadcast cache ops on ARM SMP since that's also hurting
you hard with dma mappings.

Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
are still on even in local_irq_save() blocks right ? I haven't touched
low level ARM for years tho, I may have forgotten things).

In this case, you should probably use the same bits as A9 and simply
make them use FIQs on 11MP to make the other cores flush as well.

> Cortex-A9 broadcasts the cache operations in hardware so we can use
> lazy flushing but with the race you pointed out.

Right.

> Using set_pte_at() for delayed flushing may be a better option for ARM
> as well (and maybe Documentation/cachetlb.txt updated). 

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24  2:39                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> > Ok, you do things very differently than us on ppc then. We clear
> > PG_arch_1 in flush_dcache_page(), and we set it when the page has
> been
> > cache cleaned for execution.
> 
> For this perspective it's not that different, just that we use the
> negated PG_arch_1.

Right, though you default as "clean" while we default as "dirty".

> > We assume that anybody that dirties a page in the kernel will call
> > flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> > page "dirty".
> 
> This assumption is not valid with some drivers like USB HCD doing PIO.
> But, yes, that's how it should be done.

So we go back to the fix should be done at the individual drivers level.
If it's going to write into the page cache, it needs to whack the bits.

Now there's of course the question as to whether you really only want to
do that for a PIO access and not for a DMA access, I think on power, we
don't really discriminate that much (since in any case our icache still
needs flushing). Maybe it would be useful to separate the I$ and D$ bits
but I'm not sure I can be bothered.
 
> > Note that from experience, doing the check & flushes in
> > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the
> case
> > where processor one does set_pte followed by update_mmu_cache(). The
> > later isn't done yet but processor 2 sees the PTE now and starts
> using
> > it, cache hasn't been fully flushed yet. You may avoid that race in
> some
> > ways, but on ppc, I've stopped using that.
> 
> I think that's possible on ARM too. Having two threads on different
> CPUs, one thread triggers a prefetch abort (instruction page fault) on
> CPU0 but the second thread on CPU1 may branch into this page after
> set_pte() (hence not fault) but before update_mmu_cache() doing the
> flush.
> 
> On ARM11MPCore we flush the caches in flush_dcache_page() because the
> cache maintenance operations weren't visible to the other CPUs.

I'm not even sure that's going to be 100% correct. Don't you also need
to flush the remote icaches when you are dealing with instructions (such
as swap) anyways ?

I've had some discussions in the past with Russell and others around the
problem of non-broadcast cache ops on ARM SMP since that's also hurting
you hard with dma mappings.

Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
are still on even in local_irq_save() blocks right ? I haven't touched
low level ARM for years tho, I may have forgotten things).

In this case, you should probably use the same bits as A9 and simply
make them use FIQs on 11MP to make the other cores flush as well.

> Cortex-A9 broadcasts the cache operations in hardware so we can use
> lazy flushing but with the race you pointed out.

Right.

> Using set_pte_at() for delayed flushing may be a better option for ARM
> as well (and maybe Documentation/cachetlb.txt updated). 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-19 17:36                                     ` Catalin Marinas
@ 2010-02-24  2:47                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:47 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, linux-usb, Russell King - ARM Linux,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Fri, 2010-02-19 at 17:36 +0000, Catalin Marinas wrote:
> 
> If a page is already mapped in user space, flush_dcache_page() on ARM
> does the flushing rather than deferring it to update_mmu_cache(). 

This is for D-cache aliases on VIVT right ? Or are you still talking
about I/D coherency on PIPT ARMs ? Because the later should not matter
for already mapped userspace pages in the sense that if user space
explicitely read() onto a page, it's up to userspace to cache clean that
page before executing from it in my book :-)

> The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> that the HCD could transfer data into a page cache page already mapped
> in user space? My understanding is that the scenario above is possible.

It is but I'm not confident the responsibility for doing that cleanup
is at the HCD level. That would impact a lot of HCD activities that
don't need such flushing since the use of the page is purely in-kernel.

Though I suppose that could be optimized out in most case using the page
use count.

But I still wonder whether it should be pushed down to the actual
interface drivers, that's always been the case I believe. In fact, in
the case of block ops, it's generally done at the BIO or even file
system layer right ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24  2:47                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 17:36 +0000, Catalin Marinas wrote:
> 
> If a page is already mapped in user space, flush_dcache_page() on ARM
> does the flushing rather than deferring it to update_mmu_cache(). 

This is for D-cache aliases on VIVT right ? Or are you still talking
about I/D coherency on PIPT ARMs ? Because the later should not matter
for already mapped userspace pages in the sense that if user space
explicitely read() onto a page, it's up to userspace to cache clean that
page before executing from it in my book :-)

> The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> that the HCD could transfer data into a page cache page already mapped
> in user space? My understanding is that the scenario above is possible.

It is but I'm not confident the responsibility for doing that cleanup
is at the HCD level. That would impact a lot of HCD activities that
don't need such flushing since the use of the page is purely in-kernel.

Though I suppose that could be optimized out in most case using the page
use count.

But I still wonder whether it should be pushed down to the actual
interface drivers, that's always been the case I believe. In fact, in
the case of block ops, it's generally done at the BIO or even file
system layer right ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-19 20:53                                       ` Oliver Neukum
@ 2010-02-24  2:48                                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:48 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Catalin Marinas, Matthew Dharm, Russell King - ARM Linux,
	Greg KH, Mankad, Maulik Ojas, Sergei Shtylyov, Sebastian Siewior,
	linux-usb, linux-kernel, James Bottomley, Shilimkar, Santosh,
	Pavel Machek, Ming Lei, linux-arm-kernel

On Fri, 2010-02-19 at 21:53 +0100, Oliver Neukum wrote:
> Am Freitag, 19. Februar 2010 18:36:51 schrieb Catalin Marinas:
> > If a page is already mapped in user space, flush_dcache_page() on ARM
> > does the flushing rather than deferring it to update_mmu_cache(). The
> > PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> Yes, video drivers do that. 

In which case it would be up to the video driver to call
flush_dcache_page() (though if it's v4l you are talking about, maybe it
might make sense to push it into the v4l layer itself).

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24  2:48                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 21:53 +0100, Oliver Neukum wrote:
> Am Freitag, 19. Februar 2010 18:36:51 schrieb Catalin Marinas:
> > If a page is already mapped in user space, flush_dcache_page() on ARM
> > does the flushing rather than deferring it to update_mmu_cache(). The
> > PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> Yes, video drivers do that. 

In which case it would be up to the video driver to call
flush_dcache_page() (though if it's v4l you are talking about, maybe it
might make sense to push it into the v4l layer itself).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24  2:48                                         ` Benjamin Herrenschmidt
@ 2010-02-24  7:16                                           ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-24  7:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, Matthew Dharm, Russell King - ARM Linux,
	Greg KH, Mankad, Maulik Ojas, Sergei Shtylyov, Sebastian Siewior,
	linux-usb, linux-kernel, James Bottomley, Shilimkar, Santosh,
	Pavel Machek, Ming Lei, linux-arm-kernel

Am Mittwoch, 24. Februar 2010 03:48:09 schrieb Benjamin Herrenschmidt:
> On Fri, 2010-02-19 at 21:53 +0100, Oliver Neukum wrote:
> > Am Freitag, 19. Februar 2010 18:36:51 schrieb Catalin Marinas:
> > > If a page is already mapped in user space, flush_dcache_page() on ARM
> > > does the flushing rather than deferring it to update_mmu_cache(). The
> > > PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > > that the HCD could transfer data into a page cache page already mapped
> > > in user space? My understanding is that the scenario above is possible.
> > 
> > Yes, video drivers do that. 
> 
> In which case it would be up to the video driver to call
> flush_dcache_page() (though if it's v4l you are talking about, maybe it
> might make sense to push it into the v4l layer itself).

I don't know. The issue seems quite complex. It would seem better to
centralize it as far as practical. Do you have a wrapper drivers could
call?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24  7:16                                           ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-24  7:16 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 24. Februar 2010 03:48:09 schrieb Benjamin Herrenschmidt:
> On Fri, 2010-02-19 at 21:53 +0100, Oliver Neukum wrote:
> > Am Freitag, 19. Februar 2010 18:36:51 schrieb Catalin Marinas:
> > > If a page is already mapped in user space, flush_dcache_page() on ARM
> > > does the flushing rather than deferring it to update_mmu_cache(). The
> > > PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > > that the HCD could transfer data into a page cache page already mapped
> > > in user space? My understanding is that the scenario above is possible.
> > 
> > Yes, video drivers do that. 
> 
> In which case it would be up to the video driver to call
> flush_dcache_page() (though if it's v4l you are talking about, maybe it
> might make sense to push it into the v4l layer itself).

I don't know. The issue seems quite complex. It would seem better to
centralize it as far as practical. Do you have a wrapper drivers could
call?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24  2:47                                       ` Benjamin Herrenschmidt
@ 2010-02-24 16:19                                         ` Alan Stern
  -1 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-24 16:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, Matthew Dharm, linux-usb,
	Russell King - ARM Linux, Mankad,Maulik Ojas, Sergei Shtylyov,
	Ming Lei, Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Pavel Machek, Greg KH, linux-arm-kernel,
	James Bottomley

On Wed, 24 Feb 2010, Benjamin Herrenschmidt wrote:

> > The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> It is but I'm not confident the responsibility for doing that cleanup
> is at the HCD level. That would impact a lot of HCD activities that
> don't need such flushing since the use of the page is purely in-kernel.

That's right.  The HCD merely puts data wherever it's told to.  It 
doesn't know whether the destination is in the page cache, in 
userspace, or anywhere else.  The same is true for usb-storage.

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24 16:19                                         ` Alan Stern
  0 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-24 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 24 Feb 2010, Benjamin Herrenschmidt wrote:

> > The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> It is but I'm not confident the responsibility for doing that cleanup
> is at the HCD level. That would impact a lot of HCD activities that
> don't need such flushing since the use of the page is purely in-kernel.

That's right.  The HCD merely puts data wherever it's told to.  It 
doesn't know whether the destination is in the page cache, in 
userspace, or anywhere else.  The same is true for usb-storage.

Alan Stern

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24  7:16                                           ` Oliver Neukum
@ 2010-02-24 21:12                                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24 21:12 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Catalin Marinas, Matthew Dharm, Russell King - ARM Linux,
	Greg KH, Mankad, Maulik Ojas, Sergei Shtylyov, Sebastian Siewior,
	linux-usb, linux-kernel, James Bottomley, Shilimkar, Santosh,
	Pavel Machek, Ming Lei, linux-arm-kernel

On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> I don't know. The issue seems quite complex. It would seem better to
> centralize it as far as practical. Do you have a wrapper drivers could
> call?

flush_dcache_page() ? :-)

Now, the subsystem might be the one to know whether something is mapped
into userspace or not (v4l in our case) in which case a wrapper could be
created.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24 21:12                                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24 21:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> I don't know. The issue seems quite complex. It would seem better to
> centralize it as far as practical. Do you have a wrapper drivers could
> call?

flush_dcache_page() ? :-)

Now, the subsystem might be the one to know whether something is mapped
into userspace or not (v4l in our case) in which case a wrapper could be
created.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24 16:19                                         ` Alan Stern
@ 2010-02-24 21:13                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24 21:13 UTC (permalink / raw)
  To: Alan Stern
  Cc: Catalin Marinas, Matthew Dharm, linux-usb,
	Russell King - ARM Linux, Mankad,Maulik Ojas, Sergei Shtylyov,
	Ming Lei, Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Pavel Machek, Greg KH, linux-arm-kernel,
	James Bottomley

On Wed, 2010-02-24 at 11:19 -0500, Alan Stern wrote:
> > It is but I'm not confident the responsibility for doing that
> cleanup
> > is at the HCD level. That would impact a lot of HCD activities that
> > don't need such flushing since the use of the page is purely
> in-kernel.
> 
> That's right.  The HCD merely puts data wherever it's told to.  It 
> doesn't know whether the destination is in the page cache, in 
> userspace, or anywhere else.  The same is true for usb-storage.

I'm surprised that usb-storage has an issue here. It shouldn't afaik,
since it's just a SCSI driver (or not anymore ?) and the BIO or
filesystems handle things there no ? I haven't seen a single call to
flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
when I looked...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24 21:13                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24 21:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 11:19 -0500, Alan Stern wrote:
> > It is but I'm not confident the responsibility for doing that
> cleanup
> > is at the HCD level. That would impact a lot of HCD activities that
> > don't need such flushing since the use of the page is purely
> in-kernel.
> 
> That's right.  The HCD merely puts data wherever it's told to.  It 
> doesn't know whether the destination is in the page cache, in 
> userspace, or anywhere else.  The same is true for usb-storage.

I'm surprised that usb-storage has an issue here. It shouldn't afaik,
since it's just a SCSI driver (or not anymore ?) and the BIO or
filesystems handle things there no ? I haven't seen a single call to
flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
when I looked...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24 21:13                                           ` Benjamin Herrenschmidt
@ 2010-02-24 21:50                                             ` Alan Stern
  -1 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-24 21:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, Matthew Dharm, linux-usb,
	Russell King - ARM Linux, Mankad,Maulik Ojas, Sergei Shtylyov,
	Ming Lei, Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Pavel Machek, Greg KH, linux-arm-kernel,
	James Bottomley

On Thu, 25 Feb 2010, Benjamin Herrenschmidt wrote:

> I'm surprised that usb-storage has an issue here. It shouldn't afaik,
> since it's just a SCSI driver (or not anymore ?)

It still is.  There's also the ub driver, which is a non-SCSI block 
device driver for some of the same devices handled by usb-storage.

> and the BIO or
> filesystems handle things there no ? I haven't seen a single call to
> flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
> when I looked...

There is no real issue; it's just that the problem was first noted in 
connection with usb-storage reading in executable pages, so Catalin's 
initial post was oriented toward modifying usb-storage.

The main issue here is that the same host controller will use PIO
sometimes and DMA sometimes, depending on the details of the transfer.  
The USB core didn't expect this and consequently we violated the rules
for DMA mapping.  The question is: If the core is fixed so that the
rules aren't violated, will everything work correctly?

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-24 21:50                                             ` Alan Stern
  0 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-24 21:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 25 Feb 2010, Benjamin Herrenschmidt wrote:

> I'm surprised that usb-storage has an issue here. It shouldn't afaik,
> since it's just a SCSI driver (or not anymore ?)

It still is.  There's also the ub driver, which is a non-SCSI block 
device driver for some of the same devices handled by usb-storage.

> and the BIO or
> filesystems handle things there no ? I haven't seen a single call to
> flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
> when I looked...

There is no real issue; it's just that the problem was first noted in 
connection with usb-storage reading in executable pages, so Catalin's 
initial post was oriented toward modifying usb-storage.

The main issue here is that the same host controller will use PIO
sometimes and DMA sometimes, depending on the details of the transfer.  
The USB core didn't expect this and consequently we violated the rules
for DMA mapping.  The question is: If the core is fixed so that the
rules aren't violated, will everything work correctly?

Alan Stern

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24 21:12                                             ` Benjamin Herrenschmidt
@ 2010-02-25  3:48                                               ` Oliver Neukum
  -1 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-25  3:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, Matthew Dharm, Russell King - ARM Linux,
	Greg KH, Mankad, Maulik Ojas, Sergei Shtylyov, Sebastian Siewior,
	linux-usb, linux-kernel, James Bottomley, Shilimkar, Santosh,
	Pavel Machek, Ming Lei, linux-arm-kernel

Am Mittwoch, 24. Februar 2010 22:12:34 schrieb Benjamin Herrenschmidt:
> On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > I don't know. The issue seems quite complex. It would seem better to
> > centralize it as far as practical. Do you have a wrapper drivers could
> > call?
> 
> flush_dcache_page() ? :-)

Will this do anything on arches that don't need it?
Secondly, can we have a wrapper that you can pass a pointer and an
offset?
 
> Now, the subsystem might be the one to know whether something is mapped
> into userspace or not (v4l in our case) in which case a wrapper could be
> created.

If possible, I'd like to centralize this. Drivers are likely to get this wrong.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-25  3:48                                               ` Oliver Neukum
  0 siblings, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-02-25  3:48 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 24. Februar 2010 22:12:34 schrieb Benjamin Herrenschmidt:
> On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > I don't know. The issue seems quite complex. It would seem better to
> > centralize it as far as practical. Do you have a wrapper drivers could
> > call?
> 
> flush_dcache_page() ? :-)

Will this do anything on arches that don't need it?
Secondly, can we have a wrapper that you can pass a pointer and an
offset?
 
> Now, the subsystem might be the one to know whether something is mapped
> into userspace or not (v4l in our case) in which case a wrapper could be
> created.

If possible, I'd like to centralize this. Drivers are likely to get this wrong.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24 21:12                                             ` Benjamin Herrenschmidt
@ 2010-02-25 12:36                                               ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-02-25 12:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Oliver Neukum, Catalin Marinas, Matthew Dharm,
	Russell King - ARM Linux, Greg KH, Mankad, Maulik Ojas,
	Sergei Shtylyov, Sebastian Siewior, linux-usb, linux-kernel,
	Shilimkar, Santosh, Pavel Machek, Ming Lei, linux-arm-kernel

On Thu, 2010-02-25 at 08:12 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > I don't know. The issue seems quite complex. It would seem better to
> > centralize it as far as practical. Do you have a wrapper drivers could
> > call?
> 
> flush_dcache_page() ? :-)

Actually, that can be wrong depending on the implementation.  The
problem is incoherency of the kernel page (dirty) with respect to user
space aliases (clean).  What has to happen on parisc is that the kernel
alias needs flushing.  We can guarantee the userspace aliases to be
clean (and not moved in).  We wouldn't want to incur the expense of
flushing the user space pages as well.

> Now, the subsystem might be the one to know whether something is mapped
> into userspace or not (v4l in our case) in which case a wrapper could be
> created.

Right, so it's the responsibility of the API used by the subsystem.
Thus Caitlin's pio_kmap seems the right one ... I don't understand what
the additional problems are.

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-25 12:36                                               ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-02-25 12:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-02-25 at 08:12 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > I don't know. The issue seems quite complex. It would seem better to
> > centralize it as far as practical. Do you have a wrapper drivers could
> > call?
> 
> flush_dcache_page() ? :-)

Actually, that can be wrong depending on the implementation.  The
problem is incoherency of the kernel page (dirty) with respect to user
space aliases (clean).  What has to happen on parisc is that the kernel
alias needs flushing.  We can guarantee the userspace aliases to be
clean (and not moved in).  We wouldn't want to incur the expense of
flushing the user space pages as well.

> Now, the subsystem might be the one to know whether something is mapped
> into userspace or not (v4l in our case) in which case a wrapper could be
> created.

Right, so it's the responsibility of the API used by the subsystem.
Thus Caitlin's pio_kmap seems the right one ... I don't understand what
the additional problems are.

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24 21:50                                             ` Alan Stern
@ 2010-02-25 20:52                                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-25 20:52 UTC (permalink / raw)
  To: Alan Stern
  Cc: Catalin Marinas, Matthew Dharm, linux-usb,
	Russell King - ARM Linux, Mankad,Maulik Ojas, Sergei Shtylyov,
	Ming Lei, Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Pavel Machek, Greg KH, linux-arm-kernel,
	James Bottomley

On Wed, 2010-02-24 at 16:50 -0500, Alan Stern wrote:
> The main issue here is that the same host controller will use PIO
> sometimes and DMA sometimes, depending on the details of the
> transfer.  
> The USB core didn't expect this and consequently we violated the rules
> for DMA mapping.  The question is: If the core is fixed so that the
> rules aren't violated, will everything work correctly? 

As long as the only issue is that one (ie, doing PIO while dma-map'ed),
then yes, I'd say things should work. If not, then there is -another-
problem to be fixed :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-25 20:52                                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-25 20:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 16:50 -0500, Alan Stern wrote:
> The main issue here is that the same host controller will use PIO
> sometimes and DMA sometimes, depending on the details of the
> transfer.  
> The USB core didn't expect this and consequently we violated the rules
> for DMA mapping.  The question is: If the core is fixed so that the
> rules aren't violated, will everything work correctly? 

As long as the only issue is that one (ie, doing PIO while dma-map'ed),
then yes, I'd say things should work. If not, then there is -another-
problem to be fixed :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-25  3:48                                               ` Oliver Neukum
@ 2010-02-26  0:22                                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26  0:22 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Matthew Dharm, Russell King - ARM Linux, Ming Lei, Mankad,
	Maulik Ojas, Sergei Shtylyov, Catalin Marinas, Sebastian Siewior,
	linux-usb, linux-kernel, James Bottomley, Shilimkar, Santosh,
	Pavel Machek, Greg KH, linux-arm-kernel

On Thu, 2010-02-25 at 04:48 +0100, Oliver Neukum wrote:
> Am Mittwoch, 24. Februar 2010 22:12:34 schrieb Benjamin Herrenschmidt:
> > On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > > I don't know. The issue seems quite complex. It would seem better to
> > > centralize it as far as practical. Do you have a wrapper drivers could
> > > call?
> > 
> > flush_dcache_page() ? :-)
> 
> Will this do anything on arches that don't need it?

No, it's going to be an empty inline:

arch/x86/include/asm/cacheflush.h:static inline void flush_dcache_page(struct page *page) { }

> Secondly, can we have a wrapper that you can pass a pointer and an
> offset?

I'm sure you can make one :-) Use virt_to_page() though that will not
work for vmap/vmalloc space of course.
 
> > Now, the subsystem might be the one to know whether something is mapped
> > into userspace or not (v4l in our case) in which case a wrapper could be
> > created.
> 
> If possible, I'd like to centralize this. Drivers are likely to get this wrong.

Right. In the case of v4l, it's probably something that should go into
the subsystem. IE. That's how it works for block too, it's done at the
BIO and/or filesystem layer (though individual filesystems do have their
hand in the pudding). 

Cheers,
Ben.

> 	Regards
> 		Oliver
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26  0:22                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26  0:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-02-25 at 04:48 +0100, Oliver Neukum wrote:
> Am Mittwoch, 24. Februar 2010 22:12:34 schrieb Benjamin Herrenschmidt:
> > On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > > I don't know. The issue seems quite complex. It would seem better to
> > > centralize it as far as practical. Do you have a wrapper drivers could
> > > call?
> > 
> > flush_dcache_page() ? :-)
> 
> Will this do anything on arches that don't need it?

No, it's going to be an empty inline:

arch/x86/include/asm/cacheflush.h:static inline void flush_dcache_page(struct page *page) { }

> Secondly, can we have a wrapper that you can pass a pointer and an
> offset?

I'm sure you can make one :-) Use virt_to_page() though that will not
work for vmap/vmalloc space of course.
 
> > Now, the subsystem might be the one to know whether something is mapped
> > into userspace or not (v4l in our case) in which case a wrapper could be
> > created.
> 
> If possible, I'd like to centralize this. Drivers are likely to get this wrong.

Right. In the case of v4l, it's probably something that should go into
the subsystem. IE. That's how it works for block too, it's done at the
BIO and/or filesystem layer (though individual filesystems do have their
hand in the pudding). 

Cheers,
Ben.

> 	Regards
> 		Oliver
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24 21:13                                           ` Benjamin Herrenschmidt
@ 2010-02-26 16:00                                             ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alan Stern, Matthew Dharm, linux-usb, Russell King - ARM Linux,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Wed, 2010-02-24 at 21:13 +0000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-24 at 11:19 -0500, Alan Stern wrote:
> > > It is but I'm not confident the responsibility for doing that cleanup
> > > is at the HCD level. That would impact a lot of HCD activities that
> > > don't need such flushing since the use of the page is purely in-kernel.
> >
> > That's right.  The HCD merely puts data wherever it's told to.  It
> > doesn't know whether the destination is in the page cache, in
> > userspace, or anywhere else.  The same is true for usb-storage.
> 
> I'm surprised that usb-storage has an issue here. It shouldn't afaik,
> since it's just a SCSI driver (or not anymore ?) and the BIO or
> filesystems handle things there no ? I haven't seen a single call to
> flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
> when I looked...

The BIO or filesystem code don't call flush_dcache_page() either (well
some do like cramfs or jffs but they decompress the data received from
the block device).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 16:00                                             ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 21:13 +0000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-24 at 11:19 -0500, Alan Stern wrote:
> > > It is but I'm not confident the responsibility for doing that cleanup
> > > is at the HCD level. That would impact a lot of HCD activities that
> > > don't need such flushing since the use of the page is purely in-kernel.
> >
> > That's right.  The HCD merely puts data wherever it's told to.  It
> > doesn't know whether the destination is in the page cache, in
> > userspace, or anywhere else.  The same is true for usb-storage.
> 
> I'm surprised that usb-storage has an issue here. It shouldn't afaik,
> since it's just a SCSI driver (or not anymore ?) and the BIO or
> filesystems handle things there no ? I haven't seen a single call to
> flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
> when I looked...

The BIO or filesystem code don't call flush_dcache_page() either (well
some do like cramfs or jffs but they decompress the data received from
the block device).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24  2:47                                       ` Benjamin Herrenschmidt
@ 2010-02-26 16:25                                         ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Matthew Dharm, linux-usb, Russell King - ARM Linux,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Wed, 2010-02-24 at 02:47 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-19 at 17:36 +0000, Catalin Marinas wrote:
> >
> > If a page is already mapped in user space, flush_dcache_page() on ARM
> > does the flushing rather than deferring it to update_mmu_cache().
> 
> This is for D-cache aliases on VIVT right ? Or are you still talking
> about I/D coherency on PIPT ARMs ? Because the later should not matter
> for already mapped userspace pages in the sense that if user space
> explicitely read() onto a page, it's up to userspace to cache clean that
> page before executing from it in my book :-)

I was still thinking about PIPT I/D coherency. The read() case you
mention is pretty clear, no need or the kernel to ensure coherency
(especially since writing is done via copy_to_user rather than to the
page cache page).

For mmap'ed pages (and present in the page cache), is it guaranteed that
the HCD driver won't write to it once it has been mapped into user
space? If that's the case, it may solve the problem by just reversing
the meaning of PG_arch_1 on ARM and assume that a newly allocated page
has dirty D-cache by default.

> > The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> It is but I'm not confident the responsibility for doing that cleanup
> is at the HCD level. That would impact a lot of HCD activities that
> don't need such flushing since the use of the page is purely in-kernel.
> 
> Though I suppose that could be optimized out in most case using the page
> use count.
> 
> But I still wonder whether it should be pushed down to the actual
> interface drivers, that's always been the case I believe. In fact, in
> the case of block ops, it's generally done at the BIO or even file
> system layer right ?

The filesystem layer does it only if it needs to touch the data written
by the block device (e.g. cramfs, jffs). Some block devices call
flush_dcache_page (like mmci.c) while some others don't (and those that
use DMA actually don't since the DMA API handles the flushing).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 16:25                                         ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 02:47 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-19 at 17:36 +0000, Catalin Marinas wrote:
> >
> > If a page is already mapped in user space, flush_dcache_page() on ARM
> > does the flushing rather than deferring it to update_mmu_cache().
> 
> This is for D-cache aliases on VIVT right ? Or are you still talking
> about I/D coherency on PIPT ARMs ? Because the later should not matter
> for already mapped userspace pages in the sense that if user space
> explicitely read() onto a page, it's up to userspace to cache clean that
> page before executing from it in my book :-)

I was still thinking about PIPT I/D coherency. The read() case you
mention is pretty clear, no need or the kernel to ensure coherency
(especially since writing is done via copy_to_user rather than to the
page cache page).

For mmap'ed pages (and present in the page cache), is it guaranteed that
the HCD driver won't write to it once it has been mapped into user
space? If that's the case, it may solve the problem by just reversing
the meaning of PG_arch_1 on ARM and assume that a newly allocated page
has dirty D-cache by default.

> > The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> It is but I'm not confident the responsibility for doing that cleanup
> is at the HCD level. That would impact a lot of HCD activities that
> don't need such flushing since the use of the page is purely in-kernel.
> 
> Though I suppose that could be optimized out in most case using the page
> use count.
> 
> But I still wonder whether it should be pushed down to the actual
> interface drivers, that's always been the case I believe. In fact, in
> the case of block ops, it's generally done at the BIO or even file
> system layer right ?

The filesystem layer does it only if it needs to touch the data written
by the block device (e.g. cramfs, jffs). Some block devices call
flush_dcache_page (like mmci.c) while some others don't (and those that
use DMA actually don't since the DMA API handles the flushing).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-24  2:39                                     ` Benjamin Herrenschmidt
@ 2010-02-26 16:44                                       ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Oliver Neukum, Shilimkar, Santosh,
	Matthew Dharm, Ming Lei, Mankad, Maulik Ojas, Sergei Shtylyov,
	Sebastian Siewior, linux-usb, linux-kernel, Pavel Machek,
	Greg KH, linux-arm-kernel

On Wed, 2010-02-24 at 02:39 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> > > We assume that anybody that dirties a page in the kernel will call
> > > flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> > > page "dirty".
> >
> > This assumption is not valid with some drivers like USB HCD doing PIO.
> > But, yes, that's how it should be done.
> 
> So we go back to the fix should be done at the individual drivers level.
> If it's going to write into the page cache, it needs to whack the bits.
> 
> Now there's of course the question as to whether you really only want to
> do that for a PIO access and not for a DMA access, I think on power, we
> don't really discriminate that much (since in any case our icache still
> needs flushing). Maybe it would be useful to separate the I$ and D$ bits
> but I'm not sure I can be bothered.

On ARM, update_mmu_cache() invalidates the I-cache (if VM_EXEC)
independent of whether the D-cache was dirty (since we can get
speculative fetches into the I-cache before it was even mapped).

> > > Note that from experience, doing the check & flushes in
> > > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> > > where processor one does set_pte followed by update_mmu_cache(). The
> > > later isn't done yet but processor 2 sees the PTE now and starts using
> > > it, cache hasn't been fully flushed yet. You may avoid that race in some
> > > ways, but on ppc, I've stopped using that.
> >
> > I think that's possible on ARM too. Having two threads on different
> > CPUs, one thread triggers a prefetch abort (instruction page fault) on
> > CPU0 but the second thread on CPU1 may branch into this page after
> > set_pte() (hence not fault) but before update_mmu_cache() doing the
> > flush.
> >
> > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > cache maintenance operations weren't visible to the other CPUs.
> 
> I'm not even sure that's going to be 100% correct. Don't you also need
> to flush the remote icaches when you are dealing with instructions (such
> as swap) anyways ?

I don't think we tried swap but for pages that have been mapped for the
first time, the I-cache would be clean. At mm switching, if a thread
migrates to a new CPU we invalidate the cache at that point.

> I've had some discussions in the past with Russell and others around the
> problem of non-broadcast cache ops on ARM SMP since that's also hurting
> you hard with dma mappings.
> 
> Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
> are still on even in local_irq_save() blocks right ? I haven't touched
> low level ARM for years tho, I may have forgotten things).

I have a patch for using IPIs via IRQ from the DMA API functions but,
while it works, it can deadlock with some drivers (complex situation).
Note that the patch added a specific IPI implementation which can cope
with interrupts being disabled (unlike the generic one).

My latest solution - http://bit.ly/apJv3O - is to use dummy
read-for-ownership or write-for-ownership accesses in the DMA cache
flushing functions to force cache line migration from the other CPUs.
Our current benchmarks only show around 10% disc throughput penalty
compared to the normal SMP case (compared to the UP case the penalty is
bigger but that's due to other things).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 16:44                                       ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 02:39 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> > > We assume that anybody that dirties a page in the kernel will call
> > > flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> > > page "dirty".
> >
> > This assumption is not valid with some drivers like USB HCD doing PIO.
> > But, yes, that's how it should be done.
> 
> So we go back to the fix should be done at the individual drivers level.
> If it's going to write into the page cache, it needs to whack the bits.
> 
> Now there's of course the question as to whether you really only want to
> do that for a PIO access and not for a DMA access, I think on power, we
> don't really discriminate that much (since in any case our icache still
> needs flushing). Maybe it would be useful to separate the I$ and D$ bits
> but I'm not sure I can be bothered.

On ARM, update_mmu_cache() invalidates the I-cache (if VM_EXEC)
independent of whether the D-cache was dirty (since we can get
speculative fetches into the I-cache before it was even mapped).

> > > Note that from experience, doing the check & flushes in
> > > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> > > where processor one does set_pte followed by update_mmu_cache(). The
> > > later isn't done yet but processor 2 sees the PTE now and starts using
> > > it, cache hasn't been fully flushed yet. You may avoid that race in some
> > > ways, but on ppc, I've stopped using that.
> >
> > I think that's possible on ARM too. Having two threads on different
> > CPUs, one thread triggers a prefetch abort (instruction page fault) on
> > CPU0 but the second thread on CPU1 may branch into this page after
> > set_pte() (hence not fault) but before update_mmu_cache() doing the
> > flush.
> >
> > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > cache maintenance operations weren't visible to the other CPUs.
> 
> I'm not even sure that's going to be 100% correct. Don't you also need
> to flush the remote icaches when you are dealing with instructions (such
> as swap) anyways ?

I don't think we tried swap but for pages that have been mapped for the
first time, the I-cache would be clean. At mm switching, if a thread
migrates to a new CPU we invalidate the cache at that point.

> I've had some discussions in the past with Russell and others around the
> problem of non-broadcast cache ops on ARM SMP since that's also hurting
> you hard with dma mappings.
> 
> Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
> are still on even in local_irq_save() blocks right ? I haven't touched
> low level ARM for years tho, I may have forgotten things).

I have a patch for using IPIs via IRQ from the DMA API functions but,
while it works, it can deadlock with some drivers (complex situation).
Note that the patch added a specific IPI implementation which can cope
with interrupts being disabled (unlike the generic one).

My latest solution - http://bit.ly/apJv3O - is to use dummy
read-for-ownership or write-for-ownership accesses in the DMA cache
flushing functions to force cache line migration from the other CPUs.
Our current benchmarks only show around 10% disc throughput penalty
compared to the normal SMP case (compared to the UP case the penalty is
bigger but that's due to other things).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 16:25                                         ` Catalin Marinas
@ 2010-02-26 16:52                                           ` Alan Stern
  -1 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-26 16:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Benjamin Herrenschmidt, Matthew Dharm, linux-usb,
	Russell King - ARM Linux, Mankad,Maulik Ojas, Sergei Shtylyov,
	Ming Lei, Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Pavel Machek, Greg KH, linux-arm-kernel,
	James Bottomley

On Fri, 26 Feb 2010, Catalin Marinas wrote:

> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

Nothing is guaranteed.  The HCD will write to wherever it is asked.  If 
a driver does input to an mmap'ed page, the HCD won't even know that 
the page is mmap'ed.

Alan Stern


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 16:52                                           ` Alan Stern
  0 siblings, 0 replies; 352+ messages in thread
From: Alan Stern @ 2010-02-26 16:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 26 Feb 2010, Catalin Marinas wrote:

> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

Nothing is guaranteed.  The HCD will write to wherever it is asked.  If 
a driver does input to an mmap'ed page, the HCD won't even know that 
the page is mmap'ed.

Alan Stern

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 16:25                                         ` Catalin Marinas
@ 2010-02-26 21:00                                           ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 21:00 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Benjamin Herrenschmidt, Matthew Dharm, linux-usb,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
unnecessary D cache flushing when clean pages get mapped into userspace.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 21:00                                           ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 21:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
unnecessary D cache flushing when clean pages get mapped into userspace.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 16:00                                             ` Catalin Marinas
@ 2010-02-26 21:36                                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:36 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Alan Stern, Matthew Dharm, linux-usb, Russell King - ARM Linux,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Fri, 2010-02-26 at 16:00 +0000, Catalin Marinas wrote:
> > I'm surprised that usb-storage has an issue here. It shouldn't
> afaik,
> > since it's just a SCSI driver (or not anymore ?) and the BIO or
> > filesystems handle things there no ? I haven't seen a single call to
> > flush_dcache_page() in any of drivers/scsi, drivers/ata or
> drivers/ide
> > when I looked...
> 
> The BIO or filesystem code don't call flush_dcache_page() either (well
> some do like cramfs or jffs but they decompress the data received from
> the block device). 

That's weird... that would mean that all existing PIO IDE or SCSI is
broken etc... Including I$/D$ cache coherency on powerpc and more. That
surprises me :-)

On an older kernel tree here:

$ grep -r flush_dcache_page fs | wc -l
118

So maybe that's where things need fixing ?

Cheers,
Ben.





^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 21:36                                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 16:00 +0000, Catalin Marinas wrote:
> > I'm surprised that usb-storage has an issue here. It shouldn't
> afaik,
> > since it's just a SCSI driver (or not anymore ?) and the BIO or
> > filesystems handle things there no ? I haven't seen a single call to
> > flush_dcache_page() in any of drivers/scsi, drivers/ata or
> drivers/ide
> > when I looked...
> 
> The BIO or filesystem code don't call flush_dcache_page() either (well
> some do like cramfs or jffs but they decompress the data received from
> the block device). 

That's weird... that would mean that all existing PIO IDE or SCSI is
broken etc... Including I$/D$ cache coherency on powerpc and more. That
surprises me :-)

On an older kernel tree here:

$ grep -r flush_dcache_page fs | wc -l
118

So maybe that's where things need fixing ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 16:25                                         ` Catalin Marinas
@ 2010-02-26 21:40                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:40 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, linux-usb, Russell King - ARM Linux,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Fri, 2010-02-26 at 16:25 +0000, Catalin Marinas wrote:
> 
> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

Well, I don't see why the HCD would write to it unless it's swapped out,
and thus unmapped or read() to or similar.

> The filesystem layer does it only if it needs to touch the data written
> by the block device (e.g. cramfs, jffs). Some block devices call
> flush_dcache_page (like mmci.c) while some others don't (and those that
> use DMA actually don't since the DMA API handles the flushing). 

Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
powerpc.. I'm afraid that whole cache handling stuff is totally
inconsistent since different archs have different expectations here.

Maybe we need to revisit things in that area, though it might require to
be done properly to have not one but two bits in struct page to
separately track the D$ and I$ state ...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 21:40                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 16:25 +0000, Catalin Marinas wrote:
> 
> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

Well, I don't see why the HCD would write to it unless it's swapped out,
and thus unmapped or read() to or similar.

> The filesystem layer does it only if it needs to touch the data written
> by the block device (e.g. cramfs, jffs). Some block devices call
> flush_dcache_page (like mmci.c) while some others don't (and those that
> use DMA actually don't since the DMA API handles the flushing). 

Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
powerpc.. I'm afraid that whole cache handling stuff is totally
inconsistent since different archs have different expectations here.

Maybe we need to revisit things in that area, though it might require to
be done properly to have not one but two bits in struct page to
separately track the D$ and I$ state ...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 21:40                                           ` Benjamin Herrenschmidt
@ 2010-02-26 21:49                                             ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 21:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, Matthew Dharm, linux-usb, Mankad,Maulik Ojas,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, Oliver Neukum,
	linux-kernel, Shilimkar,Santosh, Pavel Machek, Greg KH,
	linux-arm-kernel, James Bottomley

On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> powerpc.. I'm afraid that whole cache handling stuff is totally
> inconsistent since different archs have different expectations here.

It doesn't on ARM either.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 21:49                                             ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 21:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> powerpc.. I'm afraid that whole cache handling stuff is totally
> inconsistent since different archs have different expectations here.

It doesn't on ARM either.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 16:44                                       ` Catalin Marinas
@ 2010-02-26 21:49                                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:49 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, linux-usb, Russell King - ARM Linux, Mankad,
	Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar, Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel


> On ARM, update_mmu_cache() invalidates the I-cache (if VM_EXEC)
> independent of whether the D-cache was dirty (since we can get
> speculative fetches into the I-cache before it was even mapped).

We can get those speculative fetches too on power.

However, we only do the invalidate when PG_arch_1 is clear to avoid
doing it multiple time for a page that was already "cleaned". But it
seems that might not be that a good idea if indeed flush_dcache_page()
is not called for DMA transfers in most cases.

(In addition there is the race I mentioned with update_mmu_cache on SMP)

> > > > Note that from experience, doing the check & flushes in
> > > > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> > > > where processor one does set_pte followed by update_mmu_cache(). The
> > > > later isn't done yet but processor 2 sees the PTE now and starts using
> > > > it, cache hasn't been fully flushed yet. You may avoid that race in some
> > > > ways, but on ppc, I've stopped using that.
> > >
> > > I think that's possible on ARM too. Having two threads on different
> > > CPUs, one thread triggers a prefetch abort (instruction page fault) on
> > > CPU0 but the second thread on CPU1 may branch into this page after
> > > set_pte() (hence not fault) but before update_mmu_cache() doing the
> > > flush.
> > >
> > > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > > cache maintenance operations weren't visible to the other CPUs.
> > 
> > I'm not even sure that's going to be 100% correct. Don't you also need
> > to flush the remote icaches when you are dealing with instructions (such
> > as swap) anyways ?
> 
> I don't think we tried swap but for pages that have been mapped for the
> first time, the I-cache would be clean. 
>
> At mm switching, if a thread
> migrates to a new CPU we invalidate the cache at that point.

That sounds fragile. What about a multithread app with one thread on
each core hitting the pages at the same time ? Sounds racy to me...

> > I've had some discussions in the past with Russell and others around the
> > problem of non-broadcast cache ops on ARM SMP since that's also hurting
> > you hard with dma mappings.
> > 
> > Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
> > are still on even in local_irq_save() blocks right ? I haven't touched
> > low level ARM for years tho, I may have forgotten things).
> 
> I have a patch for using IPIs via IRQ from the DMA API functions but,
> while it works, it can deadlock with some drivers (complex situation).
> Note that the patch added a specific IPI implementation which can cope
> with interrupts being disabled (unlike the generic one).

It will deadlock if you use normal IRQs. I don't see a good way around
that other than using a higher-level type of IRQs. I though ARM has
something like that (FIQs ?). Can you use those guys for IPIs ?

> My latest solution - http://bit.ly/apJv3O - is to use dummy
> read-for-ownership or write-for-ownership accesses in the DMA cache
> flushing functions to force cache line migration from the other CPUs.

That might do, but won't help for the icache, will it ?

> Our current benchmarks only show around 10% disc throughput penalty
> compared to the normal SMP case (compared to the UP case the penalty is
> bigger but that's due to other things).

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 21:49                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:49 UTC (permalink / raw)
  To: linux-arm-kernel


> On ARM, update_mmu_cache() invalidates the I-cache (if VM_EXEC)
> independent of whether the D-cache was dirty (since we can get
> speculative fetches into the I-cache before it was even mapped).

We can get those speculative fetches too on power.

However, we only do the invalidate when PG_arch_1 is clear to avoid
doing it multiple time for a page that was already "cleaned". But it
seems that might not be that a good idea if indeed flush_dcache_page()
is not called for DMA transfers in most cases.

(In addition there is the race I mentioned with update_mmu_cache on SMP)

> > > > Note that from experience, doing the check & flushes in
> > > > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> > > > where processor one does set_pte followed by update_mmu_cache(). The
> > > > later isn't done yet but processor 2 sees the PTE now and starts using
> > > > it, cache hasn't been fully flushed yet. You may avoid that race in some
> > > > ways, but on ppc, I've stopped using that.
> > >
> > > I think that's possible on ARM too. Having two threads on different
> > > CPUs, one thread triggers a prefetch abort (instruction page fault) on
> > > CPU0 but the second thread on CPU1 may branch into this page after
> > > set_pte() (hence not fault) but before update_mmu_cache() doing the
> > > flush.
> > >
> > > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > > cache maintenance operations weren't visible to the other CPUs.
> > 
> > I'm not even sure that's going to be 100% correct. Don't you also need
> > to flush the remote icaches when you are dealing with instructions (such
> > as swap) anyways ?
> 
> I don't think we tried swap but for pages that have been mapped for the
> first time, the I-cache would be clean. 
>
> At mm switching, if a thread
> migrates to a new CPU we invalidate the cache at that point.

That sounds fragile. What about a multithread app with one thread on
each core hitting the pages at the same time ? Sounds racy to me...

> > I've had some discussions in the past with Russell and others around the
> > problem of non-broadcast cache ops on ARM SMP since that's also hurting
> > you hard with dma mappings.
> > 
> > Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
> > are still on even in local_irq_save() blocks right ? I haven't touched
> > low level ARM for years tho, I may have forgotten things).
> 
> I have a patch for using IPIs via IRQ from the DMA API functions but,
> while it works, it can deadlock with some drivers (complex situation).
> Note that the patch added a specific IPI implementation which can cope
> with interrupts being disabled (unlike the generic one).

It will deadlock if you use normal IRQs. I don't see a good way around
that other than using a higher-level type of IRQs. I though ARM has
something like that (FIQs ?). Can you use those guys for IPIs ?

> My latest solution - http://bit.ly/apJv3O - is to use dummy
> read-for-ownership or write-for-ownership accesses in the DMA cache
> flushing functions to force cache line migration from the other CPUs.

That might do, but won't help for the icache, will it ?

> Our current benchmarks only show around 10% disc throughput penalty
> compared to the normal SMP case (compared to the UP case the penalty is
> bigger but that's due to other things).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 16:52                                           ` Alan Stern
@ 2010-02-26 21:51                                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:51 UTC (permalink / raw)
  To: Alan Stern
  Cc: Catalin Marinas, Matthew Dharm, Oliver Neukum,
	Russell King - ARM Linux, Greg KH, Mankad, Maulik Ojas,
	Sergei Shtylyov, Sebastian Siewior, linux-usb, linux-kernel,
	James Bottomley, Shilimkar, Santosh, Pavel Machek, Ming Lei,
	linux-arm-kernel

On Fri, 2010-02-26 at 11:52 -0500, Alan Stern wrote:
> > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > the HCD driver won't write to it once it has been mapped into user
> > space? If that's the case, it may solve the problem by just reversing
> > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > has dirty D-cache by default.
> 
> Nothing is guaranteed.  The HCD will write to wherever it is asked.  If 
> a driver does input to an mmap'ed page, the HCD won't even know that 
> the page is mmap'ed.

Right but that won't happen unless somebody explicitely caused that
input to happen, typically, a userspace read(). I$/D$ coherency isn't
implicit in that case.

The question is more when the kernel itself moves a page in/out from
underneath the application (mmap'ed executable pages). One it's mapped
in, it won't be written to by the HCD unless something explicitely does
something to cause that write. If it's swapped out and back in, it will
have been unmapped. 

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 21:51                                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 11:52 -0500, Alan Stern wrote:
> > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > the HCD driver won't write to it once it has been mapped into user
> > space? If that's the case, it may solve the problem by just reversing
> > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > has dirty D-cache by default.
> 
> Nothing is guaranteed.  The HCD will write to wherever it is asked.  If 
> a driver does input to an mmap'ed page, the HCD won't even know that 
> the page is mmap'ed.

Right but that won't happen unless somebody explicitely caused that
input to happen, typically, a userspace read(). I$/D$ coherency isn't
implicit in that case.

The question is more when the kernel itself moves a page in/out from
underneath the application (mmap'ed executable pages). One it's mapped
in, it won't be written to by the HCD unless something explicitely does
something to cause that write. If it's swapped out and back in, it will
have been unmapped. 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 21:49                                         ` Benjamin Herrenschmidt
@ 2010-02-26 22:03                                           ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 22:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, Matthew Dharm, linux-usb, Mankad, Maulik Ojas,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, Oliver Neukum,
	linux-kernel, Shilimkar, Santosh, Pavel Machek, Greg KH,
	linux-arm-kernel

On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> It will deadlock if you use normal IRQs. I don't see a good way around
> that other than using a higher-level type of IRQs. I though ARM has
> something like that (FIQs ?). Can you use those guys for IPIs ?

If the hardware did support using FIQs for IPIs, this would not be
desirable because then it takes it away from the SoC folk to do what
they will with it.

In the past, it's been used as a fast CPU-driven "DMA" interface -
some SoCs have been wired up in such a way that's the only use
available for the FIQ.

The other problem we'd encounter using FIQs for IPIs is that some IPIs
need to take locks - and in order to make that safe, we'd either need
another class of locks which disable IRQs and FIQs together, or we'd
need to disable FIQs everywhere we disable IRQs - at which point FIQs
become utterly pointless.

(There only differences between FIQ and IRQ are:
 - on simultaneous raising of both, the FIQ will be called before the IRQ.
 - each has its own (single) vector.
 - invocation of FIQ masks IRQ.

What I'm saying is that what gives FIQ an advantage for SoC people is
that it's bare bones light weight and therefore extremely fast - as soon
as you load it up with additional complexity, it becomes less useful.)


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-26 22:03                                           ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 22:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> It will deadlock if you use normal IRQs. I don't see a good way around
> that other than using a higher-level type of IRQs. I though ARM has
> something like that (FIQs ?). Can you use those guys for IPIs ?

If the hardware did support using FIQs for IPIs, this would not be
desirable because then it takes it away from the SoC folk to do what
they will with it.

In the past, it's been used as a fast CPU-driven "DMA" interface -
some SoCs have been wired up in such a way that's the only use
available for the FIQ.

The other problem we'd encounter using FIQs for IPIs is that some IPIs
need to take locks - and in order to make that safe, we'd either need
another class of locks which disable IRQs and FIQs together, or we'd
need to disable FIQs everywhere we disable IRQs - at which point FIQs
become utterly pointless.

(There only differences between FIQ and IRQ are:
 - on simultaneous raising of both, the FIQ will be called before the IRQ.
 - each has its own (single) vector.
 - invocation of FIQ masks IRQ.

What I'm saying is that what gives FIQ an advantage for SoC people is
that it's bare bones light weight and therefore extremely fast - as soon
as you load it up with additional complexity, it becomes less useful.)

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 21:00                                           ` Russell King - ARM Linux
@ 2010-02-28  0:14                                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:14 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Catalin Marinas, Matthew Dharm, linux-usb, Mankad,Maulik Ojas,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, Oliver Neukum,
	linux-kernel, Shilimkar,Santosh, Pavel Machek, Greg KH,
	linux-arm-kernel, James Bottomley

On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > the HCD driver won't write to it once it has been mapped into user
> > space? If that's the case, it may solve the problem by just reversing
> > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > has dirty D-cache by default.
> 
> I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> unnecessary D cache flushing when clean pages get mapped into userspace.

That's an interesting thought for us too. When doing I$/D$ coherency, we
have to fist flush the D$ and then invalidate the I$. If we could keep
track of D$ and I$ separately, we could avoid the first step in many
cases, including the DMA API trick you mentioned.

I wonder if it's time to get a PG_arch_2 :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-28  0:14                                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > the HCD driver won't write to it once it has been mapped into user
> > space? If that's the case, it may solve the problem by just reversing
> > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > has dirty D-cache by default.
> 
> I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> unnecessary D cache flushing when clean pages get mapped into userspace.

That's an interesting thought for us too. When doing I$/D$ coherency, we
have to fist flush the D$ and then invalidate the I$. If we could keep
track of D$ and I$ separately, we could avoid the first step in many
cases, including the DMA API trick you mentioned.

I wonder if it's time to get a PG_arch_2 :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 21:49                                             ` Russell King - ARM Linux
@ 2010-02-28  0:24                                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:24 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Catalin Marinas, Matthew Dharm, linux-usb, Mankad,Maulik Ojas,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, Oliver Neukum,
	linux-kernel, Shilimkar,Santosh, Pavel Machek, Greg KH,
	linux-arm-kernel, James Bottomley

On Fri, 2010-02-26 at 21:49 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> > Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> > powerpc.. I'm afraid that whole cache handling stuff is totally
> > inconsistent since different archs have different expectations here.
> 
> It doesn't on ARM either.

Ok, pfiew :-)

So far, my understanding with I$/D$ is that we only care in a few cases
which is executing of an mmap'ed piece of executable that is -not- being
written to, and swap.

I -think- that in both cases, the page cache always pops up a new page
with PG_arch_1 clear before the driver gets to either DMA or PIO to it
when faulted the first time around, before any PTE is inserted.

So the current approach on powerpc with I$/D$ should work fine, and it
-might- make sense to use a similar one on PIPT ARM, provided we don't
have expectations of the I$/D$ coherency being maintained on
-subsequent- writes (PIO or DMA either) to such a page by the same
program transparently by the kernel.

There's two potential problems with the approach, and maybe more that I
have missed though. One is the case of a networked filesystem where the
executable pages are modified remotely. However, I would expect such a
program to invalidate the PTE mappings before making the change visible,
so we -do- get a chance to re-flush provided something clears PG_arch_1.

Then, there's In the case of a multithread app, where one thread does
the cache flush and another thread then executes, the earlier ARMs
without broadcast ops have a potential problem there. In fact, some
variant of PowerPC 440 have the same problem and some people are
(ab)using those for SMP setups I'm being told.

For that case, I see two options. One is a big hammer but would make
existing code work to "most" extent: Don't allow a page to be both
writable and executable. Ping-pong the page permission lazily and flush
when transitioning from write to exec.

That means using a spare bit for Linux _PAGE_RW separate from your real
RW bit I suppose, since you have HW loaded PTEs (on 440 it's easier
since we SW load, we can do the fixup there, though it has a perf impact
obviously).

Another option would be to make some syscall mandatory to "sync" caches
which could then do IPIs or whatever else is needed. But that would
require changing existing userspace code.




^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-28  0:24                                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 21:49 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> > Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> > powerpc.. I'm afraid that whole cache handling stuff is totally
> > inconsistent since different archs have different expectations here.
> 
> It doesn't on ARM either.

Ok, pfiew :-)

So far, my understanding with I$/D$ is that we only care in a few cases
which is executing of an mmap'ed piece of executable that is -not- being
written to, and swap.

I -think- that in both cases, the page cache always pops up a new page
with PG_arch_1 clear before the driver gets to either DMA or PIO to it
when faulted the first time around, before any PTE is inserted.

So the current approach on powerpc with I$/D$ should work fine, and it
-might- make sense to use a similar one on PIPT ARM, provided we don't
have expectations of the I$/D$ coherency being maintained on
-subsequent- writes (PIO or DMA either) to such a page by the same
program transparently by the kernel.

There's two potential problems with the approach, and maybe more that I
have missed though. One is the case of a networked filesystem where the
executable pages are modified remotely. However, I would expect such a
program to invalidate the PTE mappings before making the change visible,
so we -do- get a chance to re-flush provided something clears PG_arch_1.

Then, there's In the case of a multithread app, where one thread does
the cache flush and another thread then executes, the earlier ARMs
without broadcast ops have a potential problem there. In fact, some
variant of PowerPC 440 have the same problem and some people are
(ab)using those for SMP setups I'm being told.

For that case, I see two options. One is a big hammer but would make
existing code work to "most" extent: Don't allow a page to be both
writable and executable. Ping-pong the page permission lazily and flush
when transitioning from write to exec.

That means using a spare bit for Linux _PAGE_RW separate from your real
RW bit I suppose, since you have HW loaded PTEs (on 440 it's easier
since we SW load, we can do the fixup there, though it has a perf impact
obviously).

Another option would be to make some syscall mandatory to "sync" caches
which could then do IPIs or whatever else is needed. But that would
require changing existing userspace code.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 22:03                                           ` Russell King - ARM Linux
@ 2010-02-28  0:29                                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:29 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Catalin Marinas, Matthew Dharm, linux-usb, Mankad, Maulik Ojas,
	Sergei Shtylyov, Ming Lei, Sebastian Siewior, Oliver Neukum,
	linux-kernel, Shilimkar, Santosh, Pavel Machek, Greg KH,
	linux-arm-kernel

On Fri, 2010-02-26 at 22:03 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> > It will deadlock if you use normal IRQs. I don't see a good way around
> > that other than using a higher-level type of IRQs. I though ARM has
> > something like that (FIQs ?). Can you use those guys for IPIs ?
> 
> If the hardware did support using FIQs for IPIs, this would not be
> desirable because then it takes it away from the SoC folk to do what
> they will with it.
> 
> In the past, it's been used as a fast CPU-driven "DMA" interface -
> some SoCs have been wired up in such a way that's the only use
> available for the FIQ.

This is an issue indeed.

> The other problem we'd encounter using FIQs for IPIs is that some IPIs
> need to take locks - and in order to make that safe, we'd either need
> another class of locks which disable IRQs and FIQs together, or we'd
> need to disable FIQs everywhere we disable IRQs - at which point FIQs
> become utterly pointless.

That's solvable easily :-) I mentioned having potentially to deal with a
similar problem with people using PowerPC 440 for SMP (doesn't broadcast
cache ops either). 440 has critical interrupts, which are akin to FIQs.

The trick here is that you don't use -only- critical interrupts for
IPIs. You use normal interrupts for all the current IPI types. You -add-
a fast one using critical interrupts specifically for cache ops, with a
very fast asm only path.

This works for us because masking interrupts doesn't mask critical
interrupts (it's a separate mask bit in our MSR). If that isn't the case
with FIQs then the whole idea is moot.

> (There only differences between FIQ and IRQ are:
>  - on simultaneous raising of both, the FIQ will be called before the IRQ.
>  - each has its own (single) vector.
>  - invocation of FIQ masks IRQ.
> 
> What I'm saying is that what gives FIQ an advantage for SoC people is
> that it's bare bones light weight and therefore extremely fast - as soon
> as you load it up with additional complexity, it becomes less useful.)

I understand.

Then Catalin idea of tricking the cache with load and stores would work
for the D$ side of thing. The I$ side of thing probably still needs IPIs
though, and you might need to use non-blocking async SMP call function
for that if you're going to do it from set_pte_at() instead of
update_mmu_cache() since the later is racy. In any case, it's a lot less
of a deadlock nest than the D$ side which needs to be dealt with in the
DMA ops, called below layers of driver and subsystem locks.

Note: Somebody at ARM needs to be severely beaten up for coming up with
that SMP scheme without broadcast cache ops and not also mandating some
kind FIQ IPI scheme that isn't masked with normal interrupts :-)

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-28  0:29                                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 22:03 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> > It will deadlock if you use normal IRQs. I don't see a good way around
> > that other than using a higher-level type of IRQs. I though ARM has
> > something like that (FIQs ?). Can you use those guys for IPIs ?
> 
> If the hardware did support using FIQs for IPIs, this would not be
> desirable because then it takes it away from the SoC folk to do what
> they will with it.
> 
> In the past, it's been used as a fast CPU-driven "DMA" interface -
> some SoCs have been wired up in such a way that's the only use
> available for the FIQ.

This is an issue indeed.

> The other problem we'd encounter using FIQs for IPIs is that some IPIs
> need to take locks - and in order to make that safe, we'd either need
> another class of locks which disable IRQs and FIQs together, or we'd
> need to disable FIQs everywhere we disable IRQs - at which point FIQs
> become utterly pointless.

That's solvable easily :-) I mentioned having potentially to deal with a
similar problem with people using PowerPC 440 for SMP (doesn't broadcast
cache ops either). 440 has critical interrupts, which are akin to FIQs.

The trick here is that you don't use -only- critical interrupts for
IPIs. You use normal interrupts for all the current IPI types. You -add-
a fast one using critical interrupts specifically for cache ops, with a
very fast asm only path.

This works for us because masking interrupts doesn't mask critical
interrupts (it's a separate mask bit in our MSR). If that isn't the case
with FIQs then the whole idea is moot.

> (There only differences between FIQ and IRQ are:
>  - on simultaneous raising of both, the FIQ will be called before the IRQ.
>  - each has its own (single) vector.
>  - invocation of FIQ masks IRQ.
> 
> What I'm saying is that what gives FIQ an advantage for SoC people is
> that it's bare bones light weight and therefore extremely fast - as soon
> as you load it up with additional complexity, it becomes less useful.)

I understand.

Then Catalin idea of tricking the cache with load and stores would work
for the D$ side of thing. The I$ side of thing probably still needs IPIs
though, and you might need to use non-blocking async SMP call function
for that if you're going to do it from set_pte_at() instead of
update_mmu_cache() since the later is racy. In any case, it's a lot less
of a deadlock nest than the D$ side which needs to be dealt with in the
DMA ops, called below layers of driver and subsystem locks.

Note: Somebody at ARM needs to be severely beaten up for coming up with
that SMP scheme without broadcast cache ops and not also mandating some
kind FIQ IPI scheme that isn't masked with normal interrupts :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-28  0:14                                             ` Benjamin Herrenschmidt
@ 2010-02-28  5:01                                               ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-02-28  5:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Catalin Marinas, Matthew Dharm,
	linux-usb, Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Pavel Machek, Greg KH, linux-arm-kernel

On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > the HCD driver won't write to it once it has been mapped into user
> > > space? If that's the case, it may solve the problem by just reversing
> > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > has dirty D-cache by default.
> > 
> > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > unnecessary D cache flushing when clean pages get mapped into userspace.
> 
> That's an interesting thought for us too. When doing I$/D$ coherency, we
> have to fist flush the D$ and then invalidate the I$. If we could keep
> track of D$ and I$ separately, we could avoid the first step in many
> cases, including the DMA API trick you mentioned.
> 
> I wonder if it's time to get a PG_arch_2 :-)

Sorry to be a bit late to the party (on holiday), but I/D coherency is
supposed to be taken care of using flush_cache_page in the memory
mapping routines.  On parisc, at least, we don't use any PG_arch flags
to help.  The way it's supposed to work is that I is invalidated on
mapping or remapping, so the I/O code only needs to worry about flushing
D.  The guarantee we pass to userland is that any page we do I/O to has
a clean D cache before it goes back to userspace.  Thus if userspace
executes the page, the I cache gets its first movein there.  There is an
underlying assumption to all of this:  The CPU won't speculatively move
in I cache until the page is executed, so we can rely on the
flush_cache_page in the mapping to keep the I cache invalidated until
we're ready to execute.  The other fundamental assumption is that if
userspace needs to modify an executable region (say for dynamic linking)
it has to take care of reinvalidating the I cache itself ... although it
can do this by remapping the region to alter the flags (i.e W no X then
X no W).

But the point of all of this is that I cache invalidation doesn't appear
anywhere in the I/O path ... so  if we're getting I/D incoherency,
there's some problem in the mm code (or there's a missing arch
assumption ... like I cache gets moved in more aggressively than we
expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
there were a serious generic problem here.

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-28  5:01                                               ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-02-28  5:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > the HCD driver won't write to it once it has been mapped into user
> > > space? If that's the case, it may solve the problem by just reversing
> > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > has dirty D-cache by default.
> > 
> > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > unnecessary D cache flushing when clean pages get mapped into userspace.
> 
> That's an interesting thought for us too. When doing I$/D$ coherency, we
> have to fist flush the D$ and then invalidate the I$. If we could keep
> track of D$ and I$ separately, we could avoid the first step in many
> cases, including the DMA API trick you mentioned.
> 
> I wonder if it's time to get a PG_arch_2 :-)

Sorry to be a bit late to the party (on holiday), but I/D coherency is
supposed to be taken care of using flush_cache_page in the memory
mapping routines.  On parisc, at least, we don't use any PG_arch flags
to help.  The way it's supposed to work is that I is invalidated on
mapping or remapping, so the I/O code only needs to worry about flushing
D.  The guarantee we pass to userland is that any page we do I/O to has
a clean D cache before it goes back to userspace.  Thus if userspace
executes the page, the I cache gets its first movein there.  There is an
underlying assumption to all of this:  The CPU won't speculatively move
in I cache until the page is executed, so we can rely on the
flush_cache_page in the mapping to keep the I cache invalidated until
we're ready to execute.  The other fundamental assumption is that if
userspace needs to modify an executable region (say for dynamic linking)
it has to take care of reinvalidating the I cache itself ... although it
can do this by remapping the region to alter the flags (i.e W no X then
X no W).

But the point of all of this is that I cache invalidation doesn't appear
anywhere in the I/O path ... so  if we're getting I/D incoherency,
there's some problem in the mm code (or there's a missing arch
assumption ... like I cache gets moved in more aggressively than we
expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
there were a serious generic problem here.

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-28  0:24                                               ` Benjamin Herrenschmidt
@ 2010-02-28 19:17                                                 ` Pavel Machek
  -1 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-02-28 19:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Catalin Marinas, Matthew Dharm,
	linux-usb, Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Greg KH, linux-arm-kernel, James Bottomley

> There's two potential problems with the approach, and maybe more that I
> have missed though. One is the case of a networked filesystem where the
> executable pages are modified remotely. However, I would expect such a
> program to invalidate the PTE mappings before making the change visible,
> so we -do- get a chance to re-flush provided something clears PG_arch_1.
> 
> Then, there's In the case of a multithread app, where one thread does
> the cache flush and another thread then executes, the earlier ARMs
> without broadcast ops have a potential problem there. In fact, some
> variant of PowerPC 440 have the same problem and some people are
> (ab)using those for SMP setups I'm being told.
> 
> For that case, I see two options. One is a big hammer but would make
> existing code work to "most" extent: Don't allow a page to be both
> writable and executable. Ping-pong the page permission lazily and flush
> when transitioning from write to exec.
> 
> That means using a spare bit for Linux _PAGE_RW separate from your real
> RW bit I suppose, since you have HW loaded PTEs (on 440 it's easier
> since we SW load, we can do the fixup there, though it has a perf impact
> obviously).
> 
> Another option would be to make some syscall mandatory to "sync" caches
> which could then do IPIs or whatever else is needed. But that would
> require changing existing userspace code.

Or you could do first option by default, and add mmap flag that says
that application is responsible for cross-cpu cache flushes...?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-28 19:17                                                 ` Pavel Machek
  0 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-02-28 19:17 UTC (permalink / raw)
  To: linux-arm-kernel

> There's two potential problems with the approach, and maybe more that I
> have missed though. One is the case of a networked filesystem where the
> executable pages are modified remotely. However, I would expect such a
> program to invalidate the PTE mappings before making the change visible,
> so we -do- get a chance to re-flush provided something clears PG_arch_1.
> 
> Then, there's In the case of a multithread app, where one thread does
> the cache flush and another thread then executes, the earlier ARMs
> without broadcast ops have a potential problem there. In fact, some
> variant of PowerPC 440 have the same problem and some people are
> (ab)using those for SMP setups I'm being told.
> 
> For that case, I see two options. One is a big hammer but would make
> existing code work to "most" extent: Don't allow a page to be both
> writable and executable. Ping-pong the page permission lazily and flush
> when transitioning from write to exec.
> 
> That means using a spare bit for Linux _PAGE_RW separate from your real
> RW bit I suppose, since you have HW loaded PTEs (on 440 it's easier
> since we SW load, we can do the fixup there, though it has a perf impact
> obviously).
> 
> Another option would be to make some syscall mandatory to "sync" caches
> which could then do IPIs or whatever else is needed. But that would
> require changing existing userspace code.

Or you could do first option by default, and add mmap flag that says
that application is responsible for cross-cpu cache flushes...?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 21:49                                         ` Benjamin Herrenschmidt
@ 2010-02-28 23:17                                           ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-28 23:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Matthew Dharm, linux-usb, Russell King - ARM Linux, Mankad,
	Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar, Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel

On Fri, 2010-02-26 at 21:49 +0000, Benjamin Herrenschmidt wrote:
> > > > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > > > cache maintenance operations weren't visible to the other CPUs.
> > >
> > > I'm not even sure that's going to be 100% correct. Don't you also need
> > > to flush the remote icaches when you are dealing with instructions (such
> > > as swap) anyways ?
> >
> > I don't think we tried swap but for pages that have been mapped for the
> > first time, the I-cache would be clean.
> >
> > At mm switching, if a thread
> > migrates to a new CPU we invalidate the cache at that point.
> 
> That sounds fragile. What about a multithread app with one thread on
> each core hitting the pages at the same time ? Sounds racy to me...

Interestingly, until commit 826cbdaff29 (< 2 years ago), we didn't have
any I-cache flushing in update_mmu_cache() and it was working fine. I
added it for correctness reasons rather than to fix something. My theory
is that it was working because a page cache page tends to keep the same
physical address, especially if we don't swap pages, and a 16KB PIPT
cache cannot hold enough lines to show any issues (lines are replaced
frequently).

I suspect that's one of the reasons why only invalidating the whole
I-cache when switching the mm to a new CPU seems to suffice. Once we
enable some form of swapping, it may show the problem.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-28 23:17                                           ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-28 23:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 21:49 +0000, Benjamin Herrenschmidt wrote:
> > > > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > > > cache maintenance operations weren't visible to the other CPUs.
> > >
> > > I'm not even sure that's going to be 100% correct. Don't you also need
> > > to flush the remote icaches when you are dealing with instructions (such
> > > as swap) anyways ?
> >
> > I don't think we tried swap but for pages that have been mapped for the
> > first time, the I-cache would be clean.
> >
> > At mm switching, if a thread
> > migrates to a new CPU we invalidate the cache at that point.
> 
> That sounds fragile. What about a multithread app with one thread on
> each core hitting the pages at the same time ? Sounds racy to me...

Interestingly, until commit 826cbdaff29 (< 2 years ago), we didn't have
any I-cache flushing in update_mmu_cache() and it was working fine. I
added it for correctness reasons rather than to fix something. My theory
is that it was working because a page cache page tends to keep the same
physical address, especially if we don't swap pages, and a 16KB PIPT
cache cannot hold enough lines to show any issues (lines are replaced
frequently).

I suspect that's one of the reasons why only invalidating the whole
I-cache when switching the mm to a new CPU seems to suffice. Once we
enable some form of swapping, it may show the problem.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-26 22:03                                           ` Russell King - ARM Linux
@ 2010-02-28 23:20                                             ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-28 23:20 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Benjamin Herrenschmidt, Matthew Dharm, linux-usb, Mankad,
	Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar, Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel

On Fri, 2010-02-26 at 22:03 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> > It will deadlock if you use normal IRQs. I don't see a good way around
> > that other than using a higher-level type of IRQs. I though ARM has
> > something like that (FIQs ?). Can you use those guys for IPIs ?
[...]
> The other problem we'd encounter using FIQs for IPIs is that some IPIs
> need to take locks - and in order to make that safe, we'd either need
> another class of locks which disable IRQs and FIQs together, or we'd
> need to disable FIQs everywhere we disable IRQs - at which point FIQs
> become utterly pointless.

You could use the FIQ only for the DMA cache maintenance operations and
not as a generic IPI mechanism. But the hardware needs to be modified.


-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-02-28 23:20                                             ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-02-28 23:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 22:03 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> > It will deadlock if you use normal IRQs. I don't see a good way around
> > that other than using a higher-level type of IRQs. I though ARM has
> > something like that (FIQs ?). Can you use those guys for IPIs ?
[...]
> The other problem we'd encounter using FIQs for IPIs is that some IPIs
> need to take locks - and in order to make that safe, we'd either need
> another class of locks which disable IRQs and FIQs together, or we'd
> need to disable FIQs everywhere we disable IRQs - at which point FIQs
> become utterly pointless.

You could use the FIQ only for the DMA cache maintenance operations and
not as a generic IPI mechanism. But the hardware needs to be modified.


-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-28  5:01                                               ` James Bottomley
@ 2010-03-01 10:39                                                 ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-01 10:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Benjamin Herrenschmidt, Russell King - ARM Linux, Matthew Dharm,
	linux-usb, Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei,
	Sebastian Siewior, Oliver Neukum, linux-kernel,
	Shilimkar,Santosh, Pavel Machek, Greg KH, linux-arm-kernel

On Sun, 2010-02-28 at 05:01 +0000, James Bottomley wrote:
> On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > > the HCD driver won't write to it once it has been mapped into user
> > > > space? If that's the case, it may solve the problem by just reversing
> > > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > > has dirty D-cache by default.
> > >
> > > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > > unnecessary D cache flushing when clean pages get mapped into userspace.
> >
> > That's an interesting thought for us too. When doing I$/D$ coherency, we
> > have to fist flush the D$ and then invalidate the I$. If we could keep
> > track of D$ and I$ separately, we could avoid the first step in many
> > cases, including the DMA API trick you mentioned.
> >
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> Sorry to be a bit late to the party (on holiday), but I/D coherency is
> supposed to be taken care of using flush_cache_page in the memory
> mapping routines.  On parisc, at least, we don't use any PG_arch flags
> to help.  The way it's supposed to work is that I is invalidated on
> mapping or remapping, so the I/O code only needs to worry about flushing
> D.  The guarantee we pass to userland is that any page we do I/O to has
> a clean D cache before it goes back to userspace.  Thus if userspace
> executes the page, the I cache gets its first movein there.  There is an
> underlying assumption to all of this:  The CPU won't speculatively move
> in I cache until the page is executed, so we can rely on the
> flush_cache_page in the mapping to keep the I cache invalidated until
> we're ready to execute.  

We cannot guarantee this assumption on ARM. As soon as the page is
accessible and executable, the CPU can fetch into the I-cache
speculatively. Even if the page hasn't been mapped into user-space yet,
we still have the kernel linear mapping via which we can get the same
I-cache lines fetched (PIPT cache).

The only place we can safely invalidate the I-cache is after the D-cache
was flushed (after flush_dcache_page).

On ARM PIPT, flush_cache_page is a no-op.

> The other fundamental assumption is that if
> userspace needs to modify an executable region (say for dynamic linking)
> it has to take care of reinvalidating the I cache itself ... although it
> can do this by remapping the region to alter the flags (i.e W no X then
> X no W).

The ARM dynamic linker remaps the page with no-exec, writes the data and
then remaps it back with exec. The COW code flushes the D-cache. Anyway,
recent dynamic linker no longer touches a code page.
> 
> But the point of all of this is that I cache invalidation doesn't appear
> anywhere in the I/O path ... so  if we're getting I/D incoherency,
> there's some problem in the mm code (or there's a missing arch
> assumption ... like I cache gets moved in more aggressively than we
> expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> there were a serious generic problem here.

On ARM PIPT, it's probably because flush_cache_page isn't implemented.
But as I said above, given the speculative fetches I don't think it
would help much (well, it would work a bit better but not a complete
fix).

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-01 10:39                                                 ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-01 10:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 05:01 +0000, James Bottomley wrote:
> On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > > the HCD driver won't write to it once it has been mapped into user
> > > > space? If that's the case, it may solve the problem by just reversing
> > > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > > has dirty D-cache by default.
> > >
> > > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > > unnecessary D cache flushing when clean pages get mapped into userspace.
> >
> > That's an interesting thought for us too. When doing I$/D$ coherency, we
> > have to fist flush the D$ and then invalidate the I$. If we could keep
> > track of D$ and I$ separately, we could avoid the first step in many
> > cases, including the DMA API trick you mentioned.
> >
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> Sorry to be a bit late to the party (on holiday), but I/D coherency is
> supposed to be taken care of using flush_cache_page in the memory
> mapping routines.  On parisc, at least, we don't use any PG_arch flags
> to help.  The way it's supposed to work is that I is invalidated on
> mapping or remapping, so the I/O code only needs to worry about flushing
> D.  The guarantee we pass to userland is that any page we do I/O to has
> a clean D cache before it goes back to userspace.  Thus if userspace
> executes the page, the I cache gets its first movein there.  There is an
> underlying assumption to all of this:  The CPU won't speculatively move
> in I cache until the page is executed, so we can rely on the
> flush_cache_page in the mapping to keep the I cache invalidated until
> we're ready to execute.  

We cannot guarantee this assumption on ARM. As soon as the page is
accessible and executable, the CPU can fetch into the I-cache
speculatively. Even if the page hasn't been mapped into user-space yet,
we still have the kernel linear mapping via which we can get the same
I-cache lines fetched (PIPT cache).

The only place we can safely invalidate the I-cache is after the D-cache
was flushed (after flush_dcache_page).

On ARM PIPT, flush_cache_page is a no-op.

> The other fundamental assumption is that if
> userspace needs to modify an executable region (say for dynamic linking)
> it has to take care of reinvalidating the I cache itself ... although it
> can do this by remapping the region to alter the flags (i.e W no X then
> X no W).

The ARM dynamic linker remaps the page with no-exec, writes the data and
then remaps it back with exec. The COW code flushes the D-cache. Anyway,
recent dynamic linker no longer touches a code page.
> 
> But the point of all of this is that I cache invalidation doesn't appear
> anywhere in the I/O path ... so  if we're getting I/D incoherency,
> there's some problem in the mm code (or there's a missing arch
> assumption ... like I cache gets moved in more aggressively than we
> expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> there were a serious generic problem here.

On ARM PIPT, it's probably because flush_cache_page isn't implemented.
But as I said above, given the speculative fetches I don't think it
would help much (well, it would work a bit better but not a complete
fix).

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-28  0:14                                             ` Benjamin Herrenschmidt
@ 2010-03-01 10:42                                               ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-01 10:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Matthew Dharm, linux-usb,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Sun, 2010-02-28 at 00:14 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > the HCD driver won't write to it once it has been mapped into user
> > > space? If that's the case, it may solve the problem by just reversing
> > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > has dirty D-cache by default.
> >
> > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > unnecessary D cache flushing when clean pages get mapped into userspace.

That sounds good to me.

> That's an interesting thought for us too. When doing I$/D$ coherency, we
> have to fist flush the D$ and then invalidate the I$. If we could keep
> track of D$ and I$ separately, we could avoid the first step in many
> cases, including the DMA API trick you mentioned.
> 
> I wonder if it's time to get a PG_arch_2 :-)

As an optimisation, I think this would help (rather than always
invalidating the I-cache in update_mmu_cache or set_pte_at).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-01 10:42                                               ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-01 10:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 00:14 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > the HCD driver won't write to it once it has been mapped into user
> > > space? If that's the case, it may solve the problem by just reversing
> > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > has dirty D-cache by default.
> >
> > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > unnecessary D cache flushing when clean pages get mapped into userspace.

That sounds good to me.

> That's an interesting thought for us too. When doing I$/D$ coherency, we
> have to fist flush the D$ and then invalidate the I$. If we could keep
> track of D$ and I$ separately, we could avoid the first step in many
> cases, including the DMA API trick you mentioned.
> 
> I wonder if it's time to get a PG_arch_2 :-)

As an optimisation, I think this would help (rather than always
invalidating the I-cache in update_mmu_cache or set_pte_at).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-01 10:39                                                 ` Catalin Marinas
@ 2010-03-01 11:06                                                   ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-01 11:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: James Bottomley, Matthew Dharm, Oliver Neukum, Greg KH, Mankad,
	Maulik Ojas, Sergei Shtylyov, Benjamin Herrenschmidt,
	Sebastian Siewior, linux-usb, linux-kernel, Shilimkar, Santosh,
	Pavel Machek, Ming Lei, linux-arm-kernel

On Mon, Mar 01, 2010 at 10:39:14AM +0000, Catalin Marinas wrote:
> On Sun, 2010-02-28 at 05:01 +0000, James Bottomley wrote:
> > But the point of all of this is that I cache invalidation doesn't appear
> > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > there's some problem in the mm code (or there's a missing arch
> > assumption ... like I cache gets moved in more aggressively than we
> > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > there were a serious generic problem here.
> 
> On ARM PIPT, it's probably because flush_cache_page isn't implemented.
> But as I said above, given the speculative fetches I don't think it
> would help much (well, it would work a bit better but not a complete
> fix).

Not quite.  flush_cache_page() is called when we unmap or replace a page
in userspace, which is completely the wrong place to do I-cache coherency
when you have speculatively loaded caches - or even D-cache coherency if
your cache behaves as a speculatively loaded PIPT or non-aliasing VIPT.

Flushing the I-cache after a page has been in userspace does nothing to
ensure that there aren't any I-cache lines associated with that page
when you next come to map it into userspace.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-01 11:06                                                   ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-01 11:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 01, 2010 at 10:39:14AM +0000, Catalin Marinas wrote:
> On Sun, 2010-02-28 at 05:01 +0000, James Bottomley wrote:
> > But the point of all of this is that I cache invalidation doesn't appear
> > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > there's some problem in the mm code (or there's a missing arch
> > assumption ... like I cache gets moved in more aggressively than we
> > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > there were a serious generic problem here.
> 
> On ARM PIPT, it's probably because flush_cache_page isn't implemented.
> But as I said above, given the speculative fetches I don't think it
> would help much (well, it would work a bit better but not a complete
> fix).

Not quite.  flush_cache_page() is called when we unmap or replace a page
in userspace, which is completely the wrong place to do I-cache coherency
when you have speculatively loaded caches - or even D-cache coherency if
your cache behaves as a speculatively loaded PIPT or non-aliasing VIPT.

Flushing the I-cache after a page has been in userspace does nothing to
ensure that there aren't any I-cache lines associated with that page
when you next come to map it into userspace.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-28  0:24                                               ` Benjamin Herrenschmidt
@ 2010-03-01 11:10                                                 ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-01 11:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Matthew Dharm, linux-usb,
	Mankad,Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	Oliver Neukum, linux-kernel, Shilimkar,Santosh, Pavel Machek,
	Greg KH, linux-arm-kernel, James Bottomley

On Sun, 2010-02-28 at 00:24 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:49 +0000, Russell King - ARM Linux wrote:
> > On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> > > Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> > > powerpc.. I'm afraid that whole cache handling stuff is totally
> > > inconsistent since different archs have different expectations here.
> >
> > It doesn't on ARM either.
> 
> Ok, pfiew :-)
> 
> So far, my understanding with I$/D$ is that we only care in a few cases
> which is executing of an mmap'ed piece of executable that is -not- being
> written to, and swap.
> 
> I -think- that in both cases, the page cache always pops up a new page
> with PG_arch_1 clear before the driver gets to either DMA or PIO to it
> when faulted the first time around, before any PTE is inserted.

That's my understanding too.

> So the current approach on powerpc with I$/D$ should work fine, and it
> -might- make sense to use a similar one on PIPT ARM, provided we don't
> have expectations of the I$/D$ coherency being maintained on
> -subsequent- writes (PIO or DMA either) to such a page by the same
> program transparently by the kernel.

Are these subsequent writes likely to happen?

> There's two potential problems with the approach, and maybe more that I
> have missed though. One is the case of a networked filesystem where the
> executable pages are modified remotely. However, I would expect such a
> program to invalidate the PTE mappings before making the change visible,
> so we -do- get a chance to re-flush provided something clears PG_arch_1.

I think the NFS code in Linux calls flush_dcache_page(). This function
can check whether the page is already mapped and do the cache flushing
rather than deferring it to set_pte_at().

> Then, there's In the case of a multithread app, where one thread does
> the cache flush and another thread then executes, the earlier ARMs
> without broadcast ops have a potential problem there. In fact, some
> variant of PowerPC 440 have the same problem and some people are
> (ab)using those for SMP setups I'm being told.

Yes. That could be solved at set_pte_at() level using IPIs.

> For that case, I see two options. One is a big hammer but would make
> existing code work to "most" extent: Don't allow a page to be both
> writable and executable. Ping-pong the page permission lazily and flush
> when transitioning from write to exec.

Are you referring to the SMP and non-broadcasting cache maintenance
issue? The same pte could be shared between multiple CPUs, so once you
make it executable on one it becomes executable on the others.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-01 11:10                                                 ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-01 11:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 00:24 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:49 +0000, Russell King - ARM Linux wrote:
> > On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> > > Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> > > powerpc.. I'm afraid that whole cache handling stuff is totally
> > > inconsistent since different archs have different expectations here.
> >
> > It doesn't on ARM either.
> 
> Ok, pfiew :-)
> 
> So far, my understanding with I$/D$ is that we only care in a few cases
> which is executing of an mmap'ed piece of executable that is -not- being
> written to, and swap.
> 
> I -think- that in both cases, the page cache always pops up a new page
> with PG_arch_1 clear before the driver gets to either DMA or PIO to it
> when faulted the first time around, before any PTE is inserted.

That's my understanding too.

> So the current approach on powerpc with I$/D$ should work fine, and it
> -might- make sense to use a similar one on PIPT ARM, provided we don't
> have expectations of the I$/D$ coherency being maintained on
> -subsequent- writes (PIO or DMA either) to such a page by the same
> program transparently by the kernel.

Are these subsequent writes likely to happen?

> There's two potential problems with the approach, and maybe more that I
> have missed though. One is the case of a networked filesystem where the
> executable pages are modified remotely. However, I would expect such a
> program to invalidate the PTE mappings before making the change visible,
> so we -do- get a chance to re-flush provided something clears PG_arch_1.

I think the NFS code in Linux calls flush_dcache_page(). This function
can check whether the page is already mapped and do the cache flushing
rather than deferring it to set_pte_at().

> Then, there's In the case of a multithread app, where one thread does
> the cache flush and another thread then executes, the earlier ARMs
> without broadcast ops have a potential problem there. In fact, some
> variant of PowerPC 440 have the same problem and some people are
> (ab)using those for SMP setups I'm being told.

Yes. That could be solved at set_pte_at() level using IPIs.

> For that case, I see two options. One is a big hammer but would make
> existing code work to "most" extent: Don't allow a page to be both
> writable and executable. Ping-pong the page permission lazily and flush
> when transitioning from write to exec.

Are you referring to the SMP and non-broadcasting cache maintenance
issue? The same pte could be shared between multiple CPUs, so once you
make it executable on one it becomes executable on the others.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-01 11:10                                                 ` Catalin Marinas
@ 2010-03-02  4:11                                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02  4:11 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Matthew Dharm, Oliver Neukum, Russell King - ARM Linux, Mankad,
	Maulik Ojas, Sergei Shtylyov, Ming Lei, Sebastian Siewior,
	linux-usb, linux-kernel, James Bottomley, Shilimkar, Santosh,
	Pavel Machek, Greg KH, linux-arm-kernel

On Mon, 2010-03-01 at 11:10 +0000, Catalin Marinas wrote:
> 
> 
> Yes. That could be solved at set_pte_at() level using IPIs.

Well, set_pte_at() itself is called with the PTE lock held, so you have
to be careful with IPIs at that point. You need the flush to happen
-before- the PTE is visible and you cannot synchronously send an IPI.

> > For that case, I see two options. One is a big hammer but would make
> > existing code work to "most" extent: Don't allow a page to be both
> > writable and executable. Ping-pong the page permission lazily and
> flush
> > when transitioning from write to exec.
> 
> Are you referring to the SMP and non-broadcasting cache maintenance
> issue? The same pte could be shared between multiple CPUs, so once you
> make it executable on one it becomes executable on the others.

Right, you would have to play the ping-pong trick globally. That's what
I do on ppc 440 for bluegene though that code isn't upstream.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-02  4:11                                                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02  4:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2010-03-01 at 11:10 +0000, Catalin Marinas wrote:
> 
> 
> Yes. That could be solved at set_pte_at() level using IPIs.

Well, set_pte_at() itself is called with the PTE lock held, so you have
to be careful with IPIs at that point. You need the flush to happen
-before- the PTE is visible and you cannot synchronously send an IPI.

> > For that case, I see two options. One is a big hammer but would make
> > existing code work to "most" extent: Don't allow a page to be both
> > writable and executable. Ping-pong the page permission lazily and
> flush
> > when transitioning from write to exec.
> 
> Are you referring to the SMP and non-broadcasting cache maintenance
> issue? The same pte could be shared between multiple CPUs, so once you
> make it executable on one it becomes executable on the others.

Right, you would have to play the ping-pong trick globally. That's what
I do on ppc 440 for bluegene though that code isn't upstream.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-28  5:01                                               ` James Bottomley
@ 2010-03-02 12:11                                                 ` FUJITA Tomonori
  -1 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-02 12:11 UTC (permalink / raw)
  To: James.Bottomley
  Cc: benh, linux, catalin.marinas, mdharm-kernel, linux-usb, x0082077,
	sshtylyov, tom.leiming, bigeasy, oliver, linux-kernel,
	santosh.shilimkar, pavel, greg, linux-arm-kernel

On Sun, 28 Feb 2010 10:31:03 +0530
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > > the HCD driver won't write to it once it has been mapped into user
> > > > space? If that's the case, it may solve the problem by just reversing
> > > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > > has dirty D-cache by default.
> > > 
> > > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > > unnecessary D cache flushing when clean pages get mapped into userspace.
> > 
> > That's an interesting thought for us too. When doing I$/D$ coherency, we
> > have to fist flush the D$ and then invalidate the I$. If we could keep
> > track of D$ and I$ separately, we could avoid the first step in many
> > cases, including the DMA API trick you mentioned.
> > 
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> Sorry to be a bit late to the party (on holiday), but I/D coherency is
> supposed to be taken care of using flush_cache_page in the memory
> mapping routines.

powerpc does that? To be exact, powerpc doesn't need
flush_cache_page() and handles I/D coherency in the pte modification
code. powerpc uses PG_arch_1 to avoid unnecessarily handling I/D
coherency. Seems that IA64 does the same trick with PG_arch_1.


> But the point of all of this is that I cache invalidation doesn't appear
> anywhere in the I/O path ... so  if we're getting I/D incoherency,
> there's some problem in the mm code (or there's a missing arch
> assumption ... like I cache gets moved in more aggressively than we
> expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> there were a serious generic problem here.

I'm not sure that there are some problems in the mm or common code. Is
this ARM's implementation issue? (Of course, the usb stack and the
driver's misuse of the DMA API needs to be fixed too).

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-02 12:11                                                 ` FUJITA Tomonori
  0 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-02 12:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 28 Feb 2010 10:31:03 +0530
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > > the HCD driver won't write to it once it has been mapped into user
> > > > space? If that's the case, it may solve the problem by just reversing
> > > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > > has dirty D-cache by default.
> > > 
> > > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > > unnecessary D cache flushing when clean pages get mapped into userspace.
> > 
> > That's an interesting thought for us too. When doing I$/D$ coherency, we
> > have to fist flush the D$ and then invalidate the I$. If we could keep
> > track of D$ and I$ separately, we could avoid the first step in many
> > cases, including the DMA API trick you mentioned.
> > 
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> Sorry to be a bit late to the party (on holiday), but I/D coherency is
> supposed to be taken care of using flush_cache_page in the memory
> mapping routines.

powerpc does that? To be exact, powerpc doesn't need
flush_cache_page() and handles I/D coherency in the pte modification
code. powerpc uses PG_arch_1 to avoid unnecessarily handling I/D
coherency. Seems that IA64 does the same trick with PG_arch_1.


> But the point of all of this is that I cache invalidation doesn't appear
> anywhere in the I/O path ... so  if we're getting I/D incoherency,
> there's some problem in the mm code (or there's a missing arch
> assumption ... like I cache gets moved in more aggressively than we
> expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> there were a serious generic problem here.

I'm not sure that there are some problems in the mm or common code. Is
this ARM's implementation issue? (Of course, the usb stack and the
driver's misuse of the DMA API needs to be fixed too).

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 12:11                                                 ` FUJITA Tomonori
@ 2010-03-02 17:05                                                   ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-02 17:05 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: James.Bottomley, benh, linux, mdharm-kernel, linux-usb, x0082077,
	sshtylyov, tom.leiming, bigeasy, oliver, linux-kernel,
	santosh.shilimkar, pavel, greg, linux-arm-kernel

On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> On Sun, 28 Feb 2010 10:31:03 +0530
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > But the point of all of this is that I cache invalidation doesn't appear
> > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > there's some problem in the mm code (or there's a missing arch
> > assumption ... like I cache gets moved in more aggressively than we
> > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > there were a serious generic problem here.
> 
> I'm not sure that there are some problems in the mm or common code. Is
> this ARM's implementation issue? (Of course, the usb stack and the
> driver's misuse of the DMA API needs to be fixed too).

Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
invalidation for user pages in update_mmu_cache() (it could actually be
in set_pte_at on SMP to avoid a race but that's for another thread). The
D-cache is flushed by this function only if the PG_arch_1 bit is set.
This bit is set in the ARM case by flush_dcache_page(), following the
advice in Documentation/cachetlb.txt.

With some drivers (those doing PIO) or subsystems (SCSI mass storage
over USB HCD), there is no call to flush_dcache_page() for page cache
pages, hence the ARM implementation of update_mmu_cache() doesn't flush
the D-cache (and only invalidating the I-cache doesn't help).

The viable solutions so far:

     1. Implement a PIO mapping API similar to the DMA API which takes
        care of the D-cache flushing. This means that PIO drivers would
        need to be modified to use an API like pio_kmap()/pio_kunmap()
        before writing to a page cache page.
     2. Invert the meaning of PG_arch_1 to denote a clean page. This
        means that by default newly allocated page cache pages are
        considered dirty and even if there isn't a call to
        flush_dcache_page(), update_mmu_cache() would flush the D-cache.
        This is the PowerPC approach.

Option 2 above looks pretty appealing to me since it can be done in the
ARM code exclusively. I've done some tests and it indeed solves the
cache coherency with a rootfs on a USB stick. As Russell suggested, it
can be optimised to mark a page as clean when the DMA API is involved to
avoid duplicate flushing.

It was also suggested to add a PG_arch_2 flag which would keep track of
the I-cache status as well.

I can post a proposal to modify the cachetlb.txt document to reflect the
issues we currently have on ARM.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-02 17:05                                                   ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-02 17:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> On Sun, 28 Feb 2010 10:31:03 +0530
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > But the point of all of this is that I cache invalidation doesn't appear
> > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > there's some problem in the mm code (or there's a missing arch
> > assumption ... like I cache gets moved in more aggressively than we
> > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > there were a serious generic problem here.
> 
> I'm not sure that there are some problems in the mm or common code. Is
> this ARM's implementation issue? (Of course, the usb stack and the
> driver's misuse of the DMA API needs to be fixed too).

Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
invalidation for user pages in update_mmu_cache() (it could actually be
in set_pte_at on SMP to avoid a race but that's for another thread). The
D-cache is flushed by this function only if the PG_arch_1 bit is set.
This bit is set in the ARM case by flush_dcache_page(), following the
advice in Documentation/cachetlb.txt.

With some drivers (those doing PIO) or subsystems (SCSI mass storage
over USB HCD), there is no call to flush_dcache_page() for page cache
pages, hence the ARM implementation of update_mmu_cache() doesn't flush
the D-cache (and only invalidating the I-cache doesn't help).

The viable solutions so far:

     1. Implement a PIO mapping API similar to the DMA API which takes
        care of the D-cache flushing. This means that PIO drivers would
        need to be modified to use an API like pio_kmap()/pio_kunmap()
        before writing to a page cache page.
     2. Invert the meaning of PG_arch_1 to denote a clean page. This
        means that by default newly allocated page cache pages are
        considered dirty and even if there isn't a call to
        flush_dcache_page(), update_mmu_cache() would flush the D-cache.
        This is the PowerPC approach.

Option 2 above looks pretty appealing to me since it can be done in the
ARM code exclusively. I've done some tests and it indeed solves the
cache coherency with a rootfs on a USB stick. As Russell suggested, it
can be optimised to mark a page as clean when the DMA API is involved to
avoid duplicate flushing.

It was also suggested to add a PG_arch_2 flag which would keep track of
the I-cache status as well.

I can post a proposal to modify the cachetlb.txt document to reflect the
issues we currently have on ARM.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 17:05                                                   ` Catalin Marinas
@ 2010-03-02 17:47                                                     ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-02 17:47 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: mdharm-kernel, oliver, linux, greg, x0082077, sshtylyov, benh,
	bigeasy, linux-usb, linux-kernel, James.Bottomley,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> > On Sun, 28 Feb 2010 10:31:03 +0530
> > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > But the point of all of this is that I cache invalidation doesn't appear
> > > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > > there's some problem in the mm code (or there's a missing arch
> > > assumption ... like I cache gets moved in more aggressively than we
> > > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > > there were a serious generic problem here.
> >
> > I'm not sure that there are some problems in the mm or common code. Is
> > this ARM's implementation issue? (Of course, the usb stack and the
> > driver's misuse of the DMA API needs to be fixed too).
> 
> Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
> invalidation for user pages in update_mmu_cache() (it could actually be
> in set_pte_at on SMP to avoid a race but that's for another thread). The
> D-cache is flushed by this function only if the PG_arch_1 bit is set.
> This bit is set in the ARM case by flush_dcache_page(), following the
> advice in Documentation/cachetlb.txt.
> 
> With some drivers (those doing PIO) or subsystems (SCSI mass storage
> over USB HCD), there is no call to flush_dcache_page() for page cache
> pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> the D-cache (and only invalidating the I-cache doesn't help).
> 
> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.
> 
> Option 2 above looks pretty appealing to me since it can be done in the
> ARM code exclusively. I've done some tests and it indeed solves the
> cache coherency with a rootfs on a USB stick. As Russell suggested, it
> can be optimised to mark a page as clean when the DMA API is involved to
> avoid duplicate flushing.

Actually, option 2 still has an issue - does not easily work on SMP
systems where cache maintenance operations aren't broadcast in hardware.
In this case (ARM11MPCore), flush_dcache_page() is implemented
non-lazily so that the flushing happens on the same processor that
dirtied the cache. But since with some drivers there is no call to this
function, it wouldn't make any difference.

A solution is to do something like read-for-ownership before flushing
the D-cache in update_mmu_cache() (or set_pte_at()).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-02 17:47                                                     ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-02 17:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> > On Sun, 28 Feb 2010 10:31:03 +0530
> > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > But the point of all of this is that I cache invalidation doesn't appear
> > > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > > there's some problem in the mm code (or there's a missing arch
> > > assumption ... like I cache gets moved in more aggressively than we
> > > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > > there were a serious generic problem here.
> >
> > I'm not sure that there are some problems in the mm or common code. Is
> > this ARM's implementation issue? (Of course, the usb stack and the
> > driver's misuse of the DMA API needs to be fixed too).
> 
> Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
> invalidation for user pages in update_mmu_cache() (it could actually be
> in set_pte_at on SMP to avoid a race but that's for another thread). The
> D-cache is flushed by this function only if the PG_arch_1 bit is set.
> This bit is set in the ARM case by flush_dcache_page(), following the
> advice in Documentation/cachetlb.txt.
> 
> With some drivers (those doing PIO) or subsystems (SCSI mass storage
> over USB HCD), there is no call to flush_dcache_page() for page cache
> pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> the D-cache (and only invalidating the I-cache doesn't help).
> 
> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.
> 
> Option 2 above looks pretty appealing to me since it can be done in the
> ARM code exclusively. I've done some tests and it indeed solves the
> cache coherency with a rootfs on a USB stick. As Russell suggested, it
> can be optimised to mark a page as clean when the DMA API is involved to
> avoid duplicate flushing.

Actually, option 2 still has an issue - does not easily work on SMP
systems where cache maintenance operations aren't broadcast in hardware.
In this case (ARM11MPCore), flush_dcache_page() is implemented
non-lazily so that the flushing happens on the same processor that
dirtied the cache. But since with some drivers there is no call to this
function, it wouldn't make any difference.

A solution is to do something like read-for-ownership before flushing
the D-cache in update_mmu_cache() (or set_pte_at()).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 12:11                                                 ` FUJITA Tomonori
@ 2010-03-02 23:26                                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:26 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: James.Bottomley, linux, catalin.marinas, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, pavel, greg, linux-arm-kernel

On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> 
> > Sorry to be a bit late to the party (on holiday), but I/D coherency
> is
> > supposed to be taken care of using flush_cache_page in the memory
> > mapping routines.
> 
> powerpc does that? To be exact, powerpc doesn't need
> flush_cache_page() and handles I/D coherency in the pte modification
> code. powerpc uses PG_arch_1 to avoid unnecessarily handling I/D
> coherency. Seems that IA64 does the same trick with PG_arch_1.

Right. We set PG_arch_1 to avoid doing it again of a given physical
page. We assume that it's always cleared when a page is recycled by the
page cache and we also clear it in flush_dcache_page() though the need
for that later thing is dubious...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-02 23:26                                                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> 
> > Sorry to be a bit late to the party (on holiday), but I/D coherency
> is
> > supposed to be taken care of using flush_cache_page in the memory
> > mapping routines.
> 
> powerpc does that? To be exact, powerpc doesn't need
> flush_cache_page() and handles I/D coherency in the pte modification
> code. powerpc uses PG_arch_1 to avoid unnecessarily handling I/D
> coherency. Seems that IA64 does the same trick with PG_arch_1.

Right. We set PG_arch_1 to avoid doing it again of a given physical
page. We assume that it's always cleared when a page is recycled by the
page cache and we also clear it in flush_dcache_page() though the need
for that later thing is dubious...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 17:05                                                   ` Catalin Marinas
@ 2010-03-02 23:29                                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:29 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: FUJITA Tomonori, mdharm-kernel, oliver, linux, greg, x0082077,
	sshtylyov, bigeasy, linux-usb, linux-kernel, James.Bottomley,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:

> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.

I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note
that flushing the D-cache isn't enough, you also need to invalidate the
I-cache as we discussed earlier, though you mostly get away if you don't
by luck.

There's also a question as to whether clearing PG_arch_1 is
flush_dcache_page() is really necessary or not.

> Option 2 above looks pretty appealing to me since it can be done in the
> ARM code exclusively. I've done some tests and it indeed solves the
> cache coherency with a rootfs on a USB stick. As Russell suggested, it
> can be optimised to mark a page as clean when the DMA API is involved to
> avoid duplicate flushing.

That wouldn't solve the need for invalidating the I-cache... Unless we
use another bit.

> It was also suggested to add a PG_arch_2 flag which would keep track of
> the I-cache status as well.
> 
> I can post a proposal to modify the cachetlb.txt document to reflect the
> issues we currently have on ARM.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-02 23:29                                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:

> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.

I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note
that flushing the D-cache isn't enough, you also need to invalidate the
I-cache as we discussed earlier, though you mostly get away if you don't
by luck.

There's also a question as to whether clearing PG_arch_1 is
flush_dcache_page() is really necessary or not.

> Option 2 above looks pretty appealing to me since it can be done in the
> ARM code exclusively. I've done some tests and it indeed solves the
> cache coherency with a rootfs on a USB stick. As Russell suggested, it
> can be optimised to mark a page as clean when the DMA API is involved to
> avoid duplicate flushing.

That wouldn't solve the need for invalidating the I-cache... Unless we
use another bit.

> It was also suggested to add a PG_arch_2 flag which would keep track of
> the I-cache status as well.
> 
> I can post a proposal to modify the cachetlb.txt document to reflect the
> issues we currently have on ARM.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 17:47                                                     ` Catalin Marinas
@ 2010-03-02 23:33                                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:33 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: FUJITA Tomonori, mdharm-kernel, linux-usb, linux, tom.leiming,
	x0082077, sshtylyov, greg, bigeasy, oliver, linux-kernel,
	James.Bottomley, santosh.shilimkar, pavel, linux-arm-kernel

On Tue, 2010-03-02 at 17:47 +0000, Catalin Marinas wrote:
> 
> Actually, option 2 still has an issue - does not easily work on SMP
> systems where cache maintenance operations aren't broadcast in hardware.
> In this case (ARM11MPCore), flush_dcache_page() is implemented
> non-lazily so that the flushing happens on the same processor that
> dirtied the cache. But since with some drivers there is no call to this
> function, it wouldn't make any difference.

Also, option 1 would not solve the icache issue which has the same
problem related to IPIs. You -really- need to spank some HW folks
here :-)

> A solution is to do something like read-for-ownership before flushing
> the D-cache in update_mmu_cache() (or set_pte_at()). 

You might also want to experiment with not clearing PG_arch_1 in
flush_dcache_page(). I'm not 100% convinced it is necessary and that may
reduce the amount of flushing needed.

Another thing is, on powerpc, we only do the cleaning when we try to
execute from the pages. IE. We basically "filter out" exec permission
when pages are not clean. At least on processors that support per-page
exec permission. You may want to consider something like that as well.

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-02 23:33                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 17:47 +0000, Catalin Marinas wrote:
> 
> Actually, option 2 still has an issue - does not easily work on SMP
> systems where cache maintenance operations aren't broadcast in hardware.
> In this case (ARM11MPCore), flush_dcache_page() is implemented
> non-lazily so that the flushing happens on the same processor that
> dirtied the cache. But since with some drivers there is no call to this
> function, it wouldn't make any difference.

Also, option 1 would not solve the icache issue which has the same
problem related to IPIs. You -really- need to spank some HW folks
here :-)

> A solution is to do something like read-for-ownership before flushing
> the D-cache in update_mmu_cache() (or set_pte_at()). 

You might also want to experiment with not clearing PG_arch_1 in
flush_dcache_page(). I'm not 100% convinced it is necessary and that may
reduce the amount of flushing needed.

Another thing is, on powerpc, we only do the cleaning when we try to
execute from the pages. IE. We basically "filter out" exec permission
when pages are not clean. At least on processors that support per-page
exec permission. You may want to consider something like that as well.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 23:29                                                     ` Benjamin Herrenschmidt
@ 2010-03-03  3:47                                                       ` FUJITA Tomonori
  -1 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-03  3:47 UTC (permalink / raw)
  To: benh
  Cc: catalin.marinas, fujita.tomonori, mdharm-kernel, oliver, linux,
	greg, x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel,
	James.Bottomley, santosh.shilimkar, pavel, tom.leiming,
	linux-arm-kernel

On Wed, 03 Mar 2010 10:29:54 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> 
> > The viable solutions so far:
> > 
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
> 
> I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note

Yeah, as powerpc and ia64 do, arm can flush D cache and invalidate I
cache when inserting a executable page to pte, IIUC. No need for the
new API for I/D consistency.

The ways to improve the approach (introducing PG_arch_2 or marking a
page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
to architectures.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03  3:47                                                       ` FUJITA Tomonori
  0 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-03  3:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 03 Mar 2010 10:29:54 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> 
> > The viable solutions so far:
> > 
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
> 
> I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note

Yeah, as powerpc and ia64 do, arm can flush D cache and invalidate I
cache when inserting a executable page to pte, IIUC. No need for the
new API for I/D consistency.

The ways to improve the approach (introducing PG_arch_2 or marking a
page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
to architectures.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03  3:47                                                       ` FUJITA Tomonori
@ 2010-03-03  5:10                                                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-03  5:10 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: catalin.marinas, mdharm-kernel, oliver, linux, greg, x0082077,
	sshtylyov, bigeasy, linux-usb, linux-kernel, James.Bottomley,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> The ways to improve the approach (introducing PG_arch_2 or marking a
> page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> to architectures. 

How does the above work ? IE, the dma unmap will flush the D side but
not the I side ... or is the ia64 flush primitive magic enough to do
both ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03  5:10                                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-03  5:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> The ways to improve the approach (introducing PG_arch_2 or marking a
> page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> to architectures. 

How does the above work ? IE, the dma unmap will flush the D side but
not the I side ... or is the ia64 flush primitive magic enough to do
both ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03  5:10                                                         ` Benjamin Herrenschmidt
@ 2010-03-03  5:40                                                           ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-03  5:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: FUJITA Tomonori, catalin.marinas, mdharm-kernel, oliver, linux,
	greg, x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > The ways to improve the approach (introducing PG_arch_2 or marking a
> > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > to architectures. 
> 
> How does the above work ? IE, the dma unmap will flush the D side but
> not the I side ... or is the ia64 flush primitive magic enough to do
> both ?

The point is that in a well regulated system, the I cache shouldn't need
extra flushing in the kernel.  We should only be faulting in R-X pages.
If we're operating on RWX pages (i.e. self modifying code), it's the job
of userspace to keep I/D coherency.

So the only case the kernel needs to worry about is the R-X fault case
for executable text code.

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03  5:40                                                           ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-03  5:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > The ways to improve the approach (introducing PG_arch_2 or marking a
> > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > to architectures. 
> 
> How does the above work ? IE, the dma unmap will flush the D side but
> not the I side ... or is the ia64 flush primitive magic enough to do
> both ?

The point is that in a well regulated system, the I cache shouldn't need
extra flushing in the kernel.  We should only be faulting in R-X pages.
If we're operating on RWX pages (i.e. self modifying code), it's the job
of userspace to keep I/D coherency.

So the only case the kernel needs to worry about is the R-X fault case
for executable text code.

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03  5:10                                                         ` Benjamin Herrenschmidt
@ 2010-03-03  6:35                                                           ` FUJITA Tomonori
  -1 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-03  6:35 UTC (permalink / raw)
  To: benh
  Cc: fujita.tomonori, catalin.marinas, mdharm-kernel, oliver, linux,
	greg, x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel,
	James.Bottomley, santosh.shilimkar, pavel, tom.leiming,
	linux-arm-kernel

On Wed, 03 Mar 2010 16:10:32 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > The ways to improve the approach (introducing PG_arch_2 or marking a
> > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > to architectures. 
> 
> How does the above work ? IE, the dma unmap will flush the D side but
> not the I side ... or is the ia64 flush primitive magic enough to do
> both ?

On ia64 platform, I (and D) cache is coherent with the memory that you
did DMA to, I think. But better to ask an ia64 guru. :)

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03  6:35                                                           ` FUJITA Tomonori
  0 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-03  6:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 03 Mar 2010 16:10:32 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > The ways to improve the approach (introducing PG_arch_2 or marking a
> > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > to architectures. 
> 
> How does the above work ? IE, the dma unmap will flush the D side but
> not the I side ... or is the ia64 flush primitive magic enough to do
> both ?

On ia64 platform, I (and D) cache is coherent with the memory that you
did DMA to, I think. But better to ask an ia64 guru. :)

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03  5:40                                                           ` James Bottomley
@ 2010-03-03  9:36                                                             ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-03  9:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Benjamin Herrenschmidt, FUJITA Tomonori, catalin.marinas,
	mdharm-kernel, oliver, greg, x0082077, sshtylyov, bigeasy,
	linux-usb, linux-kernel, santosh.shilimkar, pavel, tom.leiming,
	linux-arm-kernel

On Wed, Mar 03, 2010 at 11:10:09AM +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > to architectures. 
> > 
> > How does the above work ? IE, the dma unmap will flush the D side but
> > not the I side ... or is the ia64 flush primitive magic enough to do
> > both ?
> 
> The point is that in a well regulated system, the I cache shouldn't need
> extra flushing in the kernel.  We should only be faulting in R-X pages.

James, that's a pipedream.  If you have a processor which doesn't support
NX, then the kernel marks all regions executable, even if the app only
asks for RW protection.

You end up with the protection masks always having VM_EXEC set in them,
so there's no way to distinguish from the kernel POV which pages are
going to be executed and those which aren't.

And if you can't do that, you have to _always_ flush the I cache for
every page fault, because you don't know if the I cache is out of sync
with the page that you've just read in from disk - and therefore you
may end up executing bad code instead of the glibc text that was
intended.

So here's the question: in a system where the responsibility for I-cache
flushing is in userspace, how do you ensure that you can execute code
in userspace to do this I-cache flushing without first having flushed
the (speculatively prefetching) I-cache?

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03  9:36                                                             ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-03  9:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 03, 2010 at 11:10:09AM +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > to architectures. 
> > 
> > How does the above work ? IE, the dma unmap will flush the D side but
> > not the I side ... or is the ia64 flush primitive magic enough to do
> > both ?
> 
> The point is that in a well regulated system, the I cache shouldn't need
> extra flushing in the kernel.  We should only be faulting in R-X pages.

James, that's a pipedream.  If you have a processor which doesn't support
NX, then the kernel marks all regions executable, even if the app only
asks for RW protection.

You end up with the protection masks always having VM_EXEC set in them,
so there's no way to distinguish from the kernel POV which pages are
going to be executed and those which aren't.

And if you can't do that, you have to _always_ flush the I cache for
every page fault, because you don't know if the I cache is out of sync
with the page that you've just read in from disk - and therefore you
may end up executing bad code instead of the glibc text that was
intended.

So here's the question: in a system where the responsibility for I-cache
flushing is in userspace, how do you ensure that you can execute code
in userspace to do this I-cache flushing without first having flushed
the (speculatively prefetching) I-cache?

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 23:33                                                       ` Benjamin Herrenschmidt
@ 2010-03-03 10:21                                                         ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: FUJITA Tomonori, mdharm-kernel, linux-usb, linux, tom.leiming,
	x0082077, sshtylyov, greg, bigeasy, oliver, linux-kernel,
	James.Bottomley, santosh.shilimkar, pavel, linux-arm-kernel

On Tue, 2010-03-02 at 23:33 +0000, Benjamin Herrenschmidt wrote:
> On Tue, 2010-03-02 at 17:47 +0000, Catalin Marinas wrote:
> >
> > Actually, option 2 still has an issue - does not easily work on SMP
> > systems where cache maintenance operations aren't broadcast in hardware.
> > In this case (ARM11MPCore), flush_dcache_page() is implemented
> > non-lazily so that the flushing happens on the same processor that
> > dirtied the cache. But since with some drivers there is no call to this
> > function, it wouldn't make any difference.
> 
> Also, option 1 would not solve the icache issue which has the same
> problem related to IPIs. 

Correct. But that's true for both options.

It would have been simpler if we had software TLBs.

> You -really- need to spank some HW folks here :-)

I think they got the message :). Cortex-A9 does it properly.

> > A solution is to do something like read-for-ownership before flushing
> > the D-cache in update_mmu_cache() (or set_pte_at()).
> 
> You might also want to experiment with not clearing PG_arch_1 in
> flush_dcache_page(). I'm not 100% convinced it is necessary and that may
> reduce the amount of flushing needed.

Could a file map page be swapped out (and the mapping removed), then the
page cache page modified (i.e. NFS filesystem) and flush_dcache_page()
called?

> Another thing is, on powerpc, we only do the cleaning when we try to
> execute from the pages. IE. We basically "filter out" exec permission
> when pages are not clean. At least on processors that support per-page
> exec permission. You may want to consider something like that as well.

For non-aliasing VIPT, I think that's a fair optimisation.

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03 10:21                                                         ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 23:33 +0000, Benjamin Herrenschmidt wrote:
> On Tue, 2010-03-02 at 17:47 +0000, Catalin Marinas wrote:
> >
> > Actually, option 2 still has an issue - does not easily work on SMP
> > systems where cache maintenance operations aren't broadcast in hardware.
> > In this case (ARM11MPCore), flush_dcache_page() is implemented
> > non-lazily so that the flushing happens on the same processor that
> > dirtied the cache. But since with some drivers there is no call to this
> > function, it wouldn't make any difference.
> 
> Also, option 1 would not solve the icache issue which has the same
> problem related to IPIs. 

Correct. But that's true for both options.

It would have been simpler if we had software TLBs.

> You -really- need to spank some HW folks here :-)

I think they got the message :). Cortex-A9 does it properly.

> > A solution is to do something like read-for-ownership before flushing
> > the D-cache in update_mmu_cache() (or set_pte_at()).
> 
> You might also want to experiment with not clearing PG_arch_1 in
> flush_dcache_page(). I'm not 100% convinced it is necessary and that may
> reduce the amount of flushing needed.

Could a file map page be swapped out (and the mapping removed), then the
page cache page modified (i.e. NFS filesystem) and flush_dcache_page()
called?

> Another thing is, on powerpc, we only do the cleaning when we try to
> execute from the pages. IE. We basically "filter out" exec permission
> when pages are not clean. At least on processors that support per-page
> exec permission. You may want to consider something like that as well.

For non-aliasing VIPT, I think that's a fair optimisation.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03  9:36                                                             ` Russell King - ARM Linux
@ 2010-03-03 10:24                                                               ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-03 10:24 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Benjamin Herrenschmidt, FUJITA Tomonori, catalin.marinas,
	mdharm-kernel, oliver, greg, x0082077, sshtylyov, bigeasy,
	linux-usb, linux-kernel, santosh.shilimkar, pavel, tom.leiming,
	linux-arm-kernel

On Wed, 2010-03-03 at 09:36 +0000, Russell King - ARM Linux wrote:
> On Wed, Mar 03, 2010 at 11:10:09AM +0530, James Bottomley wrote:
> > On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > > to architectures. 
> > > 
> > > How does the above work ? IE, the dma unmap will flush the D side but
> > > not the I side ... or is the ia64 flush primitive magic enough to do
> > > both ?
> > 
> > The point is that in a well regulated system, the I cache shouldn't need
> > extra flushing in the kernel.  We should only be faulting in R-X pages.
> 
> James, that's a pipedream.  If you have a processor which doesn't support
> NX, then the kernel marks all regions executable, even if the app only
> asks for RW protection.

I'm not talking about what the processor supports ... I'm talking about
what the user sets on the VMA.  My point is that the kernel only has
responsibility in specific situations ... it's those paths we do the I/D
coherency on.

> You end up with the protection masks always having VM_EXEC set in them,
> so there's no way to distinguish from the kernel POV which pages are
> going to be executed and those which aren't.

I think you're talking about the pte page flags, I'm talking about the
VMA ones above.

> And if you can't do that, you have to _always_ flush the I cache for
> every page fault, because you don't know if the I cache is out of sync
> with the page that you've just read in from disk - and therefore you
> may end up executing bad code instead of the glibc text that was
> intended.

If you're doing a not present, fault in a VMA executable region, I
agree ... since that's the start of the lifecycle where we have to begin
with I/D coherent.

> So here's the question: in a system where the responsibility for I-cache
> flushing is in userspace, how do you ensure that you can execute code
> in userspace to do this I-cache flushing without first having flushed
> the (speculatively prefetching) I-cache?

I'm not saying the common path (faulting in text sections) is the
responsibility of user space.  I'm saying the uncommon path, write
modification of binaries, is.  So the kernel only needs to worry about
the ordinary text fault path.

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03 10:24                                                               ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-03 10:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 09:36 +0000, Russell King - ARM Linux wrote:
> On Wed, Mar 03, 2010 at 11:10:09AM +0530, James Bottomley wrote:
> > On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > > to architectures. 
> > > 
> > > How does the above work ? IE, the dma unmap will flush the D side but
> > > not the I side ... or is the ia64 flush primitive magic enough to do
> > > both ?
> > 
> > The point is that in a well regulated system, the I cache shouldn't need
> > extra flushing in the kernel.  We should only be faulting in R-X pages.
> 
> James, that's a pipedream.  If you have a processor which doesn't support
> NX, then the kernel marks all regions executable, even if the app only
> asks for RW protection.

I'm not talking about what the processor supports ... I'm talking about
what the user sets on the VMA.  My point is that the kernel only has
responsibility in specific situations ... it's those paths we do the I/D
coherency on.

> You end up with the protection masks always having VM_EXEC set in them,
> so there's no way to distinguish from the kernel POV which pages are
> going to be executed and those which aren't.

I think you're talking about the pte page flags, I'm talking about the
VMA ones above.

> And if you can't do that, you have to _always_ flush the I cache for
> every page fault, because you don't know if the I cache is out of sync
> with the page that you've just read in from disk - and therefore you
> may end up executing bad code instead of the glibc text that was
> intended.

If you're doing a not present, fault in a VMA executable region, I
agree ... since that's the start of the lifecycle where we have to begin
with I/D coherent.

> So here's the question: in a system where the responsibility for I-cache
> flushing is in userspace, how do you ensure that you can execute code
> in userspace to do this I-cache flushing without first having flushed
> the (speculatively prefetching) I-cache?

I'm not saying the common path (faulting in text sections) is the
responsibility of user space.  I'm saying the uncommon path, write
modification of binaries, is.  So the kernel only needs to worry about
the ordinary text fault path.

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 23:29                                                     ` Benjamin Herrenschmidt
@ 2010-03-03 10:40                                                       ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: FUJITA Tomonori, mdharm-kernel, oliver, linux, greg, x0082077,
	sshtylyov, bigeasy, linux-usb, linux-kernel, James.Bottomley,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Tue, 2010-03-02 at 23:29 +0000, Benjamin Herrenschmidt wrote:
> On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> 
> > The viable solutions so far:
> >
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
[...]
> > Option 2 above looks pretty appealing to me since it can be done in the
> > ARM code exclusively. I've done some tests and it indeed solves the
> > cache coherency with a rootfs on a USB stick. As Russell suggested, it
> > can be optimised to mark a page as clean when the DMA API is involved to
> > avoid duplicate flushing.
> 
> That wouldn't solve the need for invalidating the I-cache... Unless we
> use another bit.

Indeed. We currently always invalidate the I-cache when the page is
mapped. With PG_arch_2, we could optimise this but I'm not sure it is
worth since I think we only get an update_mmu_cache() call for a page
(unless it is unmapped and re-mapped again).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03 10:40                                                       ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 23:29 +0000, Benjamin Herrenschmidt wrote:
> On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> 
> > The viable solutions so far:
> >
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
[...]
> > Option 2 above looks pretty appealing to me since it can be done in the
> > ARM code exclusively. I've done some tests and it indeed solves the
> > cache coherency with a rootfs on a USB stick. As Russell suggested, it
> > can be optimised to mark a page as clean when the DMA API is involved to
> > avoid duplicate flushing.
> 
> That wouldn't solve the need for invalidating the I-cache... Unless we
> use another bit.

Indeed. We currently always invalidate the I-cache when the page is
mapped. With PG_arch_2, we could optimise this but I'm not sure it is
worth since I think we only get an update_mmu_cache() call for a page
(unless it is unmapped and re-mapped again).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03  3:47                                                       ` FUJITA Tomonori
@ 2010-03-03 10:43                                                         ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:43 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: benh, mdharm-kernel, oliver, linux, greg, x0082077, sshtylyov,
	bigeasy, linux-usb, linux-kernel, James.Bottomley,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Wed, 2010-03-03 at 03:47 +0000, FUJITA Tomonori wrote:
> On Wed, 03 Mar 2010 10:29:54 +1100
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> >
> > > The viable solutions so far:
> > >
> > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > >         care of the D-cache flushing. This means that PIO drivers would
> > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > >         before writing to a page cache page.
> > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > >         means that by default newly allocated page cache pages are
> > >         considered dirty and even if there isn't a call to
> > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > >         This is the PowerPC approach.
> >
> > I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note
> 
> Yeah, as powerpc and ia64 do, arm can flush D cache and invalidate I
> cache when inserting a executable page to pte, IIUC. No need for the
> new API for I/D consistency.

I can see that IA-64 uses the PG_arch_1 bit to mark a clean page rather
than dirty (as we did for ARM). The Documentation/cachetlb.txt needs
updating.

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03 10:43                                                         ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 03:47 +0000, FUJITA Tomonori wrote:
> On Wed, 03 Mar 2010 10:29:54 +1100
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> >
> > > The viable solutions so far:
> > >
> > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > >         care of the D-cache flushing. This means that PIO drivers would
> > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > >         before writing to a page cache page.
> > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > >         means that by default newly allocated page cache pages are
> > >         considered dirty and even if there isn't a call to
> > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > >         This is the PowerPC approach.
> >
> > I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note
> 
> Yeah, as powerpc and ia64 do, arm can flush D cache and invalidate I
> cache when inserting a executable page to pte, IIUC. No need for the
> new API for I/D consistency.

I can see that IA-64 uses the PG_arch_1 bit to mark a clean page rather
than dirty (as we did for ARM). The Documentation/cachetlb.txt needs
updating.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03 10:24                                                               ` James Bottomley
@ 2010-03-03 19:41                                                                 ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-03 19:41 UTC (permalink / raw)
  To: James Bottomley
  Cc: Benjamin Herrenschmidt, FUJITA Tomonori, catalin.marinas,
	mdharm-kernel, oliver, greg, x0082077, sshtylyov, bigeasy,
	linux-usb, linux-kernel, santosh.shilimkar, pavel, tom.leiming,
	linux-arm-kernel

On Wed, Mar 03, 2010 at 03:54:37PM +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 09:36 +0000, Russell King - ARM Linux wrote:
> > James, that's a pipedream.  If you have a processor which doesn't support
> > NX, then the kernel marks all regions executable, even if the app only
> > asks for RW protection.
> 
> I'm not talking about what the processor supports ... I'm talking about
> what the user sets on the VMA.  My point is that the kernel only has
> responsibility in specific situations ... it's those paths we do the I/D
> coherency on.

You may not be talking about what the processor supports, but it is
directly relevant.

> > You end up with the protection masks always having VM_EXEC set in them,
> > so there's no way to distinguish from the kernel POV which pages are
> > going to be executed and those which aren't.
> 
> I think you're talking about the pte page flags, I'm talking about the
> VMA ones above.

No, I'm talking about the VMA ones.

        if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
                if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
                        prot |= PROT_EXEC;
...
        /* Do simple checking here so the lower-level routines won't have
         * to. we assume access permissions have been handled by the open
         * of the memory object, so we don't do any here.
         */
        vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
                        mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

calc_vm_prot_bits(unsigned long prot)
{
        return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
               _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
               _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
               arch_calc_vm_prot_bits(prot);
}

So, if you have a CPU which does not support NX, then READ_IMPLIES_EXEC
is set in the personality.  That forces PROT_EXEC for anything with
PROT_READ, which in turn forces VM_EXEC.

> I'm not saying the common path (faulting in text sections) is the
> responsibility of user space.  I'm saying the uncommon path, write
> modification of binaries, is.  So the kernel only needs to worry about
> the ordinary text fault path.

What I'm saying is that you can't always tell the difference between
what's an executable page and what isn't in the kernel.  On NX-incapable
CPUs, the kernel treats *all* readable pages as executable, and there's
no way to tell from the VMA or page protection flags that this isn't
the case.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03 19:41                                                                 ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-03 19:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 03, 2010 at 03:54:37PM +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 09:36 +0000, Russell King - ARM Linux wrote:
> > James, that's a pipedream.  If you have a processor which doesn't support
> > NX, then the kernel marks all regions executable, even if the app only
> > asks for RW protection.
> 
> I'm not talking about what the processor supports ... I'm talking about
> what the user sets on the VMA.  My point is that the kernel only has
> responsibility in specific situations ... it's those paths we do the I/D
> coherency on.

You may not be talking about what the processor supports, but it is
directly relevant.

> > You end up with the protection masks always having VM_EXEC set in them,
> > so there's no way to distinguish from the kernel POV which pages are
> > going to be executed and those which aren't.
> 
> I think you're talking about the pte page flags, I'm talking about the
> VMA ones above.

No, I'm talking about the VMA ones.

        if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
                if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
                        prot |= PROT_EXEC;
...
        /* Do simple checking here so the lower-level routines won't have
         * to. we assume access permissions have been handled by the open
         * of the memory object, so we don't do any here.
         */
        vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
                        mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

calc_vm_prot_bits(unsigned long prot)
{
        return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
               _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
               _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
               arch_calc_vm_prot_bits(prot);
}

So, if you have a CPU which does not support NX, then READ_IMPLIES_EXEC
is set in the personality.  That forces PROT_EXEC for anything with
PROT_READ, which in turn forces VM_EXEC.

> I'm not saying the common path (faulting in text sections) is the
> responsibility of user space.  I'm saying the uncommon path, write
> modification of binaries, is.  So the kernel only needs to worry about
> the ordinary text fault path.

What I'm saying is that you can't always tell the difference between
what's an executable page and what isn't in the kernel.  On NX-incapable
CPUs, the kernel treats *all* readable pages as executable, and there's
no way to tell from the VMA or page protection flags that this isn't
the case.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-01 10:42                                               ` Catalin Marinas
@ 2010-03-03 20:24                                                 ` Jamie Lokier
  -1 siblings, 0 replies; 352+ messages in thread
From: Jamie Lokier @ 2010-03-03 20:24 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Benjamin Herrenschmidt, Matthew Dharm, Oliver Neukum,
	Russell King - ARM Linux, Mankad, Maulik Ojas, Sergei Shtylyov,
	Ming Lei, Sebastian Siewior, linux-usb, linux-kernel,
	James Bottomley, Shilimkar, Santosh, Pavel Machek, Greg KH,
	linux-arm-kernel

Catalin Marinas wrote:
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> As an optimisation, I think this would help (rather than always
> invalidating the I-cache in update_mmu_cache or set_pte_at).

If PG_arch_{1,2} are used in the same way on all architectures, when
they are used at all, perhaps they should be renamed :-)

-- Jamie

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03 20:24                                                 ` Jamie Lokier
  0 siblings, 0 replies; 352+ messages in thread
From: Jamie Lokier @ 2010-03-03 20:24 UTC (permalink / raw)
  To: linux-arm-kernel

Catalin Marinas wrote:
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> As an optimisation, I think this would help (rather than always
> invalidating the I-cache in update_mmu_cache or set_pte_at).

If PG_arch_{1,2} are used in the same way on all architectures, when
they are used at all, perhaps they should be renamed :-)

-- Jamie

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-02 17:05                                                   ` Catalin Marinas
@ 2010-03-03 21:54                                                     ` Pavel Machek
  -1 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-03-03 21:54 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: FUJITA Tomonori, James.Bottomley, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

Hi!

> > I'm not sure that there are some problems in the mm or common code. Is
> > this ARM's implementation issue? (Of course, the usb stack and the
> > driver's misuse of the DMA API needs to be fixed too).
> 
> Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
> invalidation for user pages in update_mmu_cache() (it could actually be
> in set_pte_at on SMP to avoid a race but that's for another thread). The
> D-cache is flushed by this function only if the PG_arch_1 bit is set.
> This bit is set in the ARM case by flush_dcache_page(), following the
> advice in Documentation/cachetlb.txt.
> 
> With some drivers (those doing PIO) or subsystems (SCSI mass storage
> over USB HCD), there is no call to flush_dcache_page() for page cache
> pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> the D-cache (and only invalidating the I-cache doesn't help).
> 
> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.

What about option

3. Forget about PG_arch_1 and always do the flush?

How big is the performance impact? Note that current code does not
even *work* so working, 10% slower code will be an improvement.

								Pavel

(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-03 21:54                                                     ` Pavel Machek
  0 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-03-03 21:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > I'm not sure that there are some problems in the mm or common code. Is
> > this ARM's implementation issue? (Of course, the usb stack and the
> > driver's misuse of the DMA API needs to be fixed too).
> 
> Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
> invalidation for user pages in update_mmu_cache() (it could actually be
> in set_pte_at on SMP to avoid a race but that's for another thread). The
> D-cache is flushed by this function only if the PG_arch_1 bit is set.
> This bit is set in the ARM case by flush_dcache_page(), following the
> advice in Documentation/cachetlb.txt.
> 
> With some drivers (those doing PIO) or subsystems (SCSI mass storage
> over USB HCD), there is no call to flush_dcache_page() for page cache
> pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> the D-cache (and only invalidating the I-cache doesn't help).
> 
> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.

What about option

3. Forget about PG_arch_1 and always do the flush?

How big is the performance impact? Note that current code does not
even *work* so working, 10% slower code will be an improvement.

								Pavel

(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03  5:40                                                           ` James Bottomley
@ 2010-03-04  2:00                                                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04  2:00 UTC (permalink / raw)
  To: James Bottomley
  Cc: FUJITA Tomonori, catalin.marinas, mdharm-kernel, oliver, linux,
	greg, x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Wed, 2010-03-03 at 11:10 +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > to architectures. 
> > 
> > How does the above work ? IE, the dma unmap will flush the D side but
> > not the I side ... or is the ia64 flush primitive magic enough to do
> > both ?
> 
> The point is that in a well regulated system, the I cache shouldn't need
> extra flushing in the kernel.  We should only be faulting in R-X pages.
> If we're operating on RWX pages (i.e. self modifying code), it's the job
> of userspace to keep I/D coherency.
> 
> So the only case the kernel needs to worry about is the R-X fault case
> for executable text code.

Still, you do need to flush I when a page cache page is recycled.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04  2:00                                                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04  2:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 11:10 +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > to architectures. 
> > 
> > How does the above work ? IE, the dma unmap will flush the D side but
> > not the I side ... or is the ia64 flush primitive magic enough to do
> > both ?
> 
> The point is that in a well regulated system, the I cache shouldn't need
> extra flushing in the kernel.  We should only be faulting in R-X pages.
> If we're operating on RWX pages (i.e. self modifying code), it's the job
> of userspace to keep I/D coherency.
> 
> So the only case the kernel needs to worry about is the R-X fault case
> for executable text code.

Still, you do need to flush I when a page cache page is recycled.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-03 21:54                                                     ` Pavel Machek
  (?)
@ 2010-03-04  6:54                                                     ` Wolfgang Mües
  2010-03-04  9:31                                                       ` Russell King - ARM Linux
  2010-03-04 13:47                                                       ` Catalin Marinas
  -1 siblings, 2 replies; 352+ messages in thread
From: Wolfgang Mües @ 2010-03-04  6:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

Pavel Machek wrote:

> 3. Forget about PG_arch_1 and always do the flush?
> 
> How big is the performance impact? Note that current code does not
> even *work* so working, 10% slower code will be an improvement.

... and this is what *I* don't understand in this discussion. Obviously a 
flush() in PIO drivers is a clean and quick solution to the problem. And how 
much execution time will it cost - given the fact that if there is NO flush, 
the flush operation will not be avoided, only delayed (up to the time the data 
cache is doing the flush himself). If the data cache is doing the flush BEFORE 
the data is used in userspace (this includes the most common case of reading 
large files from the device), there will be no performance impact.

Just my 2 cents.

regards
Wolfgang
-- 
Wahre Worte sind nicht sch?n - Sch?ne Worte sind nicht wahr. (Laotse)

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04  2:00                                                             ` Benjamin Herrenschmidt
@ 2010-03-04  8:26                                                               ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-04  8:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: FUJITA Tomonori, catalin.marinas, mdharm-kernel, oliver, linux,
	greg, x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel,
	santosh.shilimkar, pavel, tom.leiming, linux-arm-kernel

On Thu, 2010-03-04 at 13:00 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-03-03 at 11:10 +0530, James Bottomley wrote:
> > On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > > to architectures. 
> > > 
> > > How does the above work ? IE, the dma unmap will flush the D side but
> > > not the I side ... or is the ia64 flush primitive magic enough to do
> > > both ?
> > 
> > The point is that in a well regulated system, the I cache shouldn't need
> > extra flushing in the kernel.  We should only be faulting in R-X pages.
> > If we're operating on RWX pages (i.e. self modifying code), it's the job
> > of userspace to keep I/D coherency.
> > 
> > So the only case the kernel needs to worry about is the R-X fault case
> > for executable text code.
> 
> Still, you do need to flush I when a page cache page is recycled.

Technically not if we've got all the I flushing when mapped executable
sorted out.  This is one of the dangers of over flushing ... if we start
flushing where we don't need it "just to be sure" we end up papering
over holes in the operating system and make catching actual bugs in
operations a lot harder.

The other thing you might not appreciate in ppc land is that for a lot
of other systems (well, like parisc) flushing a dirty cache line is
incredibly expensive (because we halt the CPU to wait for the memory
eviction), so ideally we want to flush as late as possible to give the
natural operations a chance to clean most of the cache lines.  Flushing
a clean cache line on parisc as well as invalidations are fast
operations.  That's why the kmap makes the most sense to us for
implementing PIO ops ... it's the farthest point we can flush the cache
at (because beyond it we've lost the mapping the VIPT cache requires to
flush).

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04  8:26                                                               ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-04  8:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 13:00 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-03-03 at 11:10 +0530, James Bottomley wrote:
> > On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > > to architectures. 
> > > 
> > > How does the above work ? IE, the dma unmap will flush the D side but
> > > not the I side ... or is the ia64 flush primitive magic enough to do
> > > both ?
> > 
> > The point is that in a well regulated system, the I cache shouldn't need
> > extra flushing in the kernel.  We should only be faulting in R-X pages.
> > If we're operating on RWX pages (i.e. self modifying code), it's the job
> > of userspace to keep I/D coherency.
> > 
> > So the only case the kernel needs to worry about is the R-X fault case
> > for executable text code.
> 
> Still, you do need to flush I when a page cache page is recycled.

Technically not if we've got all the I flushing when mapped executable
sorted out.  This is one of the dangers of over flushing ... if we start
flushing where we don't need it "just to be sure" we end up papering
over holes in the operating system and make catching actual bugs in
operations a lot harder.

The other thing you might not appreciate in ppc land is that for a lot
of other systems (well, like parisc) flushing a dirty cache line is
incredibly expensive (because we halt the CPU to wait for the memory
eviction), so ideally we want to flush as late as possible to give the
natural operations a chance to clean most of the cache lines.  Flushing
a clean cache line on parisc as well as invalidations are fast
operations.  That's why the kmap makes the most sense to us for
implementing PIO ops ... it's the farthest point we can flush the cache
at (because beyond it we've lost the mapping the VIPT cache requires to
flush).

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04  6:54                                                     ` Wolfgang Mües
@ 2010-03-04  9:31                                                       ` Russell King - ARM Linux
  2010-03-06 10:56                                                         ` Wolfgang Mües
  2010-03-04 13:47                                                       ` Catalin Marinas
  1 sibling, 1 reply; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04  9:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 07:54:57AM +0100, Wolfgang M?es wrote:
> ... and this is what *I* don't understand in this discussion. Obviously a 
> flush() in PIO drivers is a clean and quick solution to the problem. And how 
> much execution time will it cost - given the fact that if there is NO flush, 
> the flush operation will not be avoided, only delayed (up to the time the data 
> cache is doing the flush himself). If the data cache is doing the flush BEFORE 
> the data is used in userspace (this includes the most common case of reading 
> large files from the device), there will be no performance impact.

You're assuming that every page is used in the same way.  Here's some
examples where this is wrong:

1. A page is faulted in for an application, and it is a text page.
   - the data read in to the page needs to be visible to the instruction
     stream, so on Harvard architecture machines, this may require cache
     maintainence on both the D and I caches.

2. A page is faulted in for an application's data page.
   - data may be written to the kernel mapping, which may alias with the
     eventual userspace address.  These aliases need to be dealt with, to
     make the data visible to the user mapping of the page.

3. A page may be read in response to an application issuing a read(2) call.
   - the data is read from the kernel mapping, and isn't mapped into a
     userspace address.

So, in case (3), flushing the I and D caches could be completely wasteful
- consider if this file is a 600MB MPEG video file which is being read by
a video player.  There's no need to flush the I cache because MPEG data
will never be executed.  There's no need to flush the D cache because
there isn't a user mapping of that data yet, and therefore there aren't
any aliases.

In case (2), it would be wasteful to flush the I cache - the application
isn't going to execute the data.

In case (1), everything is required to ensure that the instruction stream
can see the instructions.

So, the PG_arch_1 'delayed flush' is not only about delaying flushes until
they're required, it's about eliminating those which are not required to
give additional system performance - maybe to the point where you can
serve MP3 files via NFS with a low enough latency that your player isn't
regularly starved of data because of all the needless flushing going on.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-03 21:54                                                     ` Pavel Machek
@ 2010-03-04 13:35                                                       ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 13:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: FUJITA Tomonori, James.Bottomley, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > over USB HCD), there is no call to flush_dcache_page() for page cache
> > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > the D-cache (and only invalidating the I-cache doesn't help).
> >
> > The viable solutions so far:
> >
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
> 
> What about option
> 
> 3. Forget about PG_arch_1 and always do the flush?
> 
> How big is the performance impact? Note that current code does not
> even *work* so working, 10% slower code will be an improvement.

The driver fix is as simple as calling a flush_dcache_page() and I've
been carrying such patches in my tree for some time now. The question is
whether we need to do it in the driver or not (would need to update
Documentation/cachetlb.txt as well).

The reason I'm not in favour always doing the flush is that we penalise
DMA drivers where there is no need for extra D-cache flushing (already
handled by the DMA API; option 1 above is similar, just that it is meant
for PIO usage). An ARM patch I proposed for inverting the meaning of
PG_arch_1 also marks a page as clean in the dma_map_* functions.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 13:35                                                       ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 13:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > over USB HCD), there is no call to flush_dcache_page() for page cache
> > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > the D-cache (and only invalidating the I-cache doesn't help).
> >
> > The viable solutions so far:
> >
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
> 
> What about option
> 
> 3. Forget about PG_arch_1 and always do the flush?
> 
> How big is the performance impact? Note that current code does not
> even *work* so working, 10% slower code will be an improvement.

The driver fix is as simple as calling a flush_dcache_page() and I've
been carrying such patches in my tree for some time now. The question is
whether we need to do it in the driver or not (would need to update
Documentation/cachetlb.txt as well).

The reason I'm not in favour always doing the flush is that we penalise
DMA drivers where there is no need for extra D-cache flushing (already
handled by the DMA API; option 1 above is similar, just that it is meant
for PIO usage). An ARM patch I proposed for inverting the meaning of
PG_arch_1 also marks a page as clean in the dma_map_* functions.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04  6:54                                                     ` Wolfgang Mües
  2010-03-04  9:31                                                       ` Russell King - ARM Linux
@ 2010-03-04 13:47                                                       ` Catalin Marinas
  1 sibling, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 13:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 06:54 +0000, Wolfgang M?es wrote:
> Pavel Machek wrote:
> 
> > 3. Forget about PG_arch_1 and always do the flush?
> >
> > How big is the performance impact? Note that current code does not
> > even *work* so working, 10% slower code will be an improvement.
> 
> ... and this is what *I* don't understand in this discussion. Obviously a
> flush() in PIO drivers is a clean and quick solution to the problem. And how
> much execution time will it cost - given the fact that if there is NO flush,
> the flush operation will not be avoided, only delayed (up to the time the data
> cache is doing the flush himself). If the data cache is doing the flush BEFORE
> the data is used in userspace (this includes the most common case of reading
> large files from the device), there will be no performance impact.

Indeed, I don't care much about whether we do delayed cache flushing or
not. What I care about is that we need flushing at least once (and
ideally only once). Most PIO drivers don't call any cache flushing
function. Upper layers like USB mass storage or VFS don't do it either
(and probably they shouldn't).

This leaves us with either modifying existing PIO drivers (two patches I
submitted are already in mainline) or clarifying the flush_dcache_page()
usage throughout the kernel (and modifying the architecture code
accordingly). The Documentation/cachetlb.txt states that
flush_dcache_page() is called any time the kernel writes to a page cache
page, which is not the case for PIO drivers.

There may be a small advantage with the delayed flushing since not all
pages read from a device would be mapped in user space but I haven't
done any benchmarks to see the impact.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 13:35                                                       ` Catalin Marinas
@ 2010-03-04 13:51                                                         ` Pavel Machek
  -1 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-03-04 13:51 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: FUJITA Tomonori, James.Bottomley, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

> On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > the D-cache (and only invalidating the I-cache doesn't help).
> > >
> > > The viable solutions so far:
> > >
> > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > >         care of the D-cache flushing. This means that PIO drivers would
> > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > >         before writing to a page cache page.
> > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > >         means that by default newly allocated page cache pages are
> > >         considered dirty and even if there isn't a call to
> > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > >         This is the PowerPC approach.
> > 
> > What about option
> > 
> > 3. Forget about PG_arch_1 and always do the flush?
> > 
> > How big is the performance impact? Note that current code does not
> > even *work* so working, 10% slower code will be an improvement.
> 
> The driver fix is as simple as calling a flush_dcache_page() and I've
> been carrying such patches in my tree for some time now. The question is
> whether we need to do it in the driver or not (would need to update
> Documentation/cachetlb.txt as well).
> 
> The reason I'm not in favour always doing the flush is that we penalise
> DMA drivers where there is no need for extra D-cache flushing (already
> handled by the DMA API; option 1 above is similar, just that it is meant
> for PIO usage). An ARM patch I proposed for inverting the meaning of
> PG_arch_1 also marks a page as clean in the dma_map_* functions.

But you are not fixing driver bug, are you?

Seems like ARM has requirement other architectures do not, that is
a) not documented anywhere
b) causes problems

You could argue that performance improvement (how big is it, anyway?)
is worth it, but this should be agreed to by wider community...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 13:51                                                         ` Pavel Machek
  0 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-03-04 13:51 UTC (permalink / raw)
  To: linux-arm-kernel

> On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > the D-cache (and only invalidating the I-cache doesn't help).
> > >
> > > The viable solutions so far:
> > >
> > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > >         care of the D-cache flushing. This means that PIO drivers would
> > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > >         before writing to a page cache page.
> > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > >         means that by default newly allocated page cache pages are
> > >         considered dirty and even if there isn't a call to
> > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > >         This is the PowerPC approach.
> > 
> > What about option
> > 
> > 3. Forget about PG_arch_1 and always do the flush?
> > 
> > How big is the performance impact? Note that current code does not
> > even *work* so working, 10% slower code will be an improvement.
> 
> The driver fix is as simple as calling a flush_dcache_page() and I've
> been carrying such patches in my tree for some time now. The question is
> whether we need to do it in the driver or not (would need to update
> Documentation/cachetlb.txt as well).
> 
> The reason I'm not in favour always doing the flush is that we penalise
> DMA drivers where there is no need for extra D-cache flushing (already
> handled by the DMA API; option 1 above is similar, just that it is meant
> for PIO usage). An ARM patch I proposed for inverting the meaning of
> PG_arch_1 also marks a page as clean in the dma_map_* functions.

But you are not fixing driver bug, are you?

Seems like ARM has requirement other architectures do not, that is
a) not documented anywhere
b) causes problems

You could argue that performance improvement (how big is it, anyway?)
is worth it, but this should be agreed to by wider community...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 13:51                                                         ` Pavel Machek
@ 2010-03-04 14:21                                                           ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-04 14:21 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Catalin Marinas, FUJITA Tomonori, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > > the D-cache (and only invalidating the I-cache doesn't help).
> > > >
> > > > The viable solutions so far:
> > > >
> > > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > > >         care of the D-cache flushing. This means that PIO drivers would
> > > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > > >         before writing to a page cache page.
> > > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > > >         means that by default newly allocated page cache pages are
> > > >         considered dirty and even if there isn't a call to
> > > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > > >         This is the PowerPC approach.
> > > 
> > > What about option
> > > 
> > > 3. Forget about PG_arch_1 and always do the flush?
> > > 
> > > How big is the performance impact? Note that current code does not
> > > even *work* so working, 10% slower code will be an improvement.
> > 
> > The driver fix is as simple as calling a flush_dcache_page() and I've
> > been carrying such patches in my tree for some time now. The question is
> > whether we need to do it in the driver or not (would need to update
> > Documentation/cachetlb.txt as well).
> > 
> > The reason I'm not in favour always doing the flush is that we penalise
> > DMA drivers where there is no need for extra D-cache flushing (already
> > handled by the DMA API; option 1 above is similar, just that it is meant
> > for PIO usage). An ARM patch I proposed for inverting the meaning of
> > PG_arch_1 also marks a page as clean in the dma_map_* functions.
> 
> But you are not fixing driver bug, are you?

Technically, he is.  In the old days, most VI architectures were high
end enough not to require PIO transfers.  The only exception was an IDE
driver used by sparc, which lead to the arch specific ide in/out string
instructions, in which sparc actually did all the necessary flushing.

So no other drivers than old IDE grew up with cache flushing in the PIO
case (and almost no high end VI hardware had an IDE interface, so they
rarely got implemented in the arch layer).  However, recently, with the
transition from old IDE to libata and the prevalence of ARM with more
commodity hardware, the deficiency is becoming exposed.  Even the PA8000
workstations now come with an IDE CD, which means we're starting to have
problems with them as well.

> Seems like ARM has requirement other architectures do not, that is
> a) not documented anywhere
> b) causes problems
> 
> You could argue that performance improvement (how big is it, anyway?)
> is worth it, but this should be agreed to by wider community...

Performance is always worth it provided we don't sacrifice correctness.
The thing which was discovered in this thread is basically that ARM is
handling deferred flushing (for D/I coherency) in a slightly different
way from everyone else ... once that's fixed, ARM will likely not have
the D/I problem, but we'll still have the libata (and other PIO systems)
D flushing issue.

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 14:21                                                           ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-04 14:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > > the D-cache (and only invalidating the I-cache doesn't help).
> > > >
> > > > The viable solutions so far:
> > > >
> > > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > > >         care of the D-cache flushing. This means that PIO drivers would
> > > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > > >         before writing to a page cache page.
> > > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > > >         means that by default newly allocated page cache pages are
> > > >         considered dirty and even if there isn't a call to
> > > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > > >         This is the PowerPC approach.
> > > 
> > > What about option
> > > 
> > > 3. Forget about PG_arch_1 and always do the flush?
> > > 
> > > How big is the performance impact? Note that current code does not
> > > even *work* so working, 10% slower code will be an improvement.
> > 
> > The driver fix is as simple as calling a flush_dcache_page() and I've
> > been carrying such patches in my tree for some time now. The question is
> > whether we need to do it in the driver or not (would need to update
> > Documentation/cachetlb.txt as well).
> > 
> > The reason I'm not in favour always doing the flush is that we penalise
> > DMA drivers where there is no need for extra D-cache flushing (already
> > handled by the DMA API; option 1 above is similar, just that it is meant
> > for PIO usage). An ARM patch I proposed for inverting the meaning of
> > PG_arch_1 also marks a page as clean in the dma_map_* functions.
> 
> But you are not fixing driver bug, are you?

Technically, he is.  In the old days, most VI architectures were high
end enough not to require PIO transfers.  The only exception was an IDE
driver used by sparc, which lead to the arch specific ide in/out string
instructions, in which sparc actually did all the necessary flushing.

So no other drivers than old IDE grew up with cache flushing in the PIO
case (and almost no high end VI hardware had an IDE interface, so they
rarely got implemented in the arch layer).  However, recently, with the
transition from old IDE to libata and the prevalence of ARM with more
commodity hardware, the deficiency is becoming exposed.  Even the PA8000
workstations now come with an IDE CD, which means we're starting to have
problems with them as well.

> Seems like ARM has requirement other architectures do not, that is
> a) not documented anywhere
> b) causes problems
> 
> You could argue that performance improvement (how big is it, anyway?)
> is worth it, but this should be agreed to by wider community...

Performance is always worth it provided we don't sacrifice correctness.
The thing which was discovered in this thread is basically that ARM is
handling deferred flushing (for D/I coherency) in a slightly different
way from everyone else ... once that's fixed, ARM will likely not have
the D/I problem, but we'll still have the libata (and other PIO systems)
D flushing issue.

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 14:21                                                           ` James Bottomley
@ 2010-03-04 14:27                                                             ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 14:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pavel Machek, Catalin Marinas, FUJITA Tomonori, benh,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> > 
> > You could argue that performance improvement (how big is it, anyway?)
> > is worth it, but this should be agreed to by wider community...
> 
> Performance is always worth it provided we don't sacrifice correctness.
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... once that's fixed, ARM will likely not have
> the D/I problem, but we'll still have the libata (and other PIO systems)
> D flushing issue.

I think you've got that backwards.

Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
since we'll interpret '0' to mean "page is dirty, it needs flushing before
hitting userspace", whereas '1' means "page has been cleaned; there are no
aliases."

This doesn not address the I/D coherency issue, where the Icache needs
attention to get rid of speculatively loaded cache lines while old data
was present in the cache.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 14:27                                                             ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 14:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> > 
> > You could argue that performance improvement (how big is it, anyway?)
> > is worth it, but this should be agreed to by wider community...
> 
> Performance is always worth it provided we don't sacrifice correctness.
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... once that's fixed, ARM will likely not have
> the D/I problem, but we'll still have the libata (and other PIO systems)
> D flushing issue.

I think you've got that backwards.

Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
since we'll interpret '0' to mean "page is dirty, it needs flushing before
hitting userspace", whereas '1' means "page has been cleaned; there are no
aliases."

This doesn not address the I/D coherency issue, where the Icache needs
attention to get rid of speculatively loaded cache lines while old data
was present in the cache.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 14:27                                                             ` Russell King - ARM Linux
@ 2010-03-04 15:25                                                               ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:25 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: James Bottomley, Pavel Machek, FUJITA Tomonori, benh,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> > >
> > > You could argue that performance improvement (how big is it, anyway?)
> > > is worth it, but this should be agreed to by wider community...
> >
> > Performance is always worth it provided we don't sacrifice correctness.
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... once that's fixed, ARM will likely not have
> > the D/I problem, but we'll still have the libata (and other PIO systems)
> > D flushing issue.
> 
> I think you've got that backwards.
> 
> Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> since we'll interpret '0' to mean "page is dirty, it needs flushing before
> hitting userspace", whereas '1' means "page has been cleaned; there are no
> aliases."
> 
> This doesn not address the I/D coherency issue, where the Icache needs
> attention to get rid of speculatively loaded cache lines while old data
> was present in the cache.

The I-cache flushing is already handled in update_mmu_cache (or
set_pte_at in a future patch; I'm not talking about other issues on
ARM11MPCore here).

We always invalidate the I-cache currently (since we may have DMA
transfers and the page's D-cache is clean). As an optimisation, we could
use PG_arch_2 for I-cache but I don't think there is much performance
benefit compared to always invalidating the I-cache flushing.

My understanding from this long discussion is that we cannot get the
kernel modifying a page cache page which is already mapped in user space
(well, ptrace does this but we flush the cache there already).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 15:25                                                               ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> > >
> > > You could argue that performance improvement (how big is it, anyway?)
> > > is worth it, but this should be agreed to by wider community...
> >
> > Performance is always worth it provided we don't sacrifice correctness.
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... once that's fixed, ARM will likely not have
> > the D/I problem, but we'll still have the libata (and other PIO systems)
> > D flushing issue.
> 
> I think you've got that backwards.
> 
> Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> since we'll interpret '0' to mean "page is dirty, it needs flushing before
> hitting userspace", whereas '1' means "page has been cleaned; there are no
> aliases."
> 
> This doesn not address the I/D coherency issue, where the Icache needs
> attention to get rid of speculatively loaded cache lines while old data
> was present in the cache.

The I-cache flushing is already handled in update_mmu_cache (or
set_pte_at in a future patch; I'm not talking about other issues on
ARM11MPCore here).

We always invalidate the I-cache currently (since we may have DMA
transfers and the page's D-cache is clean). As an optimisation, we could
use PG_arch_2 for I-cache but I don't think there is much performance
benefit compared to always invalidating the I-cache flushing.

My understanding from this long discussion is that we cannot get the
kernel modifying a page cache page which is already mapped in user space
(well, ptrace does this but we flush the cache there already).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 14:21                                                           ` James Bottomley
@ 2010-03-04 15:29                                                             ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:29 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pavel Machek, FUJITA Tomonori, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... 

Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
and IA-64 use PG_arch_1 as a clean rather than dirty bit.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 15:29                                                             ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... 

Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
and IA-64 use PG_arch_1 as a clean rather than dirty bit.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 15:25                                                               ` Catalin Marinas
@ 2010-03-04 15:34                                                                 ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 15:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: James Bottomley, Pavel Machek, FUJITA Tomonori, benh,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Thu, Mar 04, 2010 at 03:25:23PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> > On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > > Seems like ARM has requirement other architectures do not, that is
> > > > a) not documented anywhere
> > > > b) causes problems
> > > >
> > > > You could argue that performance improvement (how big is it, anyway?)
> > > > is worth it, but this should be agreed to by wider community...
> > >
> > > Performance is always worth it provided we don't sacrifice correctness.
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... once that's fixed, ARM will likely not have
> > > the D/I problem, but we'll still have the libata (and other PIO systems)
> > > D flushing issue.
> > 
> > I think you've got that backwards.
> > 
> > Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> > since we'll interpret '0' to mean "page is dirty, it needs flushing before
> > hitting userspace", whereas '1' means "page has been cleaned; there are no
> > aliases."
> > 
> > This doesn not address the I/D coherency issue, where the Icache needs
> > attention to get rid of speculatively loaded cache lines while old data
> > was present in the cache.
> 
> The I-cache flushing is already handled in update_mmu_cache (or
> set_pte_at in a future patch; I'm not talking about other issues on
> ARM11MPCore here).

You may not have been; my message was addressed to James to correct
his message, which seems to have the issues confused.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 15:34                                                                 ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 15:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 03:25:23PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> > On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > > Seems like ARM has requirement other architectures do not, that is
> > > > a) not documented anywhere
> > > > b) causes problems
> > > >
> > > > You could argue that performance improvement (how big is it, anyway?)
> > > > is worth it, but this should be agreed to by wider community...
> > >
> > > Performance is always worth it provided we don't sacrifice correctness.
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... once that's fixed, ARM will likely not have
> > > the D/I problem, but we'll still have the libata (and other PIO systems)
> > > D flushing issue.
> > 
> > I think you've got that backwards.
> > 
> > Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> > since we'll interpret '0' to mean "page is dirty, it needs flushing before
> > hitting userspace", whereas '1' means "page has been cleaned; there are no
> > aliases."
> > 
> > This doesn not address the I/D coherency issue, where the Icache needs
> > attention to get rid of speculatively loaded cache lines while old data
> > was present in the cache.
> 
> The I-cache flushing is already handled in update_mmu_cache (or
> set_pte_at in a future patch; I'm not talking about other issues on
> ARM11MPCore here).

You may not have been; my message was addressed to James to correct
his message, which seems to have the issues confused.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 13:51                                                         ` Pavel Machek
@ 2010-03-04 15:35                                                           ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: FUJITA Tomonori, James.Bottomley, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

On Thu, 2010-03-04 at 13:51 +0000, Pavel Machek wrote:
> > On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > > the D-cache (and only invalidating the I-cache doesn't help).
> > > >
> > > > The viable solutions so far:
> > > >
> > > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > > >         care of the D-cache flushing. This means that PIO drivers would
> > > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > > >         before writing to a page cache page.
> > > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > > >         means that by default newly allocated page cache pages are
> > > >         considered dirty and even if there isn't a call to
> > > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > > >         This is the PowerPC approach.
> > >
> > > What about option
> > >
> > > 3. Forget about PG_arch_1 and always do the flush?
> > >
> > > How big is the performance impact? Note that current code does not
> > > even *work* so working, 10% slower code will be an improvement.
> >
> > The driver fix is as simple as calling a flush_dcache_page() and I've
> > been carrying such patches in my tree for some time now. The question is
> > whether we need to do it in the driver or not (would need to update
> > Documentation/cachetlb.txt as well).
> >
> > The reason I'm not in favour always doing the flush is that we penalise
> > DMA drivers where there is no need for extra D-cache flushing (already
> > handled by the DMA API; option 1 above is similar, just that it is meant
> > for PIO usage). An ARM patch I proposed for inverting the meaning of
> > PG_arch_1 also marks a page as clean in the dma_map_* functions.
> 
> But you are not fixing driver bug, are you?

Some drivers I fixed already: db8516f61b481e8, 2d68b7fe55d9e19.

> Seems like ARM has requirement other architectures do not, that is
> a) not documented anywhere
> b) causes problems

Well, ARM is pretty similar to other architectures in this respect. And
I'm sure other architectures have similar problems, only that they only
become visible in some circumstances they may not have encountered (i.e.
PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
Some other architectures may do heavier flushing

Of course, a Documentation/arm/cachetlb.txt file would make sense.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 15:35                                                           ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 13:51 +0000, Pavel Machek wrote:
> > On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > > the D-cache (and only invalidating the I-cache doesn't help).
> > > >
> > > > The viable solutions so far:
> > > >
> > > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > > >         care of the D-cache flushing. This means that PIO drivers would
> > > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > > >         before writing to a page cache page.
> > > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > > >         means that by default newly allocated page cache pages are
> > > >         considered dirty and even if there isn't a call to
> > > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > > >         This is the PowerPC approach.
> > >
> > > What about option
> > >
> > > 3. Forget about PG_arch_1 and always do the flush?
> > >
> > > How big is the performance impact? Note that current code does not
> > > even *work* so working, 10% slower code will be an improvement.
> >
> > The driver fix is as simple as calling a flush_dcache_page() and I've
> > been carrying such patches in my tree for some time now. The question is
> > whether we need to do it in the driver or not (would need to update
> > Documentation/cachetlb.txt as well).
> >
> > The reason I'm not in favour always doing the flush is that we penalise
> > DMA drivers where there is no need for extra D-cache flushing (already
> > handled by the DMA API; option 1 above is similar, just that it is meant
> > for PIO usage). An ARM patch I proposed for inverting the meaning of
> > PG_arch_1 also marks a page as clean in the dma_map_* functions.
> 
> But you are not fixing driver bug, are you?

Some drivers I fixed already: db8516f61b481e8, 2d68b7fe55d9e19.

> Seems like ARM has requirement other architectures do not, that is
> a) not documented anywhere
> b) causes problems

Well, ARM is pretty similar to other architectures in this respect. And
I'm sure other architectures have similar problems, only that they only
become visible in some circumstances they may not have encountered (i.e.
PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
Some other architectures may do heavier flushing

Of course, a Documentation/arm/cachetlb.txt file would make sense.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 15:29                                                             ` Catalin Marinas
@ 2010-03-04 15:41                                                               ` Paul Mundt
  -1 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-04 15:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: James Bottomley, Pavel Machek, FUJITA Tomonori, benh, linux,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... 
> 
> Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> 
SH used to use it as a PG_mapped which was roughly similar to the
PG_dcache_clean approach, at which point things like flushing for the PIO
case in the HCD wasn't necessary. It did result in rather aggressive over
flushing though, which is one of the reasons we elected to switch to
PG_dcache_dirty.

Note that the PG_dcache_dirty semantics are also outlined in
Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 15:41                                                               ` Paul Mundt
  0 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-04 15:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... 
> 
> Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> 
SH used to use it as a PG_mapped which was roughly similar to the
PG_dcache_clean approach, at which point things like flushing for the PIO
case in the HCD wasn't necessary. It did result in rather aggressive over
flushing though, which is one of the reasons we elected to switch to
PG_dcache_dirty.

Note that the PG_dcache_dirty semantics are also outlined in
Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 15:41                                                               ` Paul Mundt
@ 2010-03-04 16:30                                                                 ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 16:30 UTC (permalink / raw)
  To: Paul Mundt
  Cc: Catalin Marinas, FUJITA Tomonori, mdharm-kernel, oliver, greg,
	x0082077, sshtylyov, benh, bigeasy, linux-usb, linux-kernel,
	James Bottomley, santosh.shilimkar, Pavel Machek, tom.leiming,
	linux-arm-kernel

On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... 
> > 
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.
> 
> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Indeed; the ARM approach was basically taken from Sparc64.

The problem being talked about (with data from PIO drivers not being
visible to userspace) is one of those corner cases.  It's been around
for something like 6 years or more, being reported by folk on the ARM
list on and off - so it's nothing new.

However, it seems very obscure - I've never been able to reproduce it
on any platform I have here, even with people's test programs which
instantly show it on their hardware.  It seems to require a very
specific set of hardware and software conditions to trigger it.

The general critera (from memory) seems to be:
- a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
- write allocate caches show the problem better than read allocate only
- using a block device for the filesystem
- mmap'ing a page and immediately accessing the last few cache lines in
  that page

The problem is that if enough of your data cache gets cycled through
in between the data being written to the page, and userspace trying to
read it, then you're going to see correct data.  So, the larger the L1
cache, the greater the chance that you'll see a problem.

Here is a program which Lothar sent me some time ago (the timestamp on
the .c is June 2004 - I can't find the original email though.)  I've
just checked with Lothar, who has given me permission to reproduce it.

I can't guarantee that this program still shows a problem - since I
believe I've never been able to reproduce it myself.  It might be worth
checking how other architectures behave.

Note that loop did get fixed with flush_dcache_page(), so trying it
against a loopback mounted filesystem won't show the problem.

/*
 * creates a testfile, 'mmap's it, and checks its content reading
 * page back to front. If a data error is found, the same page is read
 * over and over again, until data is eventually correct after some time.
 *
 * This points out a cache problem in the ARM linux kernel
 * Using the cache in Write-Through mode (kernel command line option: cachepolicy=writethrough)
 * or CONFIG_XSCALE_CACHE_ERRATA=y in older kernels prevents this problem
 *
 * (C) Lothar Wassmann, <LW@KARO-electronics.de>
 *
 */
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/ioctl.h>

#define PAGE_SIZE	4096
#define PAGE_SIZE_INT	((PAGE_SIZE)/sizeof(unsigned long))
#define PAGE_MASK	((PAGE_SIZE)-1)

#undef USE_BLKFLSBUF
#define BLKFRASET  _IO(0x12,100)/* set filesystem (mm/filemap.c) read-ahead */


size_t file_size = 256 * PAGE_SIZE;

unsigned long *buf=NULL;

const char* fn="testfile";

void usage(const char* name)
{
	printf("%s <mount point> [filename]\n", name);
	printf("\trequires <mount point> to be defined in /etc/fstab\n");
	printf("\t<mount point> will be unmounted and remounted during the test\n");
}

int create_file(const char* name, size_t size)
{
	int ret=0;
	int i;
	int fd;

	fd = open(name, O_CREAT|O_RDWR|O_SYNC|O_TRUNC, S_IWUSR|S_IRUSR|S_IRGRP|S_IROTH);
	if (fd < 0) {
		fprintf(stderr, "Failed to open '%s' for writing, errno=%d\n", name, errno);
		return errno;
	}

	for (i = size / sizeof(*buf); i > 0; i--) {
		buf[i-1] = i;
	}
	write(fd, buf, size);
	memset(buf, 0x55, size);

	close(fd);
	return ret;
}

int do_check(int fd, void *mapptr, size_t size)
{
	const int num_pages=size/PAGE_SIZE;
	volatile unsigned char *ptr=mapptr;
	int errors = 0;
	int soft = 0;
	int page;

	printf("Checking data from %08lx to %08lx\n", (unsigned long)(ptr + size),
	       (unsigned long)ptr);

	for (page = num_pages - 1; page >= 0; page--) {
		volatile unsigned long *pp=(volatile unsigned long *)&ptr[page*PAGE_SIZE];
		int offs;
		int page_errs=0;
		int err_offs=-1;

		for (offs = 0; offs < PAGE_SIZE; offs += sizeof(unsigned long)) {
			volatile unsigned long *lp=&pp[offs/sizeof(unsigned long)];
			unsigned long data=*lp;
			unsigned long ref=(((page*PAGE_SIZE)+offs)/sizeof(data)) + 1;

			if (data != ref) {
				const int max_tries=100000;
				int retries=max_tries;
				unsigned long new_data=*lp;

				errors++;
				page_errs++;
				while ((new_data != ref) && (--retries > 0)) {
					if (data != new_data) {
						fprintf(stderr, "Data @ page %03x:%03x (%08lx) changed to %08lx(%08lx)\n",
							page, offs, (unsigned long)lp, new_data, ref);
					}
					data = new_data;
					new_data = *lp;
				}
				if (new_data == ref) {
					fprintf(stderr, "Data @ page %03x:%03x (%08lx) OK after %d retries: %08lx\n",
						page, offs, (unsigned long)lp, max_tries - retries, new_data);
					soft++;
				} else {
					if (err_offs != offs) {
						fprintf(stderr, "Data error @ page %03x:%03x (%08lx): %08lx -> %08lx\n",
							page, offs, (unsigned long)lp, ref, data);
						err_offs = offs;
					}
					// retry the same page again, until data is correct
					offs = 0;
				}
			}
		}
		if (page_errs) {
			page = num_pages;
		}
	}

	fprintf(stderr, "Errors reverse check: %d; soft: %d; total bytes %d in %d pages\n",
		errors, soft, size, num_pages);

	return errors;
}

int check_file(const char* name, size_t size)
{
	int ret=0;
	int fd;
	void *ptr=NULL;
	int errors=0;
	int last_errors=0;

	fd = open(name, O_RDONLY|O_SYNC);
	if (fd < 0) {
		fprintf(stderr, "Failed to open '%s' for reading\n", name);
		return errno;
	}

	ptr = mmap(NULL, size, PROT_READ, MAP_SHARED/*PRIVATE*/, fd, 0);
	if (ptr == MAP_FAILED) {
		close(fd);
		return -ENOMEM;
	}

	printf("Checking file '%s'\n", name);
	do {
		last_errors = errors;
		errors = do_check(fd, ptr, size);
		if (errors != 0) {
			ret = errors;
		}
	} while (errors > 0 && errors != last_errors);

	if (munmap(ptr, size) != 0) {
		fprintf(stderr, "Failed to unmap %08lx\n", (unsigned long)ptr);
		if (ret == 0) {
			ret = -ENOMEM;
		}
	}
	close(fd);
	if (buf != NULL) {
		memset(buf, 0x55, size);
	}

	if (ret == 0) {
		printf("check successful\n");
	} else {
		printf("check failed\n");
	}

	return ret;
}

int main(int argc, char *argv[])
{
	int rc=0;
	char fname[100];
	char mount[44];
	char umount[44];

	if (argc < 2) {
		// first argument is required
		usage(argv[0]);
		return 1;
	}
	if (argc > 2) {
		// take optional second argument as filename
		fn = argv[2];
	}

	sprintf(fname, "%s/%s", argv[1], fn);
	sprintf(mount, "mount %s", argv[1]);
	sprintf(umount, "umount %s", argv[1]);

	file_size &= ~PAGE_MASK; // round size to page boundary
	buf = malloc(file_size);

	if (buf == NULL) {
		fprintf(stderr, "Failed to allocate buffer\n");
		rc = -ENOMEM;
	}

#ifdef USE_BLKFLSBUF	
	printf("Mounting '%s'\n", argv[1]);
	system(mount);
#endif

	while (rc == 0) {
		printf("Opening '%s'\n", fname);
		rc = create_file(fname, file_size);
		if (rc != 0) {
			fprintf(stderr, "Failed to create file '%s', rc=%d\n", fname, rc);
			break;
		}

#ifndef USE_BLKFLSBUF
		printf("Unmounting '%s'\n", argv[1]);
		system(umount);

		printf("Remounting '%s'\n", argv[1]);
		system(mount);
#else
		{
			int fd = open("/dev/loop0", O_RDONLY);
			ioctl(fd, BLKFLSBUF, 0);
			ioctl(fd, BLKRASET, 0);
			ioctl(fd, BLKFRASET, 0);
			close(fd);
		}
#endif

		rc = check_file(fname, file_size);
	}

	if (buf != NULL) {
		free(buf);
	}

	return rc;
}

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 16:30                                                                 ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 16:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... 
> > 
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.
> 
> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Indeed; the ARM approach was basically taken from Sparc64.

The problem being talked about (with data from PIO drivers not being
visible to userspace) is one of those corner cases.  It's been around
for something like 6 years or more, being reported by folk on the ARM
list on and off - so it's nothing new.

However, it seems very obscure - I've never been able to reproduce it
on any platform I have here, even with people's test programs which
instantly show it on their hardware.  It seems to require a very
specific set of hardware and software conditions to trigger it.

The general critera (from memory) seems to be:
- a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
- write allocate caches show the problem better than read allocate only
- using a block device for the filesystem
- mmap'ing a page and immediately accessing the last few cache lines in
  that page

The problem is that if enough of your data cache gets cycled through
in between the data being written to the page, and userspace trying to
read it, then you're going to see correct data.  So, the larger the L1
cache, the greater the chance that you'll see a problem.

Here is a program which Lothar sent me some time ago (the timestamp on
the .c is June 2004 - I can't find the original email though.)  I've
just checked with Lothar, who has given me permission to reproduce it.

I can't guarantee that this program still shows a problem - since I
believe I've never been able to reproduce it myself.  It might be worth
checking how other architectures behave.

Note that loop did get fixed with flush_dcache_page(), so trying it
against a loopback mounted filesystem won't show the problem.

/*
 * creates a testfile, 'mmap's it, and checks its content reading
 * page back to front. If a data error is found, the same page is read
 * over and over again, until data is eventually correct after some time.
 *
 * This points out a cache problem in the ARM linux kernel
 * Using the cache in Write-Through mode (kernel command line option: cachepolicy=writethrough)
 * or CONFIG_XSCALE_CACHE_ERRATA=y in older kernels prevents this problem
 *
 * (C) Lothar Wassmann, <LW@KARO-electronics.de>
 *
 */
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/ioctl.h>

#define PAGE_SIZE	4096
#define PAGE_SIZE_INT	((PAGE_SIZE)/sizeof(unsigned long))
#define PAGE_MASK	((PAGE_SIZE)-1)

#undef USE_BLKFLSBUF
#define BLKFRASET  _IO(0x12,100)/* set filesystem (mm/filemap.c) read-ahead */


size_t file_size = 256 * PAGE_SIZE;

unsigned long *buf=NULL;

const char* fn="testfile";

void usage(const char* name)
{
	printf("%s <mount point> [filename]\n", name);
	printf("\trequires <mount point> to be defined in /etc/fstab\n");
	printf("\t<mount point> will be unmounted and remounted during the test\n");
}

int create_file(const char* name, size_t size)
{
	int ret=0;
	int i;
	int fd;

	fd = open(name, O_CREAT|O_RDWR|O_SYNC|O_TRUNC, S_IWUSR|S_IRUSR|S_IRGRP|S_IROTH);
	if (fd < 0) {
		fprintf(stderr, "Failed to open '%s' for writing, errno=%d\n", name, errno);
		return errno;
	}

	for (i = size / sizeof(*buf); i > 0; i--) {
		buf[i-1] = i;
	}
	write(fd, buf, size);
	memset(buf, 0x55, size);

	close(fd);
	return ret;
}

int do_check(int fd, void *mapptr, size_t size)
{
	const int num_pages=size/PAGE_SIZE;
	volatile unsigned char *ptr=mapptr;
	int errors = 0;
	int soft = 0;
	int page;

	printf("Checking data from %08lx to %08lx\n", (unsigned long)(ptr + size),
	       (unsigned long)ptr);

	for (page = num_pages - 1; page >= 0; page--) {
		volatile unsigned long *pp=(volatile unsigned long *)&ptr[page*PAGE_SIZE];
		int offs;
		int page_errs=0;
		int err_offs=-1;

		for (offs = 0; offs < PAGE_SIZE; offs += sizeof(unsigned long)) {
			volatile unsigned long *lp=&pp[offs/sizeof(unsigned long)];
			unsigned long data=*lp;
			unsigned long ref=(((page*PAGE_SIZE)+offs)/sizeof(data)) + 1;

			if (data != ref) {
				const int max_tries=100000;
				int retries=max_tries;
				unsigned long new_data=*lp;

				errors++;
				page_errs++;
				while ((new_data != ref) && (--retries > 0)) {
					if (data != new_data) {
						fprintf(stderr, "Data @ page %03x:%03x (%08lx) changed to %08lx(%08lx)\n",
							page, offs, (unsigned long)lp, new_data, ref);
					}
					data = new_data;
					new_data = *lp;
				}
				if (new_data == ref) {
					fprintf(stderr, "Data @ page %03x:%03x (%08lx) OK after %d retries: %08lx\n",
						page, offs, (unsigned long)lp, max_tries - retries, new_data);
					soft++;
				} else {
					if (err_offs != offs) {
						fprintf(stderr, "Data error @ page %03x:%03x (%08lx): %08lx -> %08lx\n",
							page, offs, (unsigned long)lp, ref, data);
						err_offs = offs;
					}
					// retry the same page again, until data is correct
					offs = 0;
				}
			}
		}
		if (page_errs) {
			page = num_pages;
		}
	}

	fprintf(stderr, "Errors reverse check: %d; soft: %d; total bytes %d in %d pages\n",
		errors, soft, size, num_pages);

	return errors;
}

int check_file(const char* name, size_t size)
{
	int ret=0;
	int fd;
	void *ptr=NULL;
	int errors=0;
	int last_errors=0;

	fd = open(name, O_RDONLY|O_SYNC);
	if (fd < 0) {
		fprintf(stderr, "Failed to open '%s' for reading\n", name);
		return errno;
	}

	ptr = mmap(NULL, size, PROT_READ, MAP_SHARED/*PRIVATE*/, fd, 0);
	if (ptr == MAP_FAILED) {
		close(fd);
		return -ENOMEM;
	}

	printf("Checking file '%s'\n", name);
	do {
		last_errors = errors;
		errors = do_check(fd, ptr, size);
		if (errors != 0) {
			ret = errors;
		}
	} while (errors > 0 && errors != last_errors);

	if (munmap(ptr, size) != 0) {
		fprintf(stderr, "Failed to unmap %08lx\n", (unsigned long)ptr);
		if (ret == 0) {
			ret = -ENOMEM;
		}
	}
	close(fd);
	if (buf != NULL) {
		memset(buf, 0x55, size);
	}

	if (ret == 0) {
		printf("check successful\n");
	} else {
		printf("check failed\n");
	}

	return ret;
}

int main(int argc, char *argv[])
{
	int rc=0;
	char fname[100];
	char mount[44];
	char umount[44];

	if (argc < 2) {
		// first argument is required
		usage(argv[0]);
		return 1;
	}
	if (argc > 2) {
		// take optional second argument as filename
		fn = argv[2];
	}

	sprintf(fname, "%s/%s", argv[1], fn);
	sprintf(mount, "mount %s", argv[1]);
	sprintf(umount, "umount %s", argv[1]);

	file_size &= ~PAGE_MASK; // round size to page boundary
	buf = malloc(file_size);

	if (buf == NULL) {
		fprintf(stderr, "Failed to allocate buffer\n");
		rc = -ENOMEM;
	}

#ifdef USE_BLKFLSBUF	
	printf("Mounting '%s'\n", argv[1]);
	system(mount);
#endif

	while (rc == 0) {
		printf("Opening '%s'\n", fname);
		rc = create_file(fname, file_size);
		if (rc != 0) {
			fprintf(stderr, "Failed to create file '%s', rc=%d\n", fname, rc);
			break;
		}

#ifndef USE_BLKFLSBUF
		printf("Unmounting '%s'\n", argv[1]);
		system(umount);

		printf("Remounting '%s'\n", argv[1]);
		system(mount);
#else
		{
			int fd = open("/dev/loop0", O_RDONLY);
			ioctl(fd, BLKFLSBUF, 0);
			ioctl(fd, BLKRASET, 0);
			ioctl(fd, BLKFRASET, 0);
			close(fd);
		}
#endif

		rc = check_file(fname, file_size);
	}

	if (buf != NULL) {
		free(buf);
	}

	return rc;
}

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 16:30                                                                 ` Russell King - ARM Linux
@ 2010-03-04 17:34                                                                   ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 17:34 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Paul Mundt, FUJITA Tomonori, mdharm-kernel, oliver, greg,
	x0082077, sshtylyov, benh, bigeasy, linux-usb, linux-kernel,
	James Bottomley, santosh.shilimkar, Pavel Machek, tom.leiming,
	linux-arm-kernel

On Thu, 2010-03-04 at 16:30 +0000, Russell King - ARM Linux wrote:
> On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> > On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > > The thing which was discovered in this thread is basically that ARM is
> > > > handling deferred flushing (for D/I coherency) in a slightly different
> > > > way from everyone else ...
> > >
> > > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> > SH used to use it as a PG_mapped which was roughly similar to the
> > PG_dcache_clean approach, at which point things like flushing for the PIO
> > case in the HCD wasn't necessary. It did result in rather aggressive over
> > flushing though, which is one of the reasons we elected to switch to
> > PG_dcache_dirty.
> >
> > Note that the PG_dcache_dirty semantics are also outlined in
> > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.
> 
> Indeed; the ARM approach was basically taken from Sparc64.
[...]
> The general critera (from memory) seems to be:
> - a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
> - write allocate caches show the problem better than read allocate only
> - using a block device for the filesystem
> - mmap'ing a page and immediately accessing the last few cache lines in
>   that page

It actually triggers easily with a non-aliasing VIPT cache (can't even
start /sbin/init). The main condition is for the caches to be in
write-allocate mode (and the processor to support this, i.e. Cortex-A9).

A simple test is to use an ext2/3 filesystem (cramfs, jffs2 etc.
wouldn't do since they call flush_dcache_page) on a compact flash card
using the pata_platform driver (and without commit 2d68b7fe55d9e19).

Other forms of triggering this is to use something like slram + ext2/3.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 17:34                                                                   ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 17:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 16:30 +0000, Russell King - ARM Linux wrote:
> On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> > On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > > The thing which was discovered in this thread is basically that ARM is
> > > > handling deferred flushing (for D/I coherency) in a slightly different
> > > > way from everyone else ...
> > >
> > > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> > SH used to use it as a PG_mapped which was roughly similar to the
> > PG_dcache_clean approach, at which point things like flushing for the PIO
> > case in the HCD wasn't necessary. It did result in rather aggressive over
> > flushing though, which is one of the reasons we elected to switch to
> > PG_dcache_dirty.
> >
> > Note that the PG_dcache_dirty semantics are also outlined in
> > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.
> 
> Indeed; the ARM approach was basically taken from Sparc64.
[...]
> The general critera (from memory) seems to be:
> - a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
> - write allocate caches show the problem better than read allocate only
> - using a block device for the filesystem
> - mmap'ing a page and immediately accessing the last few cache lines in
>   that page

It actually triggers easily with a non-aliasing VIPT cache (can't even
start /sbin/init). The main condition is for the caches to be in
write-allocate mode (and the processor to support this, i.e. Cortex-A9).

A simple test is to use an ext2/3 filesystem (cramfs, jffs2 etc.
wouldn't do since they call flush_dcache_page) on a compact flash card
using the pata_platform driver (and without commit 2d68b7fe55d9e19).

Other forms of triggering this is to use something like slram + ext2/3.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 17:34                                                                   ` Catalin Marinas
@ 2010-03-04 17:54                                                                     ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 17:54 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paul Mundt, FUJITA Tomonori, mdharm-kernel, oliver, greg,
	x0082077, sshtylyov, benh, bigeasy, linux-usb, linux-kernel,
	James Bottomley, santosh.shilimkar, Pavel Machek, tom.leiming,
	linux-arm-kernel

On Thu, Mar 04, 2010 at 05:34:28PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 16:30 +0000, Russell King - ARM Linux wrote:
> > On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> > > On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > > > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > > > The thing which was discovered in this thread is basically that ARM is
> > > > > handling deferred flushing (for D/I coherency) in a slightly different
> > > > > way from everyone else ...
> > > >
> > > > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > > > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > > > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > > 
> > > SH used to use it as a PG_mapped which was roughly similar to the
> > > PG_dcache_clean approach, at which point things like flushing for the PIO
> > > case in the HCD wasn't necessary. It did result in rather aggressive over
> > > flushing though, which is one of the reasons we elected to switch to
> > > PG_dcache_dirty.
> > >
> > > Note that the PG_dcache_dirty semantics are also outlined in
> > > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.
> > 
> > Indeed; the ARM approach was basically taken from Sparc64.
> [...]
> > The general critera (from memory) seems to be:
> > - a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
> > - write allocate caches show the problem better than read allocate only
> > - using a block device for the filesystem
> > - mmap'ing a page and immediately accessing the last few cache lines in
> >   that page
> 
> It actually triggers easily with a non-aliasing VIPT cache (can't even
> start /sbin/init). The main condition is for the caches to be in
> write-allocate mode (and the processor to support this, i.e. Cortex-A9).
> 
> A simple test is to use an ext2/3 filesystem (cramfs, jffs2 etc.
> wouldn't do since they call flush_dcache_page) on a compact flash card
> using the pata_platform driver (and without commit 2d68b7fe55d9e19).

Yes, but this is a combination of hardware has only become available to
me in the last three months.

Previously, I've had reports of ext2 on CF cards on PXA255 based systems
giving problems.  However, I have a PXA255 system which runs its rootfs
off a CF card (which runs applications such as Abiword and gnumeric), but
it has never exhibited the reported problems...

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 17:54                                                                     ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 17:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 05:34:28PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 16:30 +0000, Russell King - ARM Linux wrote:
> > On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> > > On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > > > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > > > The thing which was discovered in this thread is basically that ARM is
> > > > > handling deferred flushing (for D/I coherency) in a slightly different
> > > > > way from everyone else ...
> > > >
> > > > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > > > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > > > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > > 
> > > SH used to use it as a PG_mapped which was roughly similar to the
> > > PG_dcache_clean approach, at which point things like flushing for the PIO
> > > case in the HCD wasn't necessary. It did result in rather aggressive over
> > > flushing though, which is one of the reasons we elected to switch to
> > > PG_dcache_dirty.
> > >
> > > Note that the PG_dcache_dirty semantics are also outlined in
> > > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.
> > 
> > Indeed; the ARM approach was basically taken from Sparc64.
> [...]
> > The general critera (from memory) seems to be:
> > - a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
> > - write allocate caches show the problem better than read allocate only
> > - using a block device for the filesystem
> > - mmap'ing a page and immediately accessing the last few cache lines in
> >   that page
> 
> It actually triggers easily with a non-aliasing VIPT cache (can't even
> start /sbin/init). The main condition is for the caches to be in
> write-allocate mode (and the processor to support this, i.e. Cortex-A9).
> 
> A simple test is to use an ext2/3 filesystem (cramfs, jffs2 etc.
> wouldn't do since they call flush_dcache_page) on a compact flash card
> using the pata_platform driver (and without commit 2d68b7fe55d9e19).

Yes, but this is a combination of hardware has only become available to
me in the last three months.

Previously, I've had reports of ext2 on CF cards on PXA255 based systems
giving problems.  However, I have a PXA255 system which runs its rootfs
off a CF card (which runs applications such as Abiword and gnumeric), but
it has never exhibited the reported problems...

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 15:41                                                               ` Paul Mundt
@ 2010-03-04 18:07                                                                 ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 18:07 UTC (permalink / raw)
  To: Paul Mundt
  Cc: James Bottomley, Pavel Machek, FUJITA Tomonori, benh, linux,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Thu, 2010-03-04 at 15:41 +0000, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ...
> >
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.

Are you more in favour if a PIO kmap API than inverting the meaning of
PG_arch_1? 

I'm not familiar with SH but for PIO devices the flushing shouldn't be
more aggressive. For the DMA devices, Russell suggested that we mark the
page as clean (set PG_dcache_clean) in the DMA API to avoid the default
flushing.

> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Yes, but the flush_dcache_page() semantics outlined in the same file
aren't followed by all the PIO drivers in the kernel.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 18:07                                                                 ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 18:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 15:41 +0000, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ...
> >
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.

Are you more in favour if a PIO kmap API than inverting the meaning of
PG_arch_1? 

I'm not familiar with SH but for PIO devices the flushing shouldn't be
more aggressive. For the DMA devices, Russell suggested that we mark the
page as clean (set PG_dcache_clean) in the DMA API to avoid the default
flushing.

> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Yes, but the flush_dcache_page() semantics outlined in the same file
aren't followed by all the PIO drivers in the kernel.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04  8:26                                                               ` James Bottomley
@ 2010-03-04 21:25                                                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: mdharm-kernel, linux-usb, linux, tom.leiming, x0082077,
	sshtylyov, catalin.marinas, bigeasy, oliver, linux-kernel,
	FUJITA Tomonori, santosh.shilimkar, pavel, greg,
	linux-arm-kernel


> > Still, you do need to flush I when a page cache page is recycled.
> 
> Technically not if we've got all the I flushing when mapped executable
> sorted out.  This is one of the dangers of over flushing ... if we start
> flushing where we don't need it "just to be sure" we end up papering
> over holes in the operating system and make catching actual bugs in
> operations a lot harder.

Well, ok so we are talking past each other here :-) So let me try to
summarize what we do, and then write up what I'd like to be able to do
but can't quite see how to get there just yet.

On PPC, we keep track of whether a page is "cache clean" with PG_arch1. 

We only bother with flushing it when mapping it and yes, it's an
expensive operation.

We do it from within set_pte_at() and/or ptep_set_access_flags(), at
which point w test PG_arch_1, and if clear, do the flush and set it.

On systems that support per-page exec permission, we optimize things a
bit, in that unless this is an exec fault, we "skip" the flush when
mapping the page and filter out the exec permission (so that's a read
access for example). We later do the flush when exec is attempted.

On systems that don't (earlier 32-bit powerpc), we -have- to flush any
mapped page sadly as one could be mapped for read and actually executed
from. This is -not- a case of "let userspace shoot themselves in the
foot", letting stale icache leak through to userspace here is actually a
security hole in theory (granted, unlikely but we got barked at enough
when we tried to optimize that out).

Now, when we do the flush as described above, we do both D$ and I$
passes at once.

It would be indeed nice to be able to avoid the D$ flush when the page
was the target of a DMA operation, since the D$ flush is the most
expensive part of the process.

However, I don't see how to do that without having a separate page bit
to keep track of the D$ vs. I$ state. For example, if we use PG_arch_1
exclusively for D$, and always flush I$ on mapping to userspace, we end
up with a lot of I$ spurrious flushes any time glibc text for example is
mapped into a new process.

> The other thing you might not appreciate in ppc land is that for a lot
> of other systems (well, like parisc) flushing a dirty cache line is
> incredibly expensive (because we halt the CPU to wait for the memory
> eviction), 

Same here. High end server PPCs have the I$ snoop the D$ but on all the
other ones, we pay a dear price for those flushes, which is why I'm
trying to see how I could exploit the trick of not doing the D$ side
flush at least for targets of DMA ops, but as I said, I can't see how it
can be done properly without another tracking bit in struct page.

> so ideally we want to flush as late as possible to give the
> natural operations a chance to clean most of the cache lines.  Flushing
> a clean cache line on parisc as well as invalidations are fast
> operations.  That's why the kmap makes the most sense to us for
> implementing PIO ops ... it's the farthest point we can flush the cache
> at (because beyond it we've lost the mapping the VIPT cache requires to

Cheers,
Ben.

> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 21:25                                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:25 UTC (permalink / raw)
  To: linux-arm-kernel


> > Still, you do need to flush I when a page cache page is recycled.
> 
> Technically not if we've got all the I flushing when mapped executable
> sorted out.  This is one of the dangers of over flushing ... if we start
> flushing where we don't need it "just to be sure" we end up papering
> over holes in the operating system and make catching actual bugs in
> operations a lot harder.

Well, ok so we are talking past each other here :-) So let me try to
summarize what we do, and then write up what I'd like to be able to do
but can't quite see how to get there just yet.

On PPC, we keep track of whether a page is "cache clean" with PG_arch1. 

We only bother with flushing it when mapping it and yes, it's an
expensive operation.

We do it from within set_pte_at() and/or ptep_set_access_flags(), at
which point w test PG_arch_1, and if clear, do the flush and set it.

On systems that support per-page exec permission, we optimize things a
bit, in that unless this is an exec fault, we "skip" the flush when
mapping the page and filter out the exec permission (so that's a read
access for example). We later do the flush when exec is attempted.

On systems that don't (earlier 32-bit powerpc), we -have- to flush any
mapped page sadly as one could be mapped for read and actually executed
from. This is -not- a case of "let userspace shoot themselves in the
foot", letting stale icache leak through to userspace here is actually a
security hole in theory (granted, unlikely but we got barked at enough
when we tried to optimize that out).

Now, when we do the flush as described above, we do both D$ and I$
passes at once.

It would be indeed nice to be able to avoid the D$ flush when the page
was the target of a DMA operation, since the D$ flush is the most
expensive part of the process.

However, I don't see how to do that without having a separate page bit
to keep track of the D$ vs. I$ state. For example, if we use PG_arch_1
exclusively for D$, and always flush I$ on mapping to userspace, we end
up with a lot of I$ spurrious flushes any time glibc text for example is
mapped into a new process.

> The other thing you might not appreciate in ppc land is that for a lot
> of other systems (well, like parisc) flushing a dirty cache line is
> incredibly expensive (because we halt the CPU to wait for the memory
> eviction), 

Same here. High end server PPCs have the I$ snoop the D$ but on all the
other ones, we pay a dear price for those flushes, which is why I'm
trying to see how I could exploit the trick of not doing the D$ side
flush at least for targets of DMA ops, but as I said, I can't see how it
can be done properly without another tracking bit in struct page.

> so ideally we want to flush as late as possible to give the
> natural operations a chance to clean most of the cache lines.  Flushing
> a clean cache line on parisc as well as invalidations are fast
> operations.  That's why the kmap makes the most sense to us for
> implementing PIO ops ... it's the farthest point we can flush the cache
> at (because beyond it we've lost the mapping the VIPT cache requires to

Cheers,
Ben.

> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 14:21                                                           ` James Bottomley
@ 2010-03-04 21:28                                                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pavel Machek, mdharm-kernel, oliver, linux, tom.leiming, greg,
	x0082077, sshtylyov, Catalin Marinas, bigeasy, linux-usb,
	linux-kernel, FUJITA Tomonori, santosh.shilimkar,
	linux-arm-kernel

On Thu, 2010-03-04 at 19:51 +0530, James Bottomley wrote:
> 
> Technically, he is.  In the old days, most VI architectures were high
> end enough not to require PIO transfers.  The only exception was an
> IDE driver used by sparc, which lead to the arch specific ide in/out
> string instructions, in which sparc actually did all the necessary
> flushing.

Actually, Catalin's problem is with newer PIPT ARM :-)

> So no other drivers than old IDE grew up with cache flushing in the
> PIO case (and almost no high end VI hardware had an IDE interface, so
> they rarely got implemented in the arch layer).  However, recently,
> with the transition from old IDE to libata and the prevalence of ARM
> with more commodity hardware, the deficiency is becoming exposed.
> Even the PA8000 workstations now come with an IDE CD, which means
> we're starting to have problems with them as well.

I don't think there's a core or driver problem in this specific case. As
we discussed earlier, I believe the problem is that ARM considers a
fresh page out of the page cache as "clean" instead of "dirty", and
inverting that like we do on powerpc will fix their problem too.

> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> > 
> > You could argue that performance improvement (how big is it,
> anyway?)
> > is worth it, but this should be agreed to by wider community...
> 
> Performance is always worth it provided we don't sacrifice
> correctness.
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... once that's fixed, ARM will likely not have
> the D/I problem, but we'll still have the libata (and other PIO
> systems) D flushing issue. 

You mean older VIVT ARM will grow a new issue there ?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 21:28                                                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 19:51 +0530, James Bottomley wrote:
> 
> Technically, he is.  In the old days, most VI architectures were high
> end enough not to require PIO transfers.  The only exception was an
> IDE driver used by sparc, which lead to the arch specific ide in/out
> string instructions, in which sparc actually did all the necessary
> flushing.

Actually, Catalin's problem is with newer PIPT ARM :-)

> So no other drivers than old IDE grew up with cache flushing in the
> PIO case (and almost no high end VI hardware had an IDE interface, so
> they rarely got implemented in the arch layer).  However, recently,
> with the transition from old IDE to libata and the prevalence of ARM
> with more commodity hardware, the deficiency is becoming exposed.
> Even the PA8000 workstations now come with an IDE CD, which means
> we're starting to have problems with them as well.

I don't think there's a core or driver problem in this specific case. As
we discussed earlier, I believe the problem is that ARM considers a
fresh page out of the page cache as "clean" instead of "dirty", and
inverting that like we do on powerpc will fix their problem too.

> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> > 
> > You could argue that performance improvement (how big is it,
> anyway?)
> > is worth it, but this should be agreed to by wider community...
> 
> Performance is always worth it provided we don't sacrifice
> correctness.
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... once that's fixed, ARM will likely not have
> the D/I problem, but we'll still have the libata (and other PIO
> systems) D flushing issue. 

You mean older VIVT ARM will grow a new issue there ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 15:25                                                               ` Catalin Marinas
@ 2010-03-04 21:31                                                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:31 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Russell King - ARM Linux, FUJITA Tomonori, mdharm-kernel, oliver,
	greg, x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel,
	James Bottomley, santosh.shilimkar, Pavel Machek, tom.leiming,
	linux-arm-kernel

On Thu, 2010-03-04 at 15:25 +0000, Catalin Marinas wrote:
> My understanding from this long discussion is that we cannot get the
> kernel modifying a page cache page which is already mapped in user space
> (well, ptrace does this but we flush the cache there already).

Well, we -can- but it appears that we don't have to provide coherency
in that case since the modification is always done as the result of
userspace explicitely requesting that change (aka read() syscall) and
thus userspace is responsible for the flushing.

Cheers,
Ben.





^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 21:31                                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 15:25 +0000, Catalin Marinas wrote:
> My understanding from this long discussion is that we cannot get the
> kernel modifying a page cache page which is already mapped in user space
> (well, ptrace does this but we flush the cache there already).

Well, we -can- but it appears that we don't have to provide coherency
in that case since the modification is always done as the result of
userspace explicitely requesting that change (aka read() syscall) and
thus userspace is responsible for the flushing.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 15:41                                                               ` Paul Mundt
@ 2010-03-04 21:34                                                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:34 UTC (permalink / raw)
  To: Paul Mundt
  Cc: Catalin Marinas, FUJITA Tomonori, mdharm-kernel, oliver, linux,
	greg, x0082077, sshtylyov, bigeasy, linux-usb, linux-kernel,
	James Bottomley, santosh.shilimkar, Pavel Machek, tom.leiming,
	linux-arm-kernel

On Fri, 2010-03-05 at 00:41 +0900, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... 
> > 
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.
> 
> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Doing this way though is a lot more fragile... since page cache pages
are no longer dirty by default, you need to ensure that any driver
writing to one without DMA sets PG_arch_1, and as we've seen, this is
generally not the case (it's almost never the case actually).

Also, in the DMA case, you may not need to flush D$, but you -still-
need to invalidate I$, and unless you then get another bit for tracking
it, you end up doing a lot of over-invalidating of I$ no ?

Or am I missing a critical piece of the puzzle ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 21:34                                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-03-05 at 00:41 +0900, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... 
> > 
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.
> 
> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Doing this way though is a lot more fragile... since page cache pages
are no longer dirty by default, you need to ensure that any driver
writing to one without DMA sets PG_arch_1, and as we've seen, this is
generally not the case (it's almost never the case actually).

Also, in the DMA case, you may not need to flush D$, but you -still-
need to invalidate I$, and unless you then get another bit for tracking
it, you end up doing a lot of over-invalidating of I$ no ?

Or am I missing a critical piece of the puzzle ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 18:07                                                                 ` Catalin Marinas
@ 2010-03-04 21:37                                                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:37 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paul Mundt, James Bottomley, Pavel Machek, FUJITA Tomonori,
	linux, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel

On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> 
> Are you more in favour if a PIO kmap API than inverting the meaning of
> PG_arch_1? 

My main worry with this approach is the sheer amount of drivers that
need fixing. I believe inverting PG_arch_1 is a better solution and I
somewhat fail to see how we end up doing too much flushing if we have
per-page execute permission (but maybe SH doesn't ?)

> I'm not familiar with SH but for PIO devices the flushing shouldn't be
> more aggressive. For the DMA devices, Russell suggested that we mark
> the
> page as clean (set PG_dcache_clean) in the DMA API to avoid the
> default
> flushing.

I really like that idea, as I said earlier, but I'm worried about the I$
side of things. IE. What I'm trying to say is that I can't see how to do
that optimisation without ending up with missing I$ invalidations or
doing way too many of them, unless we have a separate bit to track I$
state.

> > Note that the PG_dcache_dirty semantics are also outlined in
> > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly
> esoteric.
> 
> Yes, but the flush_dcache_page() semantics outlined in the same file
> aren't followed by all the PIO drivers in the kernel.
> 

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 21:37                                                                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> 
> Are you more in favour if a PIO kmap API than inverting the meaning of
> PG_arch_1? 

My main worry with this approach is the sheer amount of drivers that
need fixing. I believe inverting PG_arch_1 is a better solution and I
somewhat fail to see how we end up doing too much flushing if we have
per-page execute permission (but maybe SH doesn't ?)

> I'm not familiar with SH but for PIO devices the flushing shouldn't be
> more aggressive. For the DMA devices, Russell suggested that we mark
> the
> page as clean (set PG_dcache_clean) in the DMA API to avoid the
> default
> flushing.

I really like that idea, as I said earlier, but I'm worried about the I$
side of things. IE. What I'm trying to say is that I can't see how to do
that optimisation without ending up with missing I$ invalidations or
doing way too many of them, unless we have a separate bit to track I$
state.

> > Note that the PG_dcache_dirty semantics are also outlined in
> > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly
> esoteric.
> 
> Yes, but the flush_dcache_page() semantics outlined in the same file
> aren't followed by all the PIO drivers in the kernel.
> 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 21:28                                                             ` Benjamin Herrenschmidt
@ 2010-03-04 21:40                                                               ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 21:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: James Bottomley, Pavel Machek, mdharm-kernel, oliver,
	tom.leiming, greg, x0082077, sshtylyov, Catalin Marinas, bigeasy,
	linux-usb, linux-kernel, FUJITA Tomonori, santosh.shilimkar,
	linux-arm-kernel

On Fri, Mar 05, 2010 at 08:28:34AM +1100, Benjamin Herrenschmidt wrote:
> I don't think there's a core or driver problem in this specific case. As
> we discussed earlier, I believe the problem is that ARM considers a
> fresh page out of the page cache as "clean" instead of "dirty", and
> inverting that like we do on powerpc will fix their problem too.

The only concern is that it means we treat anonymous pages as dirty
by default.

That's quite sub-optimal since we take care (eg) on write faults to
copy the page and take care of the cache issues while we do that -
whether that be remapping the page to be coherent with the user
address, or cleaning each cache line as we copy the data.

Of course, the simple solution is to also arrange for PG_arch_1 to be
set in this case.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 21:40                                                               ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 21:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 08:28:34AM +1100, Benjamin Herrenschmidt wrote:
> I don't think there's a core or driver problem in this specific case. As
> we discussed earlier, I believe the problem is that ARM considers a
> fresh page out of the page cache as "clean" instead of "dirty", and
> inverting that like we do on powerpc will fix their problem too.

The only concern is that it means we treat anonymous pages as dirty
by default.

That's quite sub-optimal since we take care (eg) on write faults to
copy the page and take care of the cache issues while we do that -
whether that be remapping the page to be coherent with the user
address, or cleaning each cache line as we copy the data.

Of course, the simple solution is to also arrange for PG_arch_1 to be
set in this case.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 21:37                                                                   ` Benjamin Herrenschmidt
@ 2010-03-04 22:11                                                                     ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 22:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul Mundt, James Bottomley, Pavel Machek, FUJITA Tomonori,
	linux, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel

On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > more aggressive. For the DMA devices, Russell suggested that we mark
> > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > default flushing.
> 
> I really like that idea, as I said earlier, but I'm worried about the I$
> side of things. IE. What I'm trying to say is that I can't see how to do
> that optimisation without ending up with missing I$ invalidations or
> doing way too many of them, unless we have a separate bit to track I$
> state.

But does this optimisation really matter? I think with careful checking
in set_pte_at(), you are not going to invalidate the I-cache more than
necessary. If the original page wasn't pte_present() you would need to
do the I-cache invalidation. The other cases where set_pte_at() is
called for LRU (pte_young) or COW (pte_write) we can avoid the extra
invalidation.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-04 22:11                                                                     ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-04 22:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > more aggressive. For the DMA devices, Russell suggested that we mark
> > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > default flushing.
> 
> I really like that idea, as I said earlier, but I'm worried about the I$
> side of things. IE. What I'm trying to say is that I can't see how to do
> that optimisation without ending up with missing I$ invalidations or
> doing way too many of them, unless we have a separate bit to track I$
> state.

But does this optimisation really matter? I think with careful checking
in set_pte_at(), you are not going to invalidate the I-cache more than
necessary. If the original page wasn't pte_present() you would need to
do the I-cache invalidation. The other cases where set_pte_at() is
called for LRU (pte_young) or COW (pte_write) we can avoid the extra
invalidation.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 16:30                                                                 ` Russell King - ARM Linux
  (?)
  (?)
@ 2010-03-04 22:27                                                                 ` Andreas Mohr
  -1 siblings, 0 replies; 352+ messages in thread
From: Andreas Mohr @ 2010-03-04 22:27 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Paul Mundt, Catalin Marinas, FUJITA Tomonori, mdharm-kernel,
	oliver, greg, x0082077, sshtylyov, benh, bigeasy, linux-kernel

> Here is a program which Lothar sent me some time ago (the timestamp on
> the .c is June 2004 - I can't find the original email though.)  I've
> just checked with Lothar, who has given me permission to reproduce it.
> 
> I can't guarantee that this program still shows a problem - since I
> believe I've never been able to reproduce it myself.  It might be worth
> checking how other architectures behave.

Tried this on my BCM4710 MIPSEL 2.6.31.9 OpenWrt/Debian (problematic
cache-suspected history due to possibly related USB-audio lockups),
/dev/sda3 ext2 on an USB stick, no errors here, even when increasing tenfold
to 2560 and adding a sleep(2) in between.

Will investigate these things for real sometime later.

Andreas Mohr

P.S.: KARO (Lothar...) makes very nice boards :-)

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 21:37                                                                   ` Benjamin Herrenschmidt
  (?)
@ 2010-03-05  1:17                                                                     ` Paul Mundt
  -1 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-05  1:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 08:37:40AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > Are you more in favour if a PIO kmap API than inverting the meaning of
> > PG_arch_1? 
> 
> My main worry with this approach is the sheer amount of drivers that
> need fixing. I believe inverting PG_arch_1 is a better solution and I
> somewhat fail to see how we end up doing too much flushing if we have
> per-page execute permission (but maybe SH doesn't ?)
> 
Basically we have two different MMUs on VIPT parts, the older one on all
SH-4 parts were all read-implies-exec with no ability to differentiate
between read or exec access. For these parts the PG_dcache_dirty approach
saves us from a lot of flushing, and the corner cases were isolated
enough that we could tolerate fixups at the driver level, even on a
write-allocate D-cache.

For second generation SH-4A (SH-X2) and up parts, read and exec are split
out and we could reasonably adopt the PG_dcache_clean approach there
while adopting the same sort of flushing semantics as PPC to avoid
flushing constantly. The current generation of parts far outnumber their
legacy counterparts, so it's certainly something I plan to experiment
with.

We have an additional level of complexity on some of the SMP parts with a
non-coherent I-cache, some of the early CPUs have broken broadcasting of
the cacheops in hardware and so need to rely on IPIs, while the later
parts broadcast properly. We also need to deal with D-cache IPIs when
using mixed coherency protocols on different CPUs.

For older PIPT parts we've never used the deferred flush, since the only
time we ever had to bother with cache maintenance was in the DMA ops, as
anything closer to the CPU than the PCI DMAC had no opportunity to be
snooped.

> > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > more aggressive. For the DMA devices, Russell suggested that we mark
> > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > default flushing.
> 
> I really like that idea, as I said earlier, but I'm worried about the I$
> side of things. IE. What I'm trying to say is that I can't see how to do
> that optimisation without ending up with missing I$ invalidations or
> doing way too many of them, unless we have a separate bit to track I$
> state.
> 
Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
and certainly worth experimenting with. I don't know how we would do the
I-cache optimization without a PG_arch_2, though.

In any event, if there's going to be a mass exodus to PG_dcache_clean,
Documentation/cachetlb.txt could use a considerable amount of expanding.
The read/exec and I-cache optimizations are something that would be
valuable to document, as opposed to simply being pointed at the sparc64
approach with the regular PG_dcache_dirty caveats.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
@ 2010-03-05  1:17                                                                     ` Paul Mundt
  0 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-05  1:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, James Bottomley, Pavel Machek, FUJITA Tomonori,
	linux, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel, linux-sh

On Fri, Mar 05, 2010 at 08:37:40AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > Are you more in favour if a PIO kmap API than inverting the meaning of
> > PG_arch_1? 
> 
> My main worry with this approach is the sheer amount of drivers that
> need fixing. I believe inverting PG_arch_1 is a better solution and I
> somewhat fail to see how we end up doing too much flushing if we have
> per-page execute permission (but maybe SH doesn't ?)
> 
Basically we have two different MMUs on VIPT parts, the older one on all
SH-4 parts were all read-implies-exec with no ability to differentiate
between read or exec access. For these parts the PG_dcache_dirty approach
saves us from a lot of flushing, and the corner cases were isolated
enough that we could tolerate fixups at the driver level, even on a
write-allocate D-cache.

For second generation SH-4A (SH-X2) and up parts, read and exec are split
out and we could reasonably adopt the PG_dcache_clean approach there
while adopting the same sort of flushing semantics as PPC to avoid
flushing constantly. The current generation of parts far outnumber their
legacy counterparts, so it's certainly something I plan to experiment
with.

We have an additional level of complexity on some of the SMP parts with a
non-coherent I-cache, some of the early CPUs have broken broadcasting of
the cacheops in hardware and so need to rely on IPIs, while the later
parts broadcast properly. We also need to deal with D-cache IPIs when
using mixed coherency protocols on different CPUs.

For older PIPT parts we've never used the deferred flush, since the only
time we ever had to bother with cache maintenance was in the DMA ops, as
anything closer to the CPU than the PCI DMAC had no opportunity to be
snooped.

> > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > more aggressive. For the DMA devices, Russell suggested that we mark
> > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > default flushing.
> 
> I really like that idea, as I said earlier, but I'm worried about the I$
> side of things. IE. What I'm trying to say is that I can't see how to do
> that optimisation without ending up with missing I$ invalidations or
> doing way too many of them, unless we have a separate bit to track I$
> state.
> 
Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
and certainly worth experimenting with. I don't know how we would do the
I-cache optimization without a PG_arch_2, though.

In any event, if there's going to be a mass exodus to PG_dcache_clean,
Documentation/cachetlb.txt could use a considerable amount of expanding.
The read/exec and I-cache optimizations are something that would be
valuable to document, as opposed to simply being pointed at the sparc64
approach with the regular PG_dcache_dirty caveats.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-05  1:17                                                                     ` Paul Mundt
  0 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-05  1:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 08:37:40AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > Are you more in favour if a PIO kmap API than inverting the meaning of
> > PG_arch_1? 
> 
> My main worry with this approach is the sheer amount of drivers that
> need fixing. I believe inverting PG_arch_1 is a better solution and I
> somewhat fail to see how we end up doing too much flushing if we have
> per-page execute permission (but maybe SH doesn't ?)
> 
Basically we have two different MMUs on VIPT parts, the older one on all
SH-4 parts were all read-implies-exec with no ability to differentiate
between read or exec access. For these parts the PG_dcache_dirty approach
saves us from a lot of flushing, and the corner cases were isolated
enough that we could tolerate fixups at the driver level, even on a
write-allocate D-cache.

For second generation SH-4A (SH-X2) and up parts, read and exec are split
out and we could reasonably adopt the PG_dcache_clean approach there
while adopting the same sort of flushing semantics as PPC to avoid
flushing constantly. The current generation of parts far outnumber their
legacy counterparts, so it's certainly something I plan to experiment
with.

We have an additional level of complexity on some of the SMP parts with a
non-coherent I-cache, some of the early CPUs have broken broadcasting of
the cacheops in hardware and so need to rely on IPIs, while the later
parts broadcast properly. We also need to deal with D-cache IPIs when
using mixed coherency protocols on different CPUs.

For older PIPT parts we've never used the deferred flush, since the only
time we ever had to bother with cache maintenance was in the DMA ops, as
anything closer to the CPU than the PCI DMAC had no opportunity to be
snooped.

> > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > more aggressive. For the DMA devices, Russell suggested that we mark
> > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > default flushing.
> 
> I really like that idea, as I said earlier, but I'm worried about the I$
> side of things. IE. What I'm trying to say is that I can't see how to do
> that optimisation without ending up with missing I$ invalidations or
> doing way too many of them, unless we have a separate bit to track I$
> state.
> 
Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
and certainly worth experimenting with. I don't know how we would do the
I-cache optimization without a PG_arch_2, though.

In any event, if there's going to be a mass exodus to PG_dcache_clean,
Documentation/cachetlb.txt could use a considerable amount of expanding.
The read/exec and I-cache optimizations are something that would be
valuable to document, as opposed to simply being pointed at the sparc64
approach with the regular PG_dcache_dirty caveats.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 21:40                                                               ` Russell King - ARM Linux
@ 2010-03-05  4:31                                                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:31 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: FUJITA Tomonori, mdharm-kernel, linux-usb, Catalin Marinas,
	x0082077, sshtylyov, tom.leiming, bigeasy, oliver, linux-kernel,
	James Bottomley, santosh.shilimkar, Pavel Machek, greg,
	linux-arm-kernel

On Thu, 2010-03-04 at 21:40 +0000, Russell King - ARM Linux wrote:
> On Fri, Mar 05, 2010 at 08:28:34AM +1100, Benjamin Herrenschmidt wrote:
> > I don't think there's a core or driver problem in this specific case. As
> > we discussed earlier, I believe the problem is that ARM considers a
> > fresh page out of the page cache as "clean" instead of "dirty", and
> > inverting that like we do on powerpc will fix their problem too.
> 
> The only concern is that it means we treat anonymous pages as dirty
> by default.
>
> That's quite sub-optimal since we take care (eg) on write faults to
> copy the page and take care of the cache issues while we do that -

If you do the cache handling inside your copy_user_highpage() then you
can just set PG_arch_1 stuff there.

> whether that be remapping the page to be coherent with the user
> address, or cleaning each cache line as we copy the data.
> 
> Of course, the simple solution is to also arrange for PG_arch_1 to be
> set in this case.

Right.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-05  4:31                                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 21:40 +0000, Russell King - ARM Linux wrote:
> On Fri, Mar 05, 2010 at 08:28:34AM +1100, Benjamin Herrenschmidt wrote:
> > I don't think there's a core or driver problem in this specific case. As
> > we discussed earlier, I believe the problem is that ARM considers a
> > fresh page out of the page cache as "clean" instead of "dirty", and
> > inverting that like we do on powerpc will fix their problem too.
> 
> The only concern is that it means we treat anonymous pages as dirty
> by default.
>
> That's quite sub-optimal since we take care (eg) on write faults to
> copy the page and take care of the cache issues while we do that -

If you do the cache handling inside your copy_user_highpage() then you
can just set PG_arch_1 stuff there.

> whether that be remapping the page to be coherent with the user
> address, or cleaning each cache line as we copy the data.
> 
> Of course, the simple solution is to also arrange for PG_arch_1 to be
> set in this case.

Right.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 22:11                                                                     ` Catalin Marinas
@ 2010-03-05  4:34                                                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paul Mundt, James Bottomley, Pavel Machek, FUJITA Tomonori,
	linux, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel

On Thu, 2010-03-04 at 22:11 +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> > On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> 
> But does this optimisation really matter? I think with careful checking
> in set_pte_at(), you are not going to invalidate the I-cache more than
> necessary. If the original page wasn't pte_present() you would need to
> do the I-cache invalidation. The other cases where set_pte_at() is
> called for LRU (pte_young) or COW (pte_write) we can avoid the extra
> invalidation.

No. Not on PIPT (or non aliasing VIPT).

Take your typical glibc text page. This is a struct page that will be
mapped in almost every process in your system. You do not want to do the
icache inval every time. Once it's been cleaned once, it's clean for
subsequent mappings. Only VIVT needs such multiple invalidates I suppose
though in this case you probably do everything differently anyways.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-05  4:34                                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 22:11 +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> > On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> 
> But does this optimisation really matter? I think with careful checking
> in set_pte_at(), you are not going to invalidate the I-cache more than
> necessary. If the original page wasn't pte_present() you would need to
> do the I-cache invalidation. The other cases where set_pte_at() is
> called for LRU (pte_young) or COW (pte_write) we can avoid the extra
> invalidation.

No. Not on PIPT (or non aliasing VIPT).

Take your typical glibc text page. This is a struct page that will be
mapped in almost every process in your system. You do not want to do the
icache inval every time. Once it's been cleaned once, it's clean for
subsequent mappings. Only VIVT needs such multiple invalidates I suppose
though in this case you probably do everything differently anyways.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-05  1:17                                                                     ` Paul Mundt
  (?)
@ 2010-03-05  4:44                                                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:44 UTC (permalink / raw)
  To: linux-arm-kernel


> Basically we have two different MMUs on VIPT parts, the older one on all
> SH-4 parts were all read-implies-exec with no ability to differentiate
> between read or exec access. 

Ok, this is the same as the older ppc32 processors.

> For these parts the PG_dcache_dirty approach
> saves us from a lot of flushing, and the corner cases were isolated
> enough that we could tolerate fixups at the driver level, even on a
> write-allocate D-cache.

But how wide a range of devices do you have to support with those ? Is
this a few SoCs or people putting any random PCI device in there for
example ?

If I were to do it that way on ppc32, I worried that it would be more
than a few drivers that I would have to fix :-) All the 32-bit PowerMac
and PowerBooks for example, all of freescale 74xx based parts, etc...
those guys have PCI, and all sort of random HW plugged into them.

I would -love- to avoid that horrible amount of flushing we do on these,
it's quite high on any profile run, but I haven't found a good way to do
so. There's also a nasty issue of icache content leaking between
processes which I doubt is exploitable but I had people having a go at
me about it when I tried to avoid icache cleaning anonymous pages by
default.

> For second generation SH-4A (SH-X2) and up parts, read and exec are split
> out and we could reasonably adopt the PG_dcache_clean approach there
> while adopting the same sort of flushing semantics as PPC to avoid
> flushing constantly. The current generation of parts far outnumber their
> legacy counterparts, so it's certainly something I plan to experiment
> with.

I'd be curious to see whether you get a perf imporovement with that.

Note that we still have this additional thing that is floating around in
this thread which I thing is definitely worthwhile to do, which is to
mark clean pages that have been written to with DMA in dma_unmap and
friends.... if we can fix the icache problem. So far, I haven't found
James replies on this satisfactory :-) But maybe I just missed
something.

> We have an additional level of complexity on some of the SMP parts with a
> non-coherent I-cache,

I've that on some embedded ppc's too, where the icache flush instrutions
aren't broadcast, like ARM11MP in fact. Pretty horrible. Fortunately
today nobody sane (appart from Bluegene) did an SMP part with those and
so we have well localized internal hacks for them. But I've heared that
some vendors might be pumping out SoCs with that stuff too soon which
worries me.

>  some of the early CPUs have broken broadcasting of
> the cacheops in hardware and so need to rely on IPIs, while the later
> parts broadcast properly. We also need to deal with D-cache IPIs when
> using mixed coherency protocols on different CPUs.

Right, that sucks. Do those have no-exec permission support ? If they
do, then you can do what I did for BG, which is to ping pong user pages
so they are either writable or executable (since userspace code itself
will break as it will assume the cache ops -are- broadcast, since that's
what the architecture says).

> For older PIPT parts we've never used the deferred flush, since the only
> time we ever had to bother with cache maintenance was in the DMA ops, as
> anything closer to the CPU than the PCI DMAC had no opportunity to be
> snooped.

Do you also, like ARM11MP, have a case of non-cache coherent DMA and
non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
don't see how it can be dealt properly other than using load/store
tricks to bring the data into the local cache and flushing it from
there. DMA ops are called way to deep into spinlock hell to rely on IPIs
(unless your HW also provides some kind of NMI IPIs).

> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> > 
> Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> and certainly worth experimenting with. I don't know how we would do the
> I-cache optimization without a PG_arch_2, though.

Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
like it to me...

> In any event, if there's going to be a mass exodus to PG_dcache_clean,
> Documentation/cachetlb.txt could use a considerable amount of expanding.
> The read/exec and I-cache optimizations are something that would be
> valuable to document, as opposed to simply being pointed at the sparc64
> approach with the regular PG_dcache_dirty caveats.

Cheers,
Ben.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
@ 2010-03-05  4:44                                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:44 UTC (permalink / raw)
  To: Paul Mundt
  Cc: Catalin Marinas, James Bottomley, Pavel Machek, FUJITA Tomonori,
	linux, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel, linux-sh


> Basically we have two different MMUs on VIPT parts, the older one on all
> SH-4 parts were all read-implies-exec with no ability to differentiate
> between read or exec access. 

Ok, this is the same as the older ppc32 processors.

> For these parts the PG_dcache_dirty approach
> saves us from a lot of flushing, and the corner cases were isolated
> enough that we could tolerate fixups at the driver level, even on a
> write-allocate D-cache.

But how wide a range of devices do you have to support with those ? Is
this a few SoCs or people putting any random PCI device in there for
example ?

If I were to do it that way on ppc32, I worried that it would be more
than a few drivers that I would have to fix :-) All the 32-bit PowerMac
and PowerBooks for example, all of freescale 74xx based parts, etc...
those guys have PCI, and all sort of random HW plugged into them.

I would -love- to avoid that horrible amount of flushing we do on these,
it's quite high on any profile run, but I haven't found a good way to do
so. There's also a nasty issue of icache content leaking between
processes which I doubt is exploitable but I had people having a go at
me about it when I tried to avoid icache cleaning anonymous pages by
default.

> For second generation SH-4A (SH-X2) and up parts, read and exec are split
> out and we could reasonably adopt the PG_dcache_clean approach there
> while adopting the same sort of flushing semantics as PPC to avoid
> flushing constantly. The current generation of parts far outnumber their
> legacy counterparts, so it's certainly something I plan to experiment
> with.

I'd be curious to see whether you get a perf imporovement with that.

Note that we still have this additional thing that is floating around in
this thread which I thing is definitely worthwhile to do, which is to
mark clean pages that have been written to with DMA in dma_unmap and
friends.... if we can fix the icache problem. So far, I haven't found
James replies on this satisfactory :-) But maybe I just missed
something.

> We have an additional level of complexity on some of the SMP parts with a
> non-coherent I-cache,

I've that on some embedded ppc's too, where the icache flush instrutions
aren't broadcast, like ARM11MP in fact. Pretty horrible. Fortunately
today nobody sane (appart from Bluegene) did an SMP part with those and
so we have well localized internal hacks for them. But I've heared that
some vendors might be pumping out SoCs with that stuff too soon which
worries me.

>  some of the early CPUs have broken broadcasting of
> the cacheops in hardware and so need to rely on IPIs, while the later
> parts broadcast properly. We also need to deal with D-cache IPIs when
> using mixed coherency protocols on different CPUs.

Right, that sucks. Do those have no-exec permission support ? If they
do, then you can do what I did for BG, which is to ping pong user pages
so they are either writable or executable (since userspace code itself
will break as it will assume the cache ops -are- broadcast, since that's
what the architecture says).

> For older PIPT parts we've never used the deferred flush, since the only
> time we ever had to bother with cache maintenance was in the DMA ops, as
> anything closer to the CPU than the PCI DMAC had no opportunity to be
> snooped.

Do you also, like ARM11MP, have a case of non-cache coherent DMA and
non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
don't see how it can be dealt properly other than using load/store
tricks to bring the data into the local cache and flushing it from
there. DMA ops are called way to deep into spinlock hell to rely on IPIs
(unless your HW also provides some kind of NMI IPIs).

> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> > 
> Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> and certainly worth experimenting with. I don't know how we would do the
> I-cache optimization without a PG_arch_2, though.

Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
like it to me...

> In any event, if there's going to be a mass exodus to PG_dcache_clean,
> Documentation/cachetlb.txt could use a considerable amount of expanding.
> The read/exec and I-cache optimizations are something that would be
> valuable to document, as opposed to simply being pointed at the sparc64
> approach with the regular PG_dcache_dirty caveats.

Cheers,
Ben.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-05  4:44                                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:44 UTC (permalink / raw)
  To: linux-arm-kernel


> Basically we have two different MMUs on VIPT parts, the older one on all
> SH-4 parts were all read-implies-exec with no ability to differentiate
> between read or exec access. 

Ok, this is the same as the older ppc32 processors.

> For these parts the PG_dcache_dirty approach
> saves us from a lot of flushing, and the corner cases were isolated
> enough that we could tolerate fixups at the driver level, even on a
> write-allocate D-cache.

But how wide a range of devices do you have to support with those ? Is
this a few SoCs or people putting any random PCI device in there for
example ?

If I were to do it that way on ppc32, I worried that it would be more
than a few drivers that I would have to fix :-) All the 32-bit PowerMac
and PowerBooks for example, all of freescale 74xx based parts, etc...
those guys have PCI, and all sort of random HW plugged into them.

I would -love- to avoid that horrible amount of flushing we do on these,
it's quite high on any profile run, but I haven't found a good way to do
so. There's also a nasty issue of icache content leaking between
processes which I doubt is exploitable but I had people having a go at
me about it when I tried to avoid icache cleaning anonymous pages by
default.

> For second generation SH-4A (SH-X2) and up parts, read and exec are split
> out and we could reasonably adopt the PG_dcache_clean approach there
> while adopting the same sort of flushing semantics as PPC to avoid
> flushing constantly. The current generation of parts far outnumber their
> legacy counterparts, so it's certainly something I plan to experiment
> with.

I'd be curious to see whether you get a perf imporovement with that.

Note that we still have this additional thing that is floating around in
this thread which I thing is definitely worthwhile to do, which is to
mark clean pages that have been written to with DMA in dma_unmap and
friends.... if we can fix the icache problem. So far, I haven't found
James replies on this satisfactory :-) But maybe I just missed
something.

> We have an additional level of complexity on some of the SMP parts with a
> non-coherent I-cache,

I've that on some embedded ppc's too, where the icache flush instrutions
aren't broadcast, like ARM11MP in fact. Pretty horrible. Fortunately
today nobody sane (appart from Bluegene) did an SMP part with those and
so we have well localized internal hacks for them. But I've heared that
some vendors might be pumping out SoCs with that stuff too soon which
worries me.

>  some of the early CPUs have broken broadcasting of
> the cacheops in hardware and so need to rely on IPIs, while the later
> parts broadcast properly. We also need to deal with D-cache IPIs when
> using mixed coherency protocols on different CPUs.

Right, that sucks. Do those have no-exec permission support ? If they
do, then you can do what I did for BG, which is to ping pong user pages
so they are either writable or executable (since userspace code itself
will break as it will assume the cache ops -are- broadcast, since that's
what the architecture says).

> For older PIPT parts we've never used the deferred flush, since the only
> time we ever had to bother with cache maintenance was in the DMA ops, as
> anything closer to the CPU than the PCI DMAC had no opportunity to be
> snooped.

Do you also, like ARM11MP, have a case of non-cache coherent DMA and
non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
don't see how it can be dealt properly other than using load/store
tricks to bring the data into the local cache and flushing it from
there. DMA ops are called way to deep into spinlock hell to rely on IPIs
(unless your HW also provides some kind of NMI IPIs).

> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> > 
> Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> and certainly worth experimenting with. I don't know how we would do the
> I-cache optimization without a PG_arch_2, though.

Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
like it to me...

> In any event, if there's going to be a mass exodus to PG_dcache_clean,
> Documentation/cachetlb.txt could use a considerable amount of expanding.
> The read/exec and I-cache optimizations are something that would be
> valuable to document, as opposed to simply being pointed at the sparc64
> approach with the regular PG_dcache_dirty caveats.

Cheers,
Ben.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-05  4:34                                                                       ` Benjamin Herrenschmidt
@ 2010-03-05  9:27                                                                         ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-05  9:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul Mundt, James Bottomley, Pavel Machek, FUJITA Tomonori,
	linux, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel

On Fri, 2010-03-05 at 04:34 +0000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 22:11 +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> > > On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > > default flushing.
> > >
> > > I really like that idea, as I said earlier, but I'm worried about the I$
> > > side of things. IE. What I'm trying to say is that I can't see how to do
> > > that optimisation without ending up with missing I$ invalidations or
> > > doing way too many of them, unless we have a separate bit to track I$
> > > state.
> >
> > But does this optimisation really matter? I think with careful checking
> > in set_pte_at(), you are not going to invalidate the I-cache more than
> > necessary. If the original page wasn't pte_present() you would need to
> > do the I-cache invalidation. The other cases where set_pte_at() is
> > called for LRU (pte_young) or COW (pte_write) we can avoid the extra
> > invalidation.
> 
> No. Not on PIPT (or non aliasing VIPT).
> 
> Take your typical glibc text page. This is a struct page that will be
> mapped in almost every process in your system. You do not want to do the
> icache inval every time. Once it's been cleaned once, it's clean for
> subsequent mappings. Only VIVT needs such multiple invalidates I suppose
> though in this case you probably do everything differently anyways.

Yes, you are right, shared libraries don't need the extra flushing with
PIPT caches.

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-05  9:27                                                                         ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-05  9:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-03-05 at 04:34 +0000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 22:11 +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> > > On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > > default flushing.
> > >
> > > I really like that idea, as I said earlier, but I'm worried about the I$
> > > side of things. IE. What I'm trying to say is that I can't see how to do
> > > that optimisation without ending up with missing I$ invalidations or
> > > doing way too many of them, unless we have a separate bit to track I$
> > > state.
> >
> > But does this optimisation really matter? I think with careful checking
> > in set_pte_at(), you are not going to invalidate the I-cache more than
> > necessary. If the original page wasn't pte_present() you would need to
> > do the I-cache invalidation. The other cases where set_pte_at() is
> > called for LRU (pte_young) or COW (pte_write) we can avoid the extra
> > invalidation.
> 
> No. Not on PIPT (or non aliasing VIPT).
> 
> Take your typical glibc text page. This is a struct page that will be
> mapped in almost every process in your system. You do not want to do the
> icache inval every time. Once it's been cleaned once, it's clean for
> subsequent mappings. Only VIVT needs such multiple invalidates I suppose
> though in this case you probably do everything differently anyways.

Yes, you are right, shared libraries don't need the extra flushing with
PIPT caches.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 14:27                                                             ` Russell King - ARM Linux
@ 2010-03-06 10:47                                                               ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-06 10:47 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Pavel Machek, Catalin Marinas, FUJITA Tomonori, benh,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> > > 
> > > You could argue that performance improvement (how big is it, anyway?)
> > > is worth it, but this should be agreed to by wider community...
> > 
> > Performance is always worth it provided we don't sacrifice correctness.
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... once that's fixed, ARM will likely not have
> > the D/I problem, but we'll still have the libata (and other PIO systems)
> > D flushing issue.
> 
> I think you've got that backwards.
> 
> Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> since we'll interpret '0' to mean "page is dirty, it needs flushing before
> hitting userspace", whereas '1' means "page has been cleaned; there are no
> aliases."

Yes, that looks about right ... I'll think about doing this for parisc
as well.

> This doesn not address the I/D coherency issue, where the Icache needs
> attention to get rid of speculatively loaded cache lines while old data
> was present in the cache.

No, I understand that.  However, I/D coherency is handled way after the
writes to the page in the page cache.

On a fault in of exec data, we first try to get the page out of the page
cache.  If it's not present, we put the faulting process to sleep and
fetch it in from storage.  When we do the read, on the PIO path, the
kernel alias for the page becomes dirty.  Some time later, we place the
page into the user space (updating the pte entry that caused a fault).
At this point, we'll call both flush_icache_page() and
update_mmu_cache() ... this is where the I/D resolution should be done.
Since it's after any I/O has occurred, it doesn't matter whether the CPU
speculatively moved anything in or not.  As long as you flush the kernel
alias and invalidate the user I and D aliases, we're good to go.  Using
the page arch flags is really only to optimise this process (defer
kernel D alias flushing).

James


James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-06 10:47                                                               ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-06 10:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> > > 
> > > You could argue that performance improvement (how big is it, anyway?)
> > > is worth it, but this should be agreed to by wider community...
> > 
> > Performance is always worth it provided we don't sacrifice correctness.
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... once that's fixed, ARM will likely not have
> > the D/I problem, but we'll still have the libata (and other PIO systems)
> > D flushing issue.
> 
> I think you've got that backwards.
> 
> Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> since we'll interpret '0' to mean "page is dirty, it needs flushing before
> hitting userspace", whereas '1' means "page has been cleaned; there are no
> aliases."

Yes, that looks about right ... I'll think about doing this for parisc
as well.

> This doesn not address the I/D coherency issue, where the Icache needs
> attention to get rid of speculatively loaded cache lines while old data
> was present in the cache.

No, I understand that.  However, I/D coherency is handled way after the
writes to the page in the page cache.

On a fault in of exec data, we first try to get the page out of the page
cache.  If it's not present, we put the faulting process to sleep and
fetch it in from storage.  When we do the read, on the PIO path, the
kernel alias for the page becomes dirty.  Some time later, we place the
page into the user space (updating the pte entry that caused a fault).
At this point, we'll call both flush_icache_page() and
update_mmu_cache() ... this is where the I/D resolution should be done.
Since it's after any I/O has occurred, it doesn't matter whether the CPU
speculatively moved anything in or not.  As long as you flush the kernel
alias and invalidate the user I and D aliases, we're good to go.  Using
the page arch flags is really only to optimise this process (defer
kernel D alias flushing).

James


James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04  9:31                                                       ` Russell King - ARM Linux
@ 2010-03-06 10:56                                                         ` Wolfgang Mües
  2010-03-06 11:05                                                           ` Oliver Neukum
  2010-03-06 19:44                                                           ` Russell King - ARM Linux
  0 siblings, 2 replies; 352+ messages in thread
From: Wolfgang Mües @ 2010-03-06 10:56 UTC (permalink / raw)
  To: linux-arm-kernel

Russell,

Am Donnerstag, 4. M?rz 2010 10:31:17 schrieb Russell King - ARM Linux:
> You're assuming that every page is used in the same way.  Here's some
> examples where this is wrong:
> 
> 1. A page is faulted in for an application, and it is a text page.
>    - the data read in to the page needs to be visible to the instruction
>      stream, so on Harvard architecture machines, this may require cache
>      maintainence on both the D and I caches.
Yes. I think that the EXPECTED behaviour of block devices is to give the 
result of the read back in memory. So the driver should do the flush of the 
data cache.

The invalidation of the I cache should be done by the function which makes 
this piece of data executable. (Have I missed something here?)
 
> 3. A page may be read in response to an application issuing a read(2) call.
>    - the data is read from the kernel mapping, and isn't mapped into a
>      userspace address.
> 
> So, in case (3), flushing the I and D caches could be completely wasteful
But how do you AVOID the writeback of the data cache in (3)?
IMHO, the dirty data is in the cache, and the cache will writeback this data 
on its own.

regards
Wolfgang
-- 
Wahre Worte sind nicht sch?n - Sch?ne Worte sind nicht wahr. (Laotse)

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 10:56                                                         ` Wolfgang Mües
@ 2010-03-06 11:05                                                           ` Oliver Neukum
  2010-03-06 19:44                                                           ` Russell King - ARM Linux
  1 sibling, 0 replies; 352+ messages in thread
From: Oliver Neukum @ 2010-03-06 11:05 UTC (permalink / raw)
  To: linux-arm-kernel

Am Samstag, 6. M?rz 2010 11:56:41 schrieb Wolfgang M?es:
> > 1. A page is faulted in for an application, and it is a text page.
> >    - the data read in to the page needs to be visible to the instruction
> >      stream, so on Harvard architecture machines, this may require cache
> >      maintainence on both the D and I caches.
> Yes. I think that the EXPECTED behaviour of block devices is to give the 
> result of the read back in memory. So the driver should do the flush of the 
> data cache.
> 
> The invalidation of the I cache should be done by the function which makes 
> this piece of data executable. (Have I missed something here?)

What tells you that IO is happening before the page is made executable?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-06 10:47                                                               ` James Bottomley
@ 2010-03-06 19:36                                                                 ` Russell King - ARM Linux
  -1 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-06 19:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pavel Machek, Catalin Marinas, FUJITA Tomonori, benh,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache.  If it's not present, we put the faulting process to sleep and
> fetch it in from storage.  When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty.  Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.

No - this is where things get extremely icky.

The problem at this point occurs on SMP architectures.  As soon as you
update the PTE entry, it is visible to other threads of the application.
If you do I-cache handling after updating the PTE, then there is a window
where another CPU can execute the page:

CPU0			CPU1
			speculatively prefetches from page N via kernel
			mapping, loads garbage into I-cache
attempts to execute P
page fault
page N allocated
set_pte_at
			executes P
			*splat*
flush I-cache

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-06 19:36                                                                 ` Russell King - ARM Linux
  0 siblings, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-06 19:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache.  If it's not present, we put the faulting process to sleep and
> fetch it in from storage.  When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty.  Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.

No - this is where things get extremely icky.

The problem at this point occurs on SMP architectures.  As soon as you
update the PTE entry, it is visible to other threads of the application.
If you do I-cache handling after updating the PTE, then there is a window
where another CPU can execute the page:

CPU0			CPU1
			speculatively prefetches from page N via kernel
			mapping, loads garbage into I-cache
attempts to execute P
page fault
page N allocated
set_pte_at
			executes P
			*splat*
flush I-cache

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 10:56                                                         ` Wolfgang Mües
  2010-03-06 11:05                                                           ` Oliver Neukum
@ 2010-03-06 19:44                                                           ` Russell King - ARM Linux
  1 sibling, 0 replies; 352+ messages in thread
From: Russell King - ARM Linux @ 2010-03-06 19:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Mar 06, 2010 at 11:56:41AM +0100, Wolfgang M?es wrote:
> > 3. A page may be read in response to an application issuing a read(2) call.
> >    - the data is read from the kernel mapping, and isn't mapped into a
> >      userspace address.
> > 
> > So, in case (3), flushing the I and D caches could be completely wasteful
> But how do you AVOID the writeback of the data cache in (3)?
> IMHO, the dirty data is in the cache, and the cache will writeback this data 
> on its own.

You don't avoid the writeback - you avoid explicitly causing the
writeback _and_ having to wait for it.

If you're writing data into a page (pio) which you then access via that
same mapping (via read(2)), it is totally pointless to sit in a loop
asking the cache to write the data back to memory.

The point when you need this data written back to memory is the point
where you start to create mappings which may alias with the existing
mapping.  Up until that point, the hardware itself can deal with the
writebacks when it decides it's a good time to do so.

Also, cache replaacement policies may not decide to immediately re-use
the cache lines you've just flushed - which means that by forcing them
to be written back, you're just increasing the overall latency of the
system.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-06 10:47                                                               ` James Bottomley
@ 2010-03-06 21:03                                                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-06 21:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Russell King - ARM Linux, Pavel Machek, Catalin Marinas,
	FUJITA Tomonori, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel

On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache.  If it's not present, we put the faulting process to sleep and
> fetch it in from storage.  When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty.  Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.
> Since it's after any I/O has occurred, it doesn't matter whether the CPU
> speculatively moved anything in or not.  As long as you flush the kernel
> alias and invalidate the user I and D aliases, we're good to go.  Using
> the page arch flags is really only to optimise this process (defer
> kernel D alias flushing).

Ok, so while flush_icache_page() looks like something we could use
instead of set_pte_at() for the icache flushing, it doesn't answer all
the questions. Off the top of my mind:

- I see the calls to flush_icache_page() in mm/memory.c but I don't see
them next to all set_pte_at() that insert a valid PTE. For example, we
don't flush the icache for anonymous pages. While that might seem like a
good idea, we have been under pressure to "fix" that on powerpc to make
sure there is no stale icache content from another process leaking into
userspace.

- It needs to be done -before- set_pte_at() but I think the code does it
right, only your explanation above makes it unclear :-)

- It doesn't take the PTE pointer as an argument, so here goes our trick
on powerpc of filtering out exec permission rather than flushing when a
page is accessed by a read fault

- We -still- have the problem of tracking whether the icache has been
flushed or not yet for a given physical page on archs with PIPT (or non
aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
than necessary since we'll end up flushing things like glibc text pages
for every process they are mapped into which is totally wasteful. Thus
the idea of using a new PG bit to separate D$ from I$ tracking still
makes sense.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-06 21:03                                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-06 21:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache.  If it's not present, we put the faulting process to sleep and
> fetch it in from storage.  When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty.  Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.
> Since it's after any I/O has occurred, it doesn't matter whether the CPU
> speculatively moved anything in or not.  As long as you flush the kernel
> alias and invalidate the user I and D aliases, we're good to go.  Using
> the page arch flags is really only to optimise this process (defer
> kernel D alias flushing).

Ok, so while flush_icache_page() looks like something we could use
instead of set_pte_at() for the icache flushing, it doesn't answer all
the questions. Off the top of my mind:

- I see the calls to flush_icache_page() in mm/memory.c but I don't see
them next to all set_pte_at() that insert a valid PTE. For example, we
don't flush the icache for anonymous pages. While that might seem like a
good idea, we have been under pressure to "fix" that on powerpc to make
sure there is no stale icache content from another process leaking into
userspace.

- It needs to be done -before- set_pte_at() but I think the code does it
right, only your explanation above makes it unclear :-)

- It doesn't take the PTE pointer as an argument, so here goes our trick
on powerpc of filtering out exec permission rather than flushing when a
page is accessed by a read fault

- We -still- have the problem of tracking whether the icache has been
flushed or not yet for a given physical page on archs with PIPT (or non
aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
than necessary since we'll end up flushing things like glibc text pages
for every process they are mapped into which is totally wasteful. Thus
the idea of using a new PG bit to separate D$ from I$ tracking still
makes sense.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-06 19:36                                                                 ` Russell King - ARM Linux
@ 2010-03-06 21:07                                                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-06 21:07 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: James Bottomley, Pavel Machek, Catalin Marinas, FUJITA Tomonori,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.
> 
> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:

Right, we actually hit that bug on powerpc, however, James explanation
is misleading, ie, I think the -code- actually is right and
flush_icache_page() is called before set_pte_at(). However, see my other
email, I have other issues with it as it is, but nothing unfixable.

So for now, I keep my flush in set_pte_at() and ptep_set_access_flags(),
we'll see if I can move that to an improved flush_icache_page(). In
fact, even set_pte_at() isn't a panacea for me, as I want the fault type
as well.

Cheers,
Ben.

> CPU0			CPU1
> 			speculatively prefetches from page N via kernel
> 			mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> 			executes P
> 			*splat*
> flush I-cache



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-06 21:07                                                                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-06 21:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.
> 
> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:

Right, we actually hit that bug on powerpc, however, James explanation
is misleading, ie, I think the -code- actually is right and
flush_icache_page() is called before set_pte_at(). However, see my other
email, I have other issues with it as it is, but nothing unfixable.

So for now, I keep my flush in set_pte_at() and ptep_set_access_flags(),
we'll see if I can move that to an improved flush_icache_page(). In
fact, even set_pte_at() isn't a panacea for me, as I want the fault type
as well.

Cheers,
Ben.

> CPU0			CPU1
> 			speculatively prefetches from page N via kernel
> 			mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> 			executes P
> 			*splat*
> flush I-cache

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-06 21:03                                                                 ` Benjamin Herrenschmidt
@ 2010-03-07  3:37                                                                   ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-07  3:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Pavel Machek, Catalin Marinas,
	FUJITA Tomonori, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel

On Sun, 2010-03-07 at 08:03 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> > Since it's after any I/O has occurred, it doesn't matter whether the CPU
> > speculatively moved anything in or not.  As long as you flush the kernel
> > alias and invalidate the user I and D aliases, we're good to go.  Using
> > the page arch flags is really only to optimise this process (defer
> > kernel D alias flushing).
> 
> Ok, so while flush_icache_page() looks like something we could use
> instead of set_pte_at() for the icache flushing, it doesn't answer all
> the questions. Off the top of my mind:

OK, so what I was actually trying to get across is the point that we
don't handle I cache problems in the I/O or page cache code ... we
handle them in the mm code, so the mm piece of the above was
deliberately a bit vague.

> - I see the calls to flush_icache_page() in mm/memory.c but I don't see
> them next to all set_pte_at() that insert a valid PTE. For example, we
> don't flush the icache for anonymous pages. While that might seem like a
> good idea, we have been under pressure to "fix" that on powerpc to make
> sure there is no stale icache content from another process leaking into
> userspace.

I'm not entirely sure what flush_icache_page() is supposed to do.  On
parisc it flushes the *kernel* icache ... which has got to be wrong.
According to cachetlb.txt it's an obsolete interface.

> - It needs to be done -before- set_pte_at() but I think the code does it
> right, only your explanation above makes it unclear :-)

Sorry, like I said, I only sketched the mm piece.  However, at least on
parisc, there's a technical problem with flushing before we have the
pte:  On VIPT systems, we need a mapping before the flush will work.  I
was experimenting with a mechanism whereby we set aside in the kernel an
aligned region of our congruence size and simply flushed in that region
with the correct mappings, but we haven't got around to implementing it
in the kernel yet.

> - It doesn't take the PTE pointer as an argument, so here goes our trick
> on powerpc of filtering out exec permission rather than flushing when a
> page is accessed by a read fault
> 
> - We -still- have the problem of tracking whether the icache has been
> flushed or not yet for a given physical page on archs with PIPT (or non
> aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
> than necessary since we'll end up flushing things like glibc text pages
> for every process they are mapped into which is totally wasteful. Thus
> the idea of using a new PG bit to separate D$ from I$ tracking still
> makes sense.

So, assuming full congruence of user space, can't you use the VMA as an
indicator?  i.e. if we have no user space mappings, we have to flush the
icache ... if we have one or more, the icache has been flushed and
placing the same page congruently in a different address space benefits
from that prior flush, so consequently there's no need to flush again?

I also think we've established the relevant facts for the I/O thread
(that we only need to either flush the kernel D cache or mark it as to
be flushed later on PIO reads).  We're now into deep technicalities of
how the mm system operates at the architecture level, so perhaps we
should move this to linux-arch?

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-07  3:37                                                                   ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-07  3:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-03-07 at 08:03 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> > Since it's after any I/O has occurred, it doesn't matter whether the CPU
> > speculatively moved anything in or not.  As long as you flush the kernel
> > alias and invalidate the user I and D aliases, we're good to go.  Using
> > the page arch flags is really only to optimise this process (defer
> > kernel D alias flushing).
> 
> Ok, so while flush_icache_page() looks like something we could use
> instead of set_pte_at() for the icache flushing, it doesn't answer all
> the questions. Off the top of my mind:

OK, so what I was actually trying to get across is the point that we
don't handle I cache problems in the I/O or page cache code ... we
handle them in the mm code, so the mm piece of the above was
deliberately a bit vague.

> - I see the calls to flush_icache_page() in mm/memory.c but I don't see
> them next to all set_pte_at() that insert a valid PTE. For example, we
> don't flush the icache for anonymous pages. While that might seem like a
> good idea, we have been under pressure to "fix" that on powerpc to make
> sure there is no stale icache content from another process leaking into
> userspace.

I'm not entirely sure what flush_icache_page() is supposed to do.  On
parisc it flushes the *kernel* icache ... which has got to be wrong.
According to cachetlb.txt it's an obsolete interface.

> - It needs to be done -before- set_pte_at() but I think the code does it
> right, only your explanation above makes it unclear :-)

Sorry, like I said, I only sketched the mm piece.  However, at least on
parisc, there's a technical problem with flushing before we have the
pte:  On VIPT systems, we need a mapping before the flush will work.  I
was experimenting with a mechanism whereby we set aside in the kernel an
aligned region of our congruence size and simply flushed in that region
with the correct mappings, but we haven't got around to implementing it
in the kernel yet.

> - It doesn't take the PTE pointer as an argument, so here goes our trick
> on powerpc of filtering out exec permission rather than flushing when a
> page is accessed by a read fault
> 
> - We -still- have the problem of tracking whether the icache has been
> flushed or not yet for a given physical page on archs with PIPT (or non
> aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
> than necessary since we'll end up flushing things like glibc text pages
> for every process they are mapped into which is totally wasteful. Thus
> the idea of using a new PG bit to separate D$ from I$ tracking still
> makes sense.

So, assuming full congruence of user space, can't you use the VMA as an
indicator?  i.e. if we have no user space mappings, we have to flush the
icache ... if we have one or more, the icache has been flushed and
placing the same page congruently in a different address space benefits
from that prior flush, so consequently there's no need to flush again?

I also think we've established the relevant facts for the I/O thread
(that we only need to either flush the kernel D cache or mark it as to
be flushed later on PIO reads).  We're now into deep technicalities of
how the mm system operates at the architecture level, so perhaps we
should move this to linux-arch?

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-06 19:36                                                                 ` Russell King - ARM Linux
@ 2010-03-07  5:54                                                                   ` James Bottomley
  -1 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-07  5:54 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Pavel Machek, Catalin Marinas, FUJITA Tomonori, benh,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.

OK, but the point I'm trying to make is that the page cache code,
including the I/O layer, only manages kernel D alias state (either by
flushing or marking it dirty).  The user space I/D handling is done in
the mm code (I'm not claiming it's done correctly there, just claiming
it's done there).

> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
> 
> CPU0			CPU1
> 			speculatively prefetches from page N via kernel
> 			mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> 			executes P
> 			*splat*
> flush I-cache

OK, so I can believe this.  We see extremely rare segfaults on parisc
which look to be the result of some I flush race like this.  However, I
think for a discussion of problems with the arch and mm interfaces, we
should probably move off the usb list and onto linux-arch.

Our specific problem on parisc is that being VIPT we can't do an I (or
D) user flush without a mapping.  We have two schemes for fixing this:
One is to use a PAGE_FLUSH flag for the mapping ... it allows the
flushes to work but refuses any type of RWX access (can do this because
we have a software TLB).  The other is to use a flush area within the
kernel where we flush a page congruent to the userspace address ... I
haven't got this working yet, and it's a bit wasteful of kernel address
space because our congruence modulus is 4MB.

James



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-07  5:54                                                                   ` James Bottomley
  0 siblings, 0 replies; 352+ messages in thread
From: James Bottomley @ 2010-03-07  5:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.

OK, but the point I'm trying to make is that the page cache code,
including the I/O layer, only manages kernel D alias state (either by
flushing or marking it dirty).  The user space I/D handling is done in
the mm code (I'm not claiming it's done correctly there, just claiming
it's done there).

> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
> 
> CPU0			CPU1
> 			speculatively prefetches from page N via kernel
> 			mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> 			executes P
> 			*splat*
> flush I-cache

OK, so I can believe this.  We see extremely rare segfaults on parisc
which look to be the result of some I flush race like this.  However, I
think for a discussion of problems with the arch and mm interfaces, we
should probably move off the usb list and onto linux-arch.

Our specific problem on parisc is that being VIPT we can't do an I (or
D) user flush without a mapping.  We have two schemes for fixing this:
One is to use a PAGE_FLUSH flag for the mapping ... it allows the
flushes to work but refuses any type of RWX access (can do this because
we have a software TLB).  The other is to use a flush area within the
kernel where we flush a page congruent to the userspace address ... I
haven't got this working yet, and it's a bit wasteful of kernel address
space because our congruence modulus is 4MB.

James

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-04 15:35                                                           ` Catalin Marinas
@ 2010-03-07  8:23                                                             ` Pavel Machek
  -1 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-03-07  8:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: FUJITA Tomonori, James.Bottomley, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

Hi!

> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> 
> Well, ARM is pretty similar to other architectures in this respect. And
> I'm sure other architectures have similar problems, only that they only
> become visible in some circumstances they may not have encountered (i.e.
> PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
> Some other architectures may do heavier flushing
> 
> Of course, a Documentation/arm/cachetlb.txt file would make sense.

Actually, short/simple documentation for driver authors would be even
better. Then you can claim it is bug in driver :-).
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-07  8:23                                                             ` Pavel Machek
  0 siblings, 0 replies; 352+ messages in thread
From: Pavel Machek @ 2010-03-07  8:23 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> 
> Well, ARM is pretty similar to other architectures in this respect. And
> I'm sure other architectures have similar problems, only that they only
> become visible in some circumstances they may not have encountered (i.e.
> PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
> Some other architectures may do heavier flushing
> 
> Of course, a Documentation/arm/cachetlb.txt file would make sense.

Actually, short/simple documentation for driver authors would be even
better. Then you can claim it is bug in driver :-).
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-07  3:37                                                                   ` James Bottomley
@ 2010-03-08  8:46                                                                     ` FUJITA Tomonori
  -1 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-08  8:46 UTC (permalink / raw)
  To: James.Bottomley
  Cc: benh, linux, pavel, catalin.marinas, fujita.tomonori,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Sun, 07 Mar 2010 09:07:17 +0530
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> So, assuming full congruence of user space, can't you use the VMA as an
> indicator?  i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

I'm not sure about this (sounds like the trick might work for some
though). As I said earlier, I think that IA64 could avoid flushing
I-cache even if the page has no user space mappings (if it did dma to
the page). ia64 needs to track pages for that.

As Ben said, I guess that we need two separate bits for D and I. I
think that it's a good idea to standardize how to use the bits for
optimization (some uses none, some uses only one, some needs both
though). And then we need to revisit I/O path (fs, the block layer,
drivers). Seems that we added flush_dcache_page() everywhere.


> I also think we've established the relevant facts for the I/O thread
> (that we only need to either flush the kernel D cache or mark it as to
> be flushed later on PIO reads).

We have the PIO issue about D-cache aliasing now? That's, don't mm/ or
fs/ already flush D-cache properly? I thought that Catalin has only
D/I cache consistency issue. If not, PIO doesn't also work powerpc
that handles properly D/I cache consistency.


> We're now into deep technicalities of
> how the mm system operates at the architecture level, so perhaps we
> should move this to linux-arch?

Yeah, probably we should.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-08  8:46                                                                     ` FUJITA Tomonori
  0 siblings, 0 replies; 352+ messages in thread
From: FUJITA Tomonori @ 2010-03-08  8:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 07 Mar 2010 09:07:17 +0530
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> So, assuming full congruence of user space, can't you use the VMA as an
> indicator?  i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

I'm not sure about this (sounds like the trick might work for some
though). As I said earlier, I think that IA64 could avoid flushing
I-cache even if the page has no user space mappings (if it did dma to
the page). ia64 needs to track pages for that.

As Ben said, I guess that we need two separate bits for D and I. I
think that it's a good idea to standardize how to use the bits for
optimization (some uses none, some uses only one, some needs both
though). And then we need to revisit I/O path (fs, the block layer,
drivers). Seems that we added flush_dcache_page() everywhere.


> I also think we've established the relevant facts for the I/O thread
> (that we only need to either flush the kernel D cache or mark it as to
> be flushed later on PIO reads).

We have the PIO issue about D-cache aliasing now? That's, don't mm/ or
fs/ already flush D-cache properly? I thought that Catalin has only
D/I cache consistency issue. If not, PIO doesn't also work powerpc
that handles properly D/I cache consistency.


> We're now into deep technicalities of
> how the mm system operates at the architecture level, so perhaps we
> should move this to linux-arch?

Yeah, probably we should.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-07  8:23                                                             ` Pavel Machek
@ 2010-03-08 10:57                                                               ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-08 10:57 UTC (permalink / raw)
  To: Pavel Machek
  Cc: FUJITA Tomonori, James.Bottomley, benh, linux, mdharm-kernel,
	linux-usb, x0082077, sshtylyov, tom.leiming, bigeasy, oliver,
	linux-kernel, santosh.shilimkar, greg, linux-arm-kernel

On Sun, 2010-03-07 at 08:23 +0000, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> >
> > Well, ARM is pretty similar to other architectures in this respect. And
> > I'm sure other architectures have similar problems, only that they only
> > become visible in some circumstances they may not have encountered (i.e.
> > PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
> > Some other architectures may do heavier flushing
> >
> > Of course, a Documentation/arm/cachetlb.txt file would make sense.
> 
> Actually, short/simple documentation for driver authors would be even
> better. Then you can claim it is bug in driver :-).

That would help, but only once we agree whether it's a driver bug or the
arch code needs changing.

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-08 10:57                                                               ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-08 10:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-03-07 at 08:23 +0000, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> >
> > Well, ARM is pretty similar to other architectures in this respect. And
> > I'm sure other architectures have similar problems, only that they only
> > become visible in some circumstances they may not have encountered (i.e.
> > PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
> > Some other architectures may do heavier flushing
> >
> > Of course, a Documentation/arm/cachetlb.txt file would make sense.
> 
> Actually, short/simple documentation for driver authors would be even
> better. Then you can claim it is bug in driver :-).

That would help, but only once we agree whether it's a driver bug or the
arch code needs changing.

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-06 19:36                                                                 ` Russell King - ARM Linux
@ 2010-03-08 11:17                                                                   ` Catalin Marinas
  -1 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-08 11:17 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: James Bottomley, Pavel Machek, FUJITA Tomonori, benh,
	mdharm-kernel, linux-usb, x0082077, sshtylyov, tom.leiming,
	bigeasy, oliver, linux-kernel, santosh.shilimkar, greg,
	linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.
> 
> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
> 
> CPU0                    CPU1
>                         speculatively prefetches from page N via kernel
>                         mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
>                         executes P
>                         *splat*
> flush I-cache

You have two choices - either invalidate the I-cache before the user pte
becomes visible or set the page as not-executable in set_pte_at() and
later mark it as executable in update_mmu_cache (via set_pte_ext).

We currently invalidate the whole I-cache for historical reasons but we
could actually only invalidate a single page. Since even on latest ARM
CPUs, the I-cache is a real VIPT (i.e. can have aliases), we would need
to invalidate on the user mapping (or create a temporary one). The
latter approach of clearing the X bit in set_pte_at may actually help
with this scenario (I haven't done any tests though).

-- 
Catalin


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-08 11:17                                                                   ` Catalin Marinas
  0 siblings, 0 replies; 352+ messages in thread
From: Catalin Marinas @ 2010-03-08 11:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.
> 
> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
> 
> CPU0                    CPU1
>                         speculatively prefetches from page N via kernel
>                         mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
>                         executes P
>                         *splat*
> flush I-cache

You have two choices - either invalidate the I-cache before the user pte
becomes visible or set the page as not-executable in set_pte_at() and
later mark it as executable in update_mmu_cache (via set_pte_ext).

We currently invalidate the whole I-cache for historical reasons but we
could actually only invalidate a single page. Since even on latest ARM
CPUs, the I-cache is a real VIPT (i.e. can have aliases), we would need
to invalidate on the user mapping (or create a temporary one). The
latter approach of clearing the X bit in set_pte_at may actually help
with this scenario (I haven't done any tests though).

-- 
Catalin

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-07  3:37                                                                   ` James Bottomley
@ 2010-03-09  2:25                                                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-09  2:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: Russell King - ARM Linux, Pavel Machek, Catalin Marinas,
	FUJITA Tomonori, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel

On Sun, 2010-03-07 at 09:07 +0530, James Bottomley wrote:
> So, assuming full congruence of user space, can't you use the VMA as an
> indicator?  i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

the VMA ? Or you mean struct page -> mapping ? That would work I suppose
in the case where we want to flush the icache pages for all pages mapped
into user space. But on processors that support per-page execute
permission, we really only want to flush pages that are executed from
(lazily). In that case, we do need a dedicated bit to keep track of
whether a given page has been flushed already.

> I also think we've established the relevant facts for the I/O thread
> (that we only need to either flush the kernel D cache or mark it as to
> be flushed later on PIO reads).  We're now into deep technicalities of
> how the mm system operates at the architecture level, so perhaps we
> should move this to linux-arch? 

No objection though moving threads after the fact is a recipe for
trouble :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-09  2:25                                                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-09  2:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-03-07 at 09:07 +0530, James Bottomley wrote:
> So, assuming full congruence of user space, can't you use the VMA as an
> indicator?  i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

the VMA ? Or you mean struct page -> mapping ? That would work I suppose
in the case where we want to flush the icache pages for all pages mapped
into user space. But on processors that support per-page execute
permission, we really only want to flush pages that are executed from
(lazily). In that case, we do need a dedicated bit to keep track of
whether a given page has been flushed already.

> I also think we've established the relevant facts for the I/O thread
> (that we only need to either flush the kernel D cache or mark it as to
> be flushed later on PIO reads).  We're now into deep technicalities of
> how the mm system operates at the architecture level, so perhaps we
> should move this to linux-arch? 

No objection though moving threads after the fact is a recipe for
trouble :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-05  4:44                                                                       ` Benjamin Herrenschmidt
  (?)
@ 2010-03-10  3:52                                                                         ` Paul Mundt
  -1 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-10  3:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 03:44:55PM +1100, Benjamin Herrenschmidt wrote:
> > For these parts the PG_dcache_dirty approach
> > saves us from a lot of flushing, and the corner cases were isolated
> > enough that we could tolerate fixups at the driver level, even on a
> > write-allocate D-cache.
> 
> But how wide a range of devices do you have to support with those ? Is
> this a few SoCs or people putting any random PCI device in there for
> example ?
> 
> If I were to do it that way on ppc32, I worried that it would be more
> than a few drivers that I would have to fix :-) All the 32-bit PowerMac
> and PowerBooks for example, all of freescale 74xx based parts, etc...
> those guys have PCI, and all sort of random HW plugged into them.
> 
Many of those parts do support PCI, but are rarely used with arbitrary
devices. The PCI controller on those parts also permits one to establish
coherency for any transactions between PCI and memory through a rudimentary
snoop controller that requires the CPU to avoid entering any sleep
states. This works ok in practice since that series of host controllers
doesn't really support power management anyways (nor do any of the cores
of that generation implement any of the more complex sleep states).

> > For second generation SH-4A (SH-X2) and up parts, read and exec are split
> > out and we could reasonably adopt the PG_dcache_clean approach there
> > while adopting the same sort of flushing semantics as PPC to avoid
> > flushing constantly. The current generation of parts far outnumber their
> > legacy counterparts, so it's certainly something I plan to experiment
> > with.
> 
> I'd be curious to see whether you get a perf imporovement with that.
> 
> Note that we still have this additional thing that is floating around in
> this thread which I thing is definitely worthwhile to do, which is to
> mark clean pages that have been written to with DMA in dma_unmap and
> friends.... if we can fix the icache problem. So far, I haven't found
> James replies on this satisfactory :-) But maybe I just missed
> something.
> 
I'll start in on profiling some of this once I start on 2.6.35 stuff. I
think I still have my old numbers from when we did the PG_mapped to
PG_dcache_dirty transition, so it will be interesting to see how
PG_dcache_clean stacks up against both of those.

> > We have an additional level of complexity on some of the SMP parts with a
> > non-coherent I-cache,
> 
> I've that on some embedded ppc's too, where the icache flush instrutions
> aren't broadcast, like ARM11MP in fact. Pretty horrible. Fortunately
> today nobody sane (appart from Bluegene) did an SMP part with those and
> so we have well localized internal hacks for them. But I've heared that
> some vendors might be pumping out SoCs with that stuff too soon which
> worries me.
> 
I-cache invalidations are broadcast on all mass produced SH-4A SMP parts,
but we do have some early proto chips that screwed that up. For the case
of mainline, we ought to be able to assume hardware broadcast though.

> >  some of the early CPUs have broken broadcasting of
> > the cacheops in hardware and so need to rely on IPIs, while the later
> > parts broadcast properly. We also need to deal with D-cache IPIs when
> > using mixed coherency protocols on different CPUs.
> 
> Right, that sucks. Do those have no-exec permission support ? If they
> do, then you can do what I did for BG, which is to ping pong user pages
> so they are either writable or executable (since userspace code itself
> will break as it will assume the cache ops -are- broadcast, since that's
> what the architecture says).
> 
Yes, these all support no-exec. I'll give the ping ponging thing a try,
thanks for the tip.

> Do you also, like ARM11MP, have a case of non-cache coherent DMA and
> non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
> don't see how it can be dealt properly other than using load/store
> tricks to bring the data into the local cache and flushing it from
> there. DMA ops are called way to deep into spinlock hell to rely on IPIs

The only thing we really lack is I-cache coherency, which isn't such a
big deal with invalidations being broadcast. All DMA accesses are
snooped, and the D-cache is fully coherent.

> (unless your HW also provides some kind of NMI IPIs).
> 
While we don't have anything like FIQs to work with, we do have IRQ
priority levels to play with. I'd toyed with this idea in the past of
simply having a reserved level that never gets masked, particularly for
things like broadcast backtraces.

> > Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> > and certainly worth experimenting with. I don't know how we would do the
> > I-cache optimization without a PG_arch_2, though.
> 
> Right. That's the one thing I've been trying to figure out without
> success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
> like it to me...
> 
Well, it does start to get a bit painful with sparsemem section or NUMA
node IDs also digging in to the page flags on 32-bit.. the benefits would
have to be pretty compelling to offset the pain.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
@ 2010-03-10  3:52                                                                         ` Paul Mundt
  0 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-10  3:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Catalin Marinas, James Bottomley, Pavel Machek, FUJITA Tomonori,
	linux, mdharm-kernel, linux-usb, x0082077, sshtylyov,
	tom.leiming, bigeasy, oliver, linux-kernel, santosh.shilimkar,
	greg, linux-arm-kernel, linux-sh

On Fri, Mar 05, 2010 at 03:44:55PM +1100, Benjamin Herrenschmidt wrote:
> > For these parts the PG_dcache_dirty approach
> > saves us from a lot of flushing, and the corner cases were isolated
> > enough that we could tolerate fixups at the driver level, even on a
> > write-allocate D-cache.
> 
> But how wide a range of devices do you have to support with those ? Is
> this a few SoCs or people putting any random PCI device in there for
> example ?
> 
> If I were to do it that way on ppc32, I worried that it would be more
> than a few drivers that I would have to fix :-) All the 32-bit PowerMac
> and PowerBooks for example, all of freescale 74xx based parts, etc...
> those guys have PCI, and all sort of random HW plugged into them.
> 
Many of those parts do support PCI, but are rarely used with arbitrary
devices. The PCI controller on those parts also permits one to establish
coherency for any transactions between PCI and memory through a rudimentary
snoop controller that requires the CPU to avoid entering any sleep
states. This works ok in practice since that series of host controllers
doesn't really support power management anyways (nor do any of the cores
of that generation implement any of the more complex sleep states).

> > For second generation SH-4A (SH-X2) and up parts, read and exec are split
> > out and we could reasonably adopt the PG_dcache_clean approach there
> > while adopting the same sort of flushing semantics as PPC to avoid
> > flushing constantly. The current generation of parts far outnumber their
> > legacy counterparts, so it's certainly something I plan to experiment
> > with.
> 
> I'd be curious to see whether you get a perf imporovement with that.
> 
> Note that we still have this additional thing that is floating around in
> this thread which I thing is definitely worthwhile to do, which is to
> mark clean pages that have been written to with DMA in dma_unmap and
> friends.... if we can fix the icache problem. So far, I haven't found
> James replies on this satisfactory :-) But maybe I just missed
> something.
> 
I'll start in on profiling some of this once I start on 2.6.35 stuff. I
think I still have my old numbers from when we did the PG_mapped to
PG_dcache_dirty transition, so it will be interesting to see how
PG_dcache_clean stacks up against both of those.

> > We have an additional level of complexity on some of the SMP parts with a
> > non-coherent I-cache,
> 
> I've that on some embedded ppc's too, where the icache flush instrutions
> aren't broadcast, like ARM11MP in fact. Pretty horrible. Fortunately
> today nobody sane (appart from Bluegene) did an SMP part with those and
> so we have well localized internal hacks for them. But I've heared that
> some vendors might be pumping out SoCs with that stuff too soon which
> worries me.
> 
I-cache invalidations are broadcast on all mass produced SH-4A SMP parts,
but we do have some early proto chips that screwed that up. For the case
of mainline, we ought to be able to assume hardware broadcast though.

> >  some of the early CPUs have broken broadcasting of
> > the cacheops in hardware and so need to rely on IPIs, while the later
> > parts broadcast properly. We also need to deal with D-cache IPIs when
> > using mixed coherency protocols on different CPUs.
> 
> Right, that sucks. Do those have no-exec permission support ? If they
> do, then you can do what I did for BG, which is to ping pong user pages
> so they are either writable or executable (since userspace code itself
> will break as it will assume the cache ops -are- broadcast, since that's
> what the architecture says).
> 
Yes, these all support no-exec. I'll give the ping ponging thing a try,
thanks for the tip.

> Do you also, like ARM11MP, have a case of non-cache coherent DMA and
> non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
> don't see how it can be dealt properly other than using load/store
> tricks to bring the data into the local cache and flushing it from
> there. DMA ops are called way to deep into spinlock hell to rely on IPIs

The only thing we really lack is I-cache coherency, which isn't such a
big deal with invalidations being broadcast. All DMA accesses are
snooped, and the D-cache is fully coherent.

> (unless your HW also provides some kind of NMI IPIs).
> 
While we don't have anything like FIQs to work with, we do have IRQ
priority levels to play with. I'd toyed with this idea in the past of
simply having a reserved level that never gets masked, particularly for
things like broadcast backtraces.

> > Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> > and certainly worth experimenting with. I don't know how we would do the
> > I-cache optimization without a PG_arch_2, though.
> 
> Right. That's the one thing I've been trying to figure out without
> success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
> like it to me...
> 
Well, it does start to get a bit painful with sparsemem section or NUMA
node IDs also digging in to the page flags on 32-bit.. the benefits would
have to be pretty compelling to offset the pain.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-10  3:52                                                                         ` Paul Mundt
  0 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-03-10  3:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 03:44:55PM +1100, Benjamin Herrenschmidt wrote:
> > For these parts the PG_dcache_dirty approach
> > saves us from a lot of flushing, and the corner cases were isolated
> > enough that we could tolerate fixups at the driver level, even on a
> > write-allocate D-cache.
> 
> But how wide a range of devices do you have to support with those ? Is
> this a few SoCs or people putting any random PCI device in there for
> example ?
> 
> If I were to do it that way on ppc32, I worried that it would be more
> than a few drivers that I would have to fix :-) All the 32-bit PowerMac
> and PowerBooks for example, all of freescale 74xx based parts, etc...
> those guys have PCI, and all sort of random HW plugged into them.
> 
Many of those parts do support PCI, but are rarely used with arbitrary
devices. The PCI controller on those parts also permits one to establish
coherency for any transactions between PCI and memory through a rudimentary
snoop controller that requires the CPU to avoid entering any sleep
states. This works ok in practice since that series of host controllers
doesn't really support power management anyways (nor do any of the cores
of that generation implement any of the more complex sleep states).

> > For second generation SH-4A (SH-X2) and up parts, read and exec are split
> > out and we could reasonably adopt the PG_dcache_clean approach there
> > while adopting the same sort of flushing semantics as PPC to avoid
> > flushing constantly. The current generation of parts far outnumber their
> > legacy counterparts, so it's certainly something I plan to experiment
> > with.
> 
> I'd be curious to see whether you get a perf imporovement with that.
> 
> Note that we still have this additional thing that is floating around in
> this thread which I thing is definitely worthwhile to do, which is to
> mark clean pages that have been written to with DMA in dma_unmap and
> friends.... if we can fix the icache problem. So far, I haven't found
> James replies on this satisfactory :-) But maybe I just missed
> something.
> 
I'll start in on profiling some of this once I start on 2.6.35 stuff. I
think I still have my old numbers from when we did the PG_mapped to
PG_dcache_dirty transition, so it will be interesting to see how
PG_dcache_clean stacks up against both of those.

> > We have an additional level of complexity on some of the SMP parts with a
> > non-coherent I-cache,
> 
> I've that on some embedded ppc's too, where the icache flush instrutions
> aren't broadcast, like ARM11MP in fact. Pretty horrible. Fortunately
> today nobody sane (appart from Bluegene) did an SMP part with those and
> so we have well localized internal hacks for them. But I've heared that
> some vendors might be pumping out SoCs with that stuff too soon which
> worries me.
> 
I-cache invalidations are broadcast on all mass produced SH-4A SMP parts,
but we do have some early proto chips that screwed that up. For the case
of mainline, we ought to be able to assume hardware broadcast though.

> >  some of the early CPUs have broken broadcasting of
> > the cacheops in hardware and so need to rely on IPIs, while the later
> > parts broadcast properly. We also need to deal with D-cache IPIs when
> > using mixed coherency protocols on different CPUs.
> 
> Right, that sucks. Do those have no-exec permission support ? If they
> do, then you can do what I did for BG, which is to ping pong user pages
> so they are either writable or executable (since userspace code itself
> will break as it will assume the cache ops -are- broadcast, since that's
> what the architecture says).
> 
Yes, these all support no-exec. I'll give the ping ponging thing a try,
thanks for the tip.

> Do you also, like ARM11MP, have a case of non-cache coherent DMA and
> non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
> don't see how it can be dealt properly other than using load/store
> tricks to bring the data into the local cache and flushing it from
> there. DMA ops are called way to deep into spinlock hell to rely on IPIs

The only thing we really lack is I-cache coherency, which isn't such a
big deal with invalidations being broadcast. All DMA accesses are
snooped, and the D-cache is fully coherent.

> (unless your HW also provides some kind of NMI IPIs).
> 
While we don't have anything like FIQs to work with, we do have IRQ
priority levels to play with. I'd toyed with this idea in the past of
simply having a reserved level that never gets masked, particularly for
things like broadcast backtraces.

> > Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> > and certainly worth experimenting with. I don't know how we would do the
> > I-cache optimization without a PG_arch_2, though.
> 
> Right. That's the one thing I've been trying to figure out without
> success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
> like it to me...
> 
Well, it does start to get a bit painful with sparsemem section or NUMA
node IDs also digging in to the page flags on 32-bit.. the benefits would
have to be pretty compelling to offset the pain.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-03-10  3:52                                                                         ` Paul Mundt
  (?)
@ 2010-03-11 21:44                                                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-11 21:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-10 at 12:52 +0900, Paul Mundt wrote:
> Well, it does start to get a bit painful with sparsemem section or
> NUMA
> node IDs also digging in to the page flags on 32-bit.. the benefits
> would
> have to be pretty compelling to offset the pain. 

Unless we play a dangerous trick and re-use another flag that isn't
meaningful for allocated pages... maybe PG_buddy ? Or do I miss
something about that guy semantics ?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
@ 2010-03-11 21:44                                                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-11 21:44 UTC (permalink / raw)
  To: Paul Mundt
  Cc: FUJITA Tomonori, mdharm-kernel, oliver, linux, greg, x0082077,
	sshtylyov, Catalin Marinas, bigeasy, linux-usb, linux-kernel,
	James Bottomley, linux-sh, santosh.shilimkar, Pavel Machek,
	tom.leiming, linux-arm-kernel

On Wed, 2010-03-10 at 12:52 +0900, Paul Mundt wrote:
> Well, it does start to get a bit painful with sparsemem section or
> NUMA
> node IDs also digging in to the page flags on 32-bit.. the benefits
> would
> have to be pretty compelling to offset the pain. 

Unless we play a dangerous trick and re-use another flag that isn't
meaningful for allocated pages... maybe PG_buddy ? Or do I miss
something about that guy semantics ?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 352+ messages in thread

* USB mass storage and ARM cache coherency
@ 2010-03-11 21:44                                                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 352+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-11 21:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-10 at 12:52 +0900, Paul Mundt wrote:
> Well, it does start to get a bit painful with sparsemem section or
> NUMA
> node IDs also digging in to the page flags on 32-bit.. the benefits
> would
> have to be pretty compelling to offset the pain. 

Unless we play a dangerous trick and re-use another flag that isn't
meaningful for allocated pages... maybe PG_buddy ? Or do I miss
something about that guy semantics ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
  2010-02-03 23:56 George Spelvin
@ 2010-02-04  4:39 ` Paul Mundt
  0 siblings, 0 replies; 352+ messages in thread
From: Paul Mundt @ 2010-02-04  4:39 UTC (permalink / raw)
  To: George Spelvin; +Cc: catalin.marinas, linux-kernel

On Wed, Feb 03, 2010 at 06:56:44PM -0500, George Spelvin wrote:
> > Apart from that, flush_dcache_page() doesn't have any data flow
> > information. Optimisations could be done on ARM if we know that the
> > kernel only intends to read from a page (no flushing necessary with a
> > non-aliasing D-cache).
> 
> Already done in flush_dcache_page().  If possible (uniprocessor), it just
> flags the page as PG_dcache_dirty, and defers the actual flush operation
> until it's mapped somewhere else (either a virtual alias or executable).
> 
Try reading the thread again, as you seem to have missed the point
completely. The issue isn't with lazy dcache writeback, the issue is that
flush_dcache_page() is a bit of a sledgehammer for cases when directional
information is available. The DMA mapping operations conversely are aware
of data flow and optimize accordingly.

Additionally, with something like a flush_dcache_range() it's possible
to optimize for large ranges as opposed to page-at-a-time looping for
anything that needs to flag PG_dcache_dirty on a bulk group of pages.

^ permalink raw reply	[flat|nested] 352+ messages in thread

* Re: USB mass storage and ARM cache coherency
@ 2010-02-03 23:56 George Spelvin
  2010-02-04  4:39 ` Paul Mundt
  0 siblings, 1 reply; 352+ messages in thread
From: George Spelvin @ 2010-02-03 23:56 UTC (permalink / raw)
  To: catalin.marinas; +Cc: linux, linux-kernel

> Apart from that, flush_dcache_page() doesn't have any data flow
> information. Optimisations could be done on ARM if we know that the
> kernel only intends to read from a page (no flushing necessary with a
> non-aliasing D-cache).

Already done in flush_dcache_page().  If possible (uniprocessor), it just
flags the page as PG_dcache_dirty, and defers the actual flush operation
until it's mapped somewhere else (either a virtual alias or executable).

See Documentation/cachetlb.txt.  (Really, all PIO drivers should
be calling flush_dcache_page.)

^ permalink raw reply	[flat|nested] 352+ messages in thread

end of thread, other threads:[~2010-03-11 21:49 UTC | newest]

Thread overview: 352+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-29 14:34 USB mass storage and ARM cache coherency Catalin Marinas
2010-01-29 16:10 ` Oliver Neukum
2010-01-29 16:23 ` Ming Lei
2010-01-29 16:34   ` Catalin Marinas
2010-01-29 16:41     ` Oliver Neukum
2010-01-29 17:14       ` Catalin Marinas
2010-01-29 17:51     ` Sergei Shtylyov
2010-01-29 18:54       ` Matthew Dharm
2010-01-29 19:35         ` Greg KH
2010-02-01 13:49         ` Catalin Marinas
2010-02-01 17:29         ` Catalin Marinas
2010-02-01 20:14           ` Alan Stern
2010-02-02  4:24             ` Paul Mundt
2010-02-02  9:58               ` Catalin Marinas
2010-02-01 22:30           ` Andreas Mohr
2010-02-02  6:58             ` Oliver Neukum
2010-02-02  9:31               ` Florian Fainelli
2010-02-02  6:39           ` Paul Mundt
2010-02-02 11:05             ` Catalin Marinas
2010-02-02 11:15               ` Paul Mundt
2010-02-02  9:11           ` Sebastian Andrzej Siewior
2010-02-02 11:09             ` Catalin Marinas
2010-02-02 11:48           ` Oliver Neukum
2010-02-02 12:01             ` Catalin Marinas
2010-02-02 12:07               ` Oliver Neukum
2010-02-02 12:11                 ` Andreas Mohr
2010-02-02 14:42                   ` Clemens Ladisch
2010-02-02 14:52                     ` Oliver Neukum
2010-02-02 15:10                       ` Andreas Mohr
2010-02-02 15:34                         ` Catalin Marinas
2010-02-02 20:38                     ` Andreas Mohr
2010-02-02 12:39                 ` Catalin Marinas
2010-02-02 13:08                   ` Oliver Neukum
2010-02-02 14:34                     ` Catalin Marinas
2010-02-02 17:11                     ` Alan Stern
2010-02-02 17:20                       ` Catalin Marinas
2010-02-02 21:52                         ` Andreas Mohr
2010-02-03 15:15                           ` Alan Stern
2010-02-08  6:55                       ` Pavel Machek
2010-02-02 13:36                   ` Ming Lei
2010-02-02 14:35                     ` Catalin Marinas
2010-02-08  6:55           ` Pavel Machek
2010-02-08  6:55             ` Pavel Machek
2010-02-08  7:33             ` Andreas Mohr
2010-02-08  7:33               ` Andreas Mohr
2010-02-08 10:19               ` Catalin Marinas
2010-02-08 10:19                 ` Catalin Marinas
2010-02-08  9:51             ` Catalin Marinas
2010-02-08  9:51               ` Catalin Marinas
2010-02-08 10:03               ` Andy Green
2010-02-08 10:03                 ` Andy Green
2010-02-17  9:50                 ` Sascha Hauer
2010-02-17  9:50                   ` Sascha Hauer
2010-02-17  9:57                   ` Andy Green
2010-02-17  9:57                     ` Andy Green
2010-02-08 10:52               ` Pavel Machek
2010-02-08 10:52                 ` Pavel Machek
2010-02-08 11:28                 ` Catalin Marinas
2010-02-08 11:28                   ` Catalin Marinas
2010-02-16  7:57                   ` Shilimkar, Santosh
2010-02-16  7:57                     ` Shilimkar, Santosh
2010-02-16  8:22                     ` Oliver Neukum
2010-02-16  8:22                       ` Oliver Neukum
2010-02-16  8:55                       ` Shilimkar, Santosh
2010-02-16  8:55                         ` Shilimkar, Santosh
2010-02-16  9:07                         ` Oliver Neukum
2010-02-16  9:07                           ` Oliver Neukum
2010-02-16  9:39                           ` Russell King - ARM Linux
2010-02-16  9:39                             ` Russell King - ARM Linux
2010-02-16 13:32                             ` Oliver Neukum
2010-02-16 13:32                               ` Oliver Neukum
2010-02-16 13:40                               ` Shilimkar, Santosh
2010-02-16 13:40                                 ` Shilimkar, Santosh
2010-02-16 13:46                                 ` Oliver Neukum
2010-02-16 13:46                                   ` Oliver Neukum
2010-02-16 14:12                                   ` Shilimkar, Santosh
2010-02-16 14:12                                     ` Shilimkar, Santosh
2010-02-16 14:22                                     ` Oliver Neukum
2010-02-16 14:22                                       ` Oliver Neukum
2010-02-16 14:45                                       ` Shilimkar, Santosh
2010-02-16 14:45                                         ` Shilimkar, Santosh
2010-02-16 15:44                                         ` Alan Stern
2010-02-16 15:44                                           ` Alan Stern
2010-02-17  8:55                                       ` Shilimkar, Santosh
2010-02-17  8:55                                         ` Shilimkar, Santosh
2010-02-17  9:10                                         ` Oliver Neukum
2010-02-17  9:10                                           ` Oliver Neukum
2010-02-17  9:17                                           ` Shilimkar, Santosh
2010-02-17  9:17                                             ` Shilimkar, Santosh
2010-02-17 17:02                                         ` Alan Stern
2010-02-17 17:02                                           ` Alan Stern
2010-02-17 20:26                                           ` Russell King - ARM Linux
2010-02-17 20:26                                             ` Russell King - ARM Linux
2010-02-17 20:30                                           ` Gadiyar, Anand
2010-02-17 20:30                                             ` Gadiyar, Anand
2010-02-18  6:56                                             ` Oliver Neukum
2010-02-18  6:56                                               ` Oliver Neukum
2010-02-18  7:14                                               ` Gadiyar, Anand
2010-02-18  7:14                                                 ` Gadiyar, Anand
2010-02-17 12:29                             ` Jamie Lokier
2010-02-17 12:29                               ` Jamie Lokier
2010-02-17  3:21                         ` Ming Lei
2010-02-17  3:21                           ` Ming Lei
2010-02-17  9:05                       ` Benjamin Herrenschmidt
2010-02-17  9:05                         ` Benjamin Herrenschmidt
2010-02-17  9:15                         ` Oliver Neukum
2010-02-17  9:15                           ` Oliver Neukum
2010-02-17  9:40                           ` Benjamin Herrenschmidt
2010-02-17  9:40                             ` Benjamin Herrenschmidt
2010-02-17 10:09                             ` Oliver Neukum
2010-02-17 10:09                               ` Oliver Neukum
2010-02-17 10:18                               ` Benjamin Herrenschmidt
2010-02-17 10:18                                 ` Benjamin Herrenschmidt
2010-02-17 10:23                                 ` Oliver Neukum
2010-02-17 10:23                                   ` Oliver Neukum
2010-02-17 12:15                                   ` Benjamin Herrenschmidt
2010-02-17 12:15                                     ` Benjamin Herrenschmidt
2010-02-17  9:55                         ` Russell King - ARM Linux
2010-02-17  9:55                           ` Russell King - ARM Linux
2010-02-17 10:05                           ` Benjamin Herrenschmidt
2010-02-17 10:05                             ` Benjamin Herrenschmidt
2010-02-17 15:27                         ` Catalin Marinas
2010-02-17 15:27                           ` Catalin Marinas
2010-02-17 20:37                           ` Benjamin Herrenschmidt
2010-02-17 20:37                             ` Benjamin Herrenschmidt
2010-02-17 20:44                             ` Russell King - ARM Linux
2010-02-17 20:44                               ` Russell King - ARM Linux
2010-02-17 22:31                               ` Benjamin Herrenschmidt
2010-02-17 22:31                                 ` Benjamin Herrenschmidt
2010-02-19 17:15                                 ` Catalin Marinas
2010-02-19 17:15                                   ` Catalin Marinas
2010-02-19 17:36                                   ` Catalin Marinas
2010-02-19 17:36                                     ` Catalin Marinas
2010-02-19 20:53                                     ` Oliver Neukum
2010-02-19 20:53                                       ` Oliver Neukum
2010-02-24  2:48                                       ` Benjamin Herrenschmidt
2010-02-24  2:48                                         ` Benjamin Herrenschmidt
2010-02-24  7:16                                         ` Oliver Neukum
2010-02-24  7:16                                           ` Oliver Neukum
2010-02-24 21:12                                           ` Benjamin Herrenschmidt
2010-02-24 21:12                                             ` Benjamin Herrenschmidt
2010-02-25  3:48                                             ` Oliver Neukum
2010-02-25  3:48                                               ` Oliver Neukum
2010-02-26  0:22                                               ` Benjamin Herrenschmidt
2010-02-26  0:22                                                 ` Benjamin Herrenschmidt
2010-02-25 12:36                                             ` James Bottomley
2010-02-25 12:36                                               ` James Bottomley
2010-02-24  2:47                                     ` Benjamin Herrenschmidt
2010-02-24  2:47                                       ` Benjamin Herrenschmidt
2010-02-24 16:19                                       ` Alan Stern
2010-02-24 16:19                                         ` Alan Stern
2010-02-24 21:13                                         ` Benjamin Herrenschmidt
2010-02-24 21:13                                           ` Benjamin Herrenschmidt
2010-02-24 21:50                                           ` Alan Stern
2010-02-24 21:50                                             ` Alan Stern
2010-02-25 20:52                                             ` Benjamin Herrenschmidt
2010-02-25 20:52                                               ` Benjamin Herrenschmidt
2010-02-26 16:00                                           ` Catalin Marinas
2010-02-26 16:00                                             ` Catalin Marinas
2010-02-26 21:36                                             ` Benjamin Herrenschmidt
2010-02-26 21:36                                               ` Benjamin Herrenschmidt
2010-02-26 16:25                                       ` Catalin Marinas
2010-02-26 16:25                                         ` Catalin Marinas
2010-02-26 16:52                                         ` Alan Stern
2010-02-26 16:52                                           ` Alan Stern
2010-02-26 21:51                                           ` Benjamin Herrenschmidt
2010-02-26 21:51                                             ` Benjamin Herrenschmidt
2010-02-26 21:00                                         ` Russell King - ARM Linux
2010-02-26 21:00                                           ` Russell King - ARM Linux
2010-02-28  0:14                                           ` Benjamin Herrenschmidt
2010-02-28  0:14                                             ` Benjamin Herrenschmidt
2010-02-28  5:01                                             ` James Bottomley
2010-02-28  5:01                                               ` James Bottomley
2010-03-01 10:39                                               ` Catalin Marinas
2010-03-01 10:39                                                 ` Catalin Marinas
2010-03-01 11:06                                                 ` Russell King - ARM Linux
2010-03-01 11:06                                                   ` Russell King - ARM Linux
2010-03-02 12:11                                               ` FUJITA Tomonori
2010-03-02 12:11                                                 ` FUJITA Tomonori
2010-03-02 17:05                                                 ` Catalin Marinas
2010-03-02 17:05                                                   ` Catalin Marinas
2010-03-02 17:47                                                   ` Catalin Marinas
2010-03-02 17:47                                                     ` Catalin Marinas
2010-03-02 23:33                                                     ` Benjamin Herrenschmidt
2010-03-02 23:33                                                       ` Benjamin Herrenschmidt
2010-03-03 10:21                                                       ` Catalin Marinas
2010-03-03 10:21                                                         ` Catalin Marinas
2010-03-02 23:29                                                   ` Benjamin Herrenschmidt
2010-03-02 23:29                                                     ` Benjamin Herrenschmidt
2010-03-03  3:47                                                     ` FUJITA Tomonori
2010-03-03  3:47                                                       ` FUJITA Tomonori
2010-03-03  5:10                                                       ` Benjamin Herrenschmidt
2010-03-03  5:10                                                         ` Benjamin Herrenschmidt
2010-03-03  5:40                                                         ` James Bottomley
2010-03-03  5:40                                                           ` James Bottomley
2010-03-03  9:36                                                           ` Russell King - ARM Linux
2010-03-03  9:36                                                             ` Russell King - ARM Linux
2010-03-03 10:24                                                             ` James Bottomley
2010-03-03 10:24                                                               ` James Bottomley
2010-03-03 19:41                                                               ` Russell King - ARM Linux
2010-03-03 19:41                                                                 ` Russell King - ARM Linux
2010-03-04  2:00                                                           ` Benjamin Herrenschmidt
2010-03-04  2:00                                                             ` Benjamin Herrenschmidt
2010-03-04  8:26                                                             ` James Bottomley
2010-03-04  8:26                                                               ` James Bottomley
2010-03-04 21:25                                                               ` Benjamin Herrenschmidt
2010-03-04 21:25                                                                 ` Benjamin Herrenschmidt
2010-03-03  6:35                                                         ` FUJITA Tomonori
2010-03-03  6:35                                                           ` FUJITA Tomonori
2010-03-03 10:43                                                       ` Catalin Marinas
2010-03-03 10:43                                                         ` Catalin Marinas
2010-03-03 10:40                                                     ` Catalin Marinas
2010-03-03 10:40                                                       ` Catalin Marinas
2010-03-03 21:54                                                   ` Pavel Machek
2010-03-03 21:54                                                     ` Pavel Machek
2010-03-04  6:54                                                     ` Wolfgang Mües
2010-03-04  9:31                                                       ` Russell King - ARM Linux
2010-03-06 10:56                                                         ` Wolfgang Mües
2010-03-06 11:05                                                           ` Oliver Neukum
2010-03-06 19:44                                                           ` Russell King - ARM Linux
2010-03-04 13:47                                                       ` Catalin Marinas
2010-03-04 13:35                                                     ` Catalin Marinas
2010-03-04 13:35                                                       ` Catalin Marinas
2010-03-04 13:51                                                       ` Pavel Machek
2010-03-04 13:51                                                         ` Pavel Machek
2010-03-04 14:21                                                         ` James Bottomley
2010-03-04 14:21                                                           ` James Bottomley
2010-03-04 14:27                                                           ` Russell King - ARM Linux
2010-03-04 14:27                                                             ` Russell King - ARM Linux
2010-03-04 15:25                                                             ` Catalin Marinas
2010-03-04 15:25                                                               ` Catalin Marinas
2010-03-04 15:34                                                               ` Russell King - ARM Linux
2010-03-04 15:34                                                                 ` Russell King - ARM Linux
2010-03-04 21:31                                                               ` Benjamin Herrenschmidt
2010-03-04 21:31                                                                 ` Benjamin Herrenschmidt
2010-03-06 10:47                                                             ` James Bottomley
2010-03-06 10:47                                                               ` James Bottomley
2010-03-06 19:36                                                               ` Russell King - ARM Linux
2010-03-06 19:36                                                                 ` Russell King - ARM Linux
2010-03-06 21:07                                                                 ` Benjamin Herrenschmidt
2010-03-06 21:07                                                                   ` Benjamin Herrenschmidt
2010-03-07  5:54                                                                 ` James Bottomley
2010-03-07  5:54                                                                   ` James Bottomley
2010-03-08 11:17                                                                 ` Catalin Marinas
2010-03-08 11:17                                                                   ` Catalin Marinas
2010-03-06 21:03                                                               ` Benjamin Herrenschmidt
2010-03-06 21:03                                                                 ` Benjamin Herrenschmidt
2010-03-07  3:37                                                                 ` James Bottomley
2010-03-07  3:37                                                                   ` James Bottomley
2010-03-08  8:46                                                                   ` FUJITA Tomonori
2010-03-08  8:46                                                                     ` FUJITA Tomonori
2010-03-09  2:25                                                                   ` Benjamin Herrenschmidt
2010-03-09  2:25                                                                     ` Benjamin Herrenschmidt
2010-03-04 15:29                                                           ` Catalin Marinas
2010-03-04 15:29                                                             ` Catalin Marinas
2010-03-04 15:41                                                             ` Paul Mundt
2010-03-04 15:41                                                               ` Paul Mundt
2010-03-04 16:30                                                               ` Russell King - ARM Linux
2010-03-04 16:30                                                                 ` Russell King - ARM Linux
2010-03-04 17:34                                                                 ` Catalin Marinas
2010-03-04 17:34                                                                   ` Catalin Marinas
2010-03-04 17:54                                                                   ` Russell King - ARM Linux
2010-03-04 17:54                                                                     ` Russell King - ARM Linux
2010-03-04 22:27                                                                 ` Andreas Mohr
2010-03-04 18:07                                                               ` Catalin Marinas
2010-03-04 18:07                                                                 ` Catalin Marinas
2010-03-04 21:37                                                                 ` Benjamin Herrenschmidt
2010-03-04 21:37                                                                   ` Benjamin Herrenschmidt
2010-03-04 22:11                                                                   ` Catalin Marinas
2010-03-04 22:11                                                                     ` Catalin Marinas
2010-03-05  4:34                                                                     ` Benjamin Herrenschmidt
2010-03-05  4:34                                                                       ` Benjamin Herrenschmidt
2010-03-05  9:27                                                                       ` Catalin Marinas
2010-03-05  9:27                                                                         ` Catalin Marinas
2010-03-05  1:17                                                                   ` Paul Mundt
2010-03-05  1:17                                                                     ` Paul Mundt
2010-03-05  1:17                                                                     ` Paul Mundt
2010-03-05  4:44                                                                     ` Benjamin Herrenschmidt
2010-03-05  4:44                                                                       ` Benjamin Herrenschmidt
2010-03-05  4:44                                                                       ` Benjamin Herrenschmidt
2010-03-10  3:52                                                                       ` Paul Mundt
2010-03-10  3:52                                                                         ` Paul Mundt
2010-03-10  3:52                                                                         ` Paul Mundt
2010-03-11 21:44                                                                         ` Benjamin Herrenschmidt
2010-03-11 21:44                                                                           ` Benjamin Herrenschmidt
2010-03-11 21:44                                                                           ` Benjamin Herrenschmidt
2010-03-04 21:34                                                               ` Benjamin Herrenschmidt
2010-03-04 21:34                                                                 ` Benjamin Herrenschmidt
2010-03-04 21:28                                                           ` Benjamin Herrenschmidt
2010-03-04 21:28                                                             ` Benjamin Herrenschmidt
2010-03-04 21:40                                                             ` Russell King - ARM Linux
2010-03-04 21:40                                                               ` Russell King - ARM Linux
2010-03-05  4:31                                                               ` Benjamin Herrenschmidt
2010-03-05  4:31                                                                 ` Benjamin Herrenschmidt
2010-03-04 15:35                                                         ` Catalin Marinas
2010-03-04 15:35                                                           ` Catalin Marinas
2010-03-07  8:23                                                           ` Pavel Machek
2010-03-07  8:23                                                             ` Pavel Machek
2010-03-08 10:57                                                             ` Catalin Marinas
2010-03-08 10:57                                                               ` Catalin Marinas
2010-03-02 23:26                                                 ` Benjamin Herrenschmidt
2010-03-02 23:26                                                   ` Benjamin Herrenschmidt
2010-03-01 10:42                                             ` Catalin Marinas
2010-03-01 10:42                                               ` Catalin Marinas
2010-03-03 20:24                                               ` Jamie Lokier
2010-03-03 20:24                                                 ` Jamie Lokier
2010-02-26 21:40                                         ` Benjamin Herrenschmidt
2010-02-26 21:40                                           ` Benjamin Herrenschmidt
2010-02-26 21:49                                           ` Russell King - ARM Linux
2010-02-26 21:49                                             ` Russell King - ARM Linux
2010-02-28  0:24                                             ` Benjamin Herrenschmidt
2010-02-28  0:24                                               ` Benjamin Herrenschmidt
2010-02-28 19:17                                               ` Pavel Machek
2010-02-28 19:17                                                 ` Pavel Machek
2010-03-01 11:10                                               ` Catalin Marinas
2010-03-01 11:10                                                 ` Catalin Marinas
2010-03-02  4:11                                                 ` Benjamin Herrenschmidt
2010-03-02  4:11                                                   ` Benjamin Herrenschmidt
2010-02-24  2:39                                   ` Benjamin Herrenschmidt
2010-02-24  2:39                                     ` Benjamin Herrenschmidt
2010-02-26 16:44                                     ` Catalin Marinas
2010-02-26 16:44                                       ` Catalin Marinas
2010-02-26 21:49                                       ` Benjamin Herrenschmidt
2010-02-26 21:49                                         ` Benjamin Herrenschmidt
2010-02-26 22:03                                         ` Russell King - ARM Linux
2010-02-26 22:03                                           ` Russell King - ARM Linux
2010-02-28  0:29                                           ` Benjamin Herrenschmidt
2010-02-28  0:29                                             ` Benjamin Herrenschmidt
2010-02-28 23:20                                           ` Catalin Marinas
2010-02-28 23:20                                             ` Catalin Marinas
2010-02-28 23:17                                         ` Catalin Marinas
2010-02-28 23:17                                           ` Catalin Marinas
2010-02-17 15:27                         ` Catalin Marinas
2010-02-17 15:27                           ` Catalin Marinas
2010-02-17 15:39                         ` Catalin Marinas
2010-02-17 15:39                           ` Catalin Marinas
2010-02-17 15:40                         ` Catalin Marinas
2010-02-17 15:40                           ` Catalin Marinas
2010-02-17 15:40                         ` Catalin Marinas
2010-02-17 15:40                           ` Catalin Marinas
2010-02-17 16:19                           ` Catalin Marinas
2010-02-17 16:19                             ` Catalin Marinas
2010-02-17 16:19                           ` Re: " Catalin Marinas
2010-02-17 16:19                             ` Catalin Marinas
2010-02-16  8:44                     ` Russell King - ARM Linux
2010-02-16  8:44                       ` Russell King - ARM Linux
2010-02-16  8:51                       ` Gadiyar, Anand
2010-02-16  8:51                         ` Gadiyar, Anand
2010-02-20  7:21                         ` Pete Zaitcev
2010-02-20  7:21                           ` Pete Zaitcev
2010-02-03 23:56 George Spelvin
2010-02-04  4:39 ` Paul Mundt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.