* LKMM: Read dependencies of writes ordered by dma_wmb()? @ 2021-08-16 10:12 Marco Elver
From: Marco Elver @ 2021-08-16 10:12 UTC (permalink / raw)
To: Paul E. McKenney, Boqun Feng, Alan Stern, Andrea Parri, Will Deacon, Mark Rutland
Cc: Dmitry Vyukov, kasan-dev, linux-kernel

Hello,

Commit c58a801701693 added a paragraph to the LKMM:

+Although we said that plain accesses are not linked by the ppo
+relation, they do contribute to it indirectly. Namely, when there is
+an address dependency from a marked load R to a plain store W,
+followed by smp_wmb() and then a marked store W', the LKMM creates a
+ppo link from R to W'.

This defines that certain _marked reads_ will also be ordered by
smp_wmb(). But otherwise, other reads (especially plain reads!) will
_never_ be ordered by smp_wmb(). Is my understanding correct?

I am asking because KCSAN is growing limited support for weak memory
modeling and memory barriers, and I'm trying to figure out if I'm seeing
a false positive or a genuinely allowed race.

One caveat is that the case I'm trying to understand doesn't involve
just 2 CPUs but also a device. And for now, I'm assuming that dma_wmb()
is as strong as smp_wmb() also wrt other CPUs (but my guess is this
assumption is already too strong).

The whole area of the memory model that includes talking to devices and
devices talking back to CPUs seems quite murky, and I need to confirm
whether I got it right or wrong.
:-)

The report (explained below):

| assert no accesses to 0xffff8880077b5500 of 232 bytes by interrupt on cpu 1:
|  __cache_free mm/slab.c:3450 [inline]
|  kmem_cache_free+0x4b/0xe0 mm/slab.c:3740
|  kfree_skbmem net/core/skbuff.c:709 [inline]
|  __kfree_skb+0x145/0x190 net/core/skbuff.c:745
|  consume_skb+0x6d/0x190 net/core/skbuff.c:900
|  __dev_kfree_skb_any+0xb8/0xc0 net/core/dev.c:3195
|  dev_kfree_skb_any include/linux/netdevice.h:3979 [inline]
|  e1000_unmap_and_free_tx_resource drivers/net/ethernet/intel/e1000/e1000_main.c:1969 [inline]
|  e1000_clean_tx_irq drivers/net/ethernet/intel/e1000/e1000_main.c:3859 [inline]
|  e1000_clean+0x302/0x2080 drivers/net/ethernet/intel/e1000/e1000_main.c:3800
|  __napi_poll+0x81/0x430 net/core/dev.c:7019
|  napi_poll net/core/dev.c:7086 [inline]
|  net_rx_action+0x2cf/0x6b0 net/core/dev.c:7173
|  __do_softirq+0x12c/0x275 kernel/softirq.c:558
|  [...]
|
| read (reordered) to 0xffff8880077b5570 of 4 bytes by task 1985 on cpu 0:
|  skb_headlen include/linux/skbuff.h:2139 [inline]
|  e1000_tx_map drivers/net/ethernet/intel/e1000/e1000_main.c:2829 [inline]
|  e1000_xmit_frame+0x12fd/0x2720 drivers/net/ethernet/intel/e1000/e1000_main.c:3243
|  __netdev_start_xmit include/linux/netdevice.h:4944 [inline]
|  netdev_start_xmit include/linux/netdevice.h:4958 [inline]
|  xmit_one+0x103/0x2c0 net/core/dev.c:3658
|  dev_hard_start_xmit+0x70/0x130 net/core/dev.c:3674
|  sch_direct_xmit+0x1e5/0x600 net/sched/sch_generic.c:342
|  __dev_xmit_skb net/core/dev.c:3874 [inline]
|  __dev_queue_xmit+0xd26/0x1990 net/core/dev.c:4241
|  dev_queue_xmit+0x1d/0x30 net/core/dev.c:4306
|  [...]
|
|  |
|  +-> reordered to: e1000_xmit_frame+0x2294/0x2720 drivers/net/ethernet/intel/e1000/e1000_main.c:3282

KCSAN is saying there is a potential use-after-free read of an skb due
to the read to 0xffff8880077b5570 potentially being delayed/reordered
later. If the memory was reallocated and reused concurrently, the read
could read garbage data:

1.
The e1000 driver is being instructed to transmit in
   e1000_xmit_frame(). Here it uses the data in the skb in various
   places (e.g. in skb_headlen() above) to set up a new element in the
   ring buffer to be consumed by the device via DMA.

2. Eventually it calls e1000_tx_queue(), which seems to publish the
   next entry into the ring buffer and finally calls dma_wmb(). Until
   this point I see no other barriers (although there's a writel(), but
   it doesn't always seem to be called).

3. e1000_clean_tx_irq() is called on another CPU after transmit
   completes, and we know the device has consumed that entry from the
   ring buffer. At this point the driver then says that the associated
   skb can be kfree()'d.

4. If I interpreted dma_wmb() (and smp_wmb()) right, plain reads may be
   reordered after it, irrespective of whether a write that depended on
   such reads was ordered by the wmb(). Which means the plain reads
   accessing the skb may in fact happen concurrently with the kfree()
   of the skb, for example if reordered to the very end of
   e1000_xmit_frame() (line 3282), as KCSAN simulated in this case.

Is the above result allowed by the kernel's memory model?

In practice, my guess is that no compiler and architecture combination
would allow this today; or is there an arch where it could?

Thanks,
-- Marco

^ permalink raw reply [flat|nested] 11+ messages in thread
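The pattern from the LKMM paragraph quoted at the start of the thread can be written as a herd7-style litmus test. This is an untested sketch (variable names are made up, and the init-section syntax follows the examples in tools/memory-model/litmus-tests); if the rule is as quoted, the exists clause should be reported as forbidden, because the plain store W must propagate before W':

```
C R-plainW-wmb-markedW

{
	p=a;
}

P0(int **p, int *y)
{
	int *r0;

	r0 = READ_ONCE(*p);	/* marked load R */
	*r0 = 1;		/* plain store W, address-dependent on R */
	smp_wmb();
	WRITE_ONCE(*y, 1);	/* marked store W': ppo link from R to W' */
}

P1(int *y, int *a)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*a);
}

exists (1:r1=1 /\ 1:r2=0)
```

Note that only the marked store W' (and, indirectly, W's propagation) is ordered here; any plain reads on P0 would not be ordered by the smp_wmb().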
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-16 10:12 LKMM: Read dependencies of writes ordered by dma_wmb()? Marco Elver @ 2021-08-16 14:59 ` Alan Stern 2021-08-16 17:23 ` Marco Elver 0 siblings, 1 reply; 11+ messages in thread From: Alan Stern @ 2021-08-16 14:59 UTC (permalink / raw) To: Marco Elver Cc: Paul E. McKenney, Boqun Feng, Andrea Parri, Will Deacon, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel On Mon, Aug 16, 2021 at 12:12:01PM +0200, Marco Elver wrote: > Hello, > > Commit c58a801701693 added a paragraph to the LKMM: > > +Although we said that plain accesses are not linked by the ppo > +relation, they do contribute to it indirectly. Namely, when there is > +an address dependency from a marked load R to a plain store W, > +followed by smp_wmb() and then a marked store W', the LKMM creates a > +ppo link from R to W'. > > Defining that certain _marked reads_ will also be ordered by smp_wmb(). > But otherwise, other reads (especially plain reads!) will _never_ be > ordered by smp_wmb(). Is my understanding correct? The ordering is indirect, but yes. > I am asking because KCSAN is growing limited support for weak memory > modeling and memory barriers, and I'm trying to figure out if I'm seeing > a false positive or genuinely allowed race. > > One caveat is the case I'm trying to understand doesn't involve just 2 > CPUs but also a device. And for now, I'm assuming that dma_wmb() is as > strong as smp_wmb() also wrt other CPUs (but my guess is this > assumption is already too strong). I'm not sure that is right. dma_wmb affects the visibility of writes to a DMA buffer from the point of view of the device, not necessarily from the point of view of other CPUs. At least, there doesn't seem to be any claim in memory-barriers.txt that it does so. > The whole area of the memory model that includes talking to devices and > devices talking back to CPUs seems quite murky, and need to confirm that > I either got it right or wrong. 
:-) The LKMM itself doesn't include anything about device I/O. So you're already going beyond the known limits. :-) ... > KCSAN is saying there is a potential use-after-free read of an skb due > to the read to 0xffff8880077b5570 potentially being delayed/reordered > later. If the memory was reallocated and reused concurrently, the read > could read garbage data: > > 1. The e1000 driver is being instructed to transmit in > e1000_xmit_frame(). Here it uses the data in the skb in various > places (e.g. in skb_headlen() above) to set up a new element in > the ring buffer to be consumed by the device via DMA. You mean here the driver reads some stuff from the skb, right? And various writes depend on the data that was read, but these dependencies aren't evident to the memory model because they all involve plain accesses. > 2. Eventually it calls e1000_tx_queue(), which seems to publish the > next entry into the ring buffer and finally calls dma_wmb(). > Until this point I see no other barriers (although there's a > writel(), but it doesn't always seem to be called). And potentially those reads from above could be delayed (or repeated) after this point. But you're missing something. What matters isn't the dma_wmb. Rather, it's the call which transfers ownership of the buffer to the device. That call must certainly include its own memory barrier, meaning that the reads must complete before the call returns. We don't depend on a dma_wmb which might or might not be present to enforce this ordering. Unless this buffer mapping is supposed to be coherent, of course, in which case there would be no ownership transfers. > 3. e1000_clean_tx_irq() is called on another CPU after transmit > completes, and we know the device has consumed that entry from > the ring buffer. At this point the driver then says that the > associated skb can be kfree()'d. > > 4. 
If I interpreted dma_wmb() (and smp_wmb()) right, plain reads > may be reordered after it, irrespective if a write that depended > on such reads was ordered by the wmb(). Which means the > reordering of the plain reads accessing the skb before it may in > fact happen concurrently with the kfree() of skb if reordered > after. For example reordered to the very end of > e1000_xmit_frame() (line 3282) as KCSAN simulated in this case. > > Is the above result allowed by the kernel's memory model? This can't happen, for the reason explained above, if the buffer is non-coherent. But if the DMA mapping is coherent, this does sound like a bug. > In practice, my guess is no compiler and architecture combination would > allow this today; or is there an arch where it could? Probably not; reordering of reads tends to take place over time scales a lot shorter than lengthy I/O operations. Alan ^ permalink raw reply [flat|nested] 11+ messages in thread
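The ownership-transfer point above refers to the streaming DMA API. For contrast with e1000's coherent descriptor ring, a schematic non-coherent TX path might look like this (a sketch using the generic DMA API; `dev`, `skb`, the error path, and `queue_descriptor()` are placeholders, not real driver code):

```c
/* Streaming mapping: ownership of skb->data passes to the device. */
dma_addr_t addr = dma_map_single(dev, skb->data, skb->len, DMA_TO_DEVICE);
if (dma_mapping_error(dev, addr))
	goto drop;			/* hypothetical error path */

/*
 * dma_map_single() contains the arch-specific sync (cache maintenance
 * and/or barrier), so prior CPU accesses to skb->data complete before
 * the device may observe the buffer.
 */
queue_descriptor(addr, skb->len);	/* hypothetical driver helper */

/* Later, on TX completion, ownership returns to the CPU: */
dma_unmap_single(dev, addr, skb->len, DMA_TO_DEVICE);
dev_kfree_skb_any(skb);
```

With a coherent mapping (dma_alloc_coherent(), as e1000 uses for its ring) there are no such per-buffer ownership transfers, which is what makes the barrier question above relevant.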
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()?
From: Marco Elver @ 2021-08-16 17:23 UTC (permalink / raw)
To: Alan Stern
Cc: Paul E. McKenney, Boqun Feng, Andrea Parri, Will Deacon, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel

On Mon, Aug 16, 2021 at 10:59AM -0400, Alan Stern wrote:
[...]
> > One caveat is the case I'm trying to understand doesn't involve just 2
> > CPUs but also a device. And for now, I'm assuming that dma_wmb() is as
> > strong as smp_wmb() also wrt other CPUs (but my guess is this
> > assumption is already too strong).
>
> I'm not sure that is right.  dma_wmb affects the visibility of writes to
> a DMA buffer from the point of view of the device, not necessarily from
> the point of view of other CPUs.  At least, there doesn't seem to be any
> claim in memory-barriers.txt that it does so.

Thanks, I thought so.

While I could just not instrument dma_*mb() at all, because KCSAN
obviously can't instrument what devices do, I wonder if the resulting
reports are at all interesting.

For example, if I do not make the assumption that dma_wmb() == smp_wmb(),
and don't instrument dma_*mb() at all, I also get racy UAF reordered
writes: I could imagine some architecture where dma_wmb() propagates the
write to devices from CPU 0; but CPU 1 then does the kfree(), reallocates,
reuses the data, but then gets its data overwritten by CPU 0.

What would be more useful?

1. Let the architecture decide how they want KCSAN to instrument non-smp
   barriers, given it's underspecified. This means KCSAN would report
   different races on different architectures, but keep the noise down.

2. Assume the weakest possible model, where non-smp barriers just do
   nothing wrt other CPUs.
> > The whole area of the memory model that includes talking to devices and > > devices talking back to CPUs seems quite murky, and need to confirm that > > I either got it right or wrong. :-) > > The LKMM itself doesn't include anything about device I/O. So you're > already going beyond the known limits. :-) > > ... > > > KCSAN is saying there is a potential use-after-free read of an skb due > > to the read to 0xffff8880077b5570 potentially being delayed/reordered > > later. If the memory was reallocated and reused concurrently, the read > > could read garbage data: > > > > 1. The e1000 driver is being instructed to transmit in > > e1000_xmit_frame(). Here it uses the data in the skb in various > > places (e.g. in skb_headlen() above) to set up a new element in > > the ring buffer to be consumed by the device via DMA. > > You mean here the driver reads some stuff from the skb, right? And > various writes depend on the data that was read, but these dependencies > aren't evident to the memory model because they all involve plain > accesses. Yes, correct. > > 2. Eventually it calls e1000_tx_queue(), which seems to publish the > > next entry into the ring buffer and finally calls dma_wmb(). > > Until this point I see no other barriers (although there's a > > writel(), but it doesn't always seem to be called). > > And potentially those reads from above could be delayed (or repeated) > after this point. > > But you're missing something. What matters isn't the dma_wmb. Rather, > it's the call which transfers ownership of the buffer to the device. > That call must certainly include its own memory barrier, meaning that > the reads must complete before the call returns. We don't depend on a > dma_wmb which might or might not be present to enforce this ordering. > > Unless this buffer mapping is supposed to be coherent, of course, in > which case there would be no ownership transfers. 
I think it's coherent:

	txdr->desc = dma_alloc_coherent(&pdev->dev, txdr->size, &txdr->dma,
					GFP_KERNEL);

and then in:

	static void e1000_tx_queue(...)
	{
		... writes to desc ...

		/* Force memory writes to complete before letting h/w
		 * know there are new descriptors to fetch.  (Only
		 * applicable for weak-ordered memory model archs,
		 * such as IA-64). */
		dma_wmb();

		tx_ring->next_to_use = i;
	}

used by:

	static netdev_tx_t e1000_xmit_frame(...)
	{
		...
		e1000_tx_queue(..., tx_ring, ...);
		...
		if (!netdev_xmit_more() ||
		    netif_xmit_stopped(netdev_get_tx_queue(netdev, 0))) {
			writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
		}
		...
	}

My best guess: as long as the device is fetching from the ring, the
driver can just append to it and does not do the writel().

> > 3. e1000_clean_tx_irq() is called on another CPU after transmit
> >    completes, and we know the device has consumed that entry from
> >    the ring buffer. At this point the driver then says that the
> >    associated skb can be kfree()'d.
> >
> > 4. If I interpreted dma_wmb() (and smp_wmb()) right, plain reads
> >    may be reordered after it, irrespective if a write that depended
> >    on such reads was ordered by the wmb(). Which means the
> >    reordering of the plain reads accessing the skb before it may in
> >    fact happen concurrently with the kfree() of skb if reordered
> >    after. For example reordered to the very end of
> >    e1000_xmit_frame() (line 3282) as KCSAN simulated in this case.
> >
> > Is the above result allowed by the kernel's memory model?
>
> This can't happen, for the reason explained above, if the buffer is
> non-coherent.  But if the DMA mapping is coherent, this does sound like
> a bug.

Makes sense.

> > In practice, my guess is no compiler and architecture combination would
> > allow this today; or is there an arch where it could?
>
> Probably not; reordering of reads tends to take place over time
> scales a lot shorter than lengthy I/O operations.
Which might be an argument to make KCSAN's non-smp barrier
instrumentation arch-dependent, because some drivers might in fact be
written with some target architectures and their properties in mind. At
least it would help keep the noise down, and those architectures that
want to see such races certainly still could.

Any preferences?

Thanks,
-- Marco
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-16 17:23 ` Marco Elver @ 2021-08-16 19:21 ` Alan Stern 2021-08-16 20:50 ` Paul E. McKenney 0 siblings, 1 reply; 11+ messages in thread From: Alan Stern @ 2021-08-16 19:21 UTC (permalink / raw) To: Marco Elver Cc: Paul E. McKenney, Boqun Feng, Andrea Parri, Will Deacon, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel On Mon, Aug 16, 2021 at 07:23:51PM +0200, Marco Elver wrote: > On Mon, Aug 16, 2021 at 10:59AM -0400, Alan Stern wrote: > [...] > > > One caveat is the case I'm trying to understand doesn't involve just 2 > > > CPUs but also a device. And for now, I'm assuming that dma_wmb() is as > > > strong as smp_wmb() also wrt other CPUs (but my guess is this > > > assumption is already too strong). > > > > I'm not sure that is right. dma_wmb affects the visibility of writes to > > a DMA buffer from the point of view of the device, not necessarily from > > the point of view of other CPUs. At least, there doesn't seem to be any > > claim in memory-barriers.txt that it does so. > > Thanks, I thought so. > > While I could just not instrument dma_*mb() at all, because KCSAN > obviously can't instrument what devices do, I wonder if the resulting > reports are at all interesting. > > For example, if I do not make the assumption that dma_wmb==smp_smb, and > don't instrument dma_*mb() at all, I also get racy UAF reordered writes: > I could imagine some architecture where dma_wmb() propagates the write > to devices from CPU 0; but CPU 1 then does the kfree(), reallocates, > reuses the data, but then gets its data overwritten by CPU 0. Access ordering of devices is difficult to describe. How do you tell a memory model (either a theoretical one or one embedded in code like KCSAN) that a particular interrupt handler routine can't be called until after a particular write has enabled the device to generate an IRQ? 
In the case you mention, how do you tell the memory model that the code
on CPU 1 can't run until after CPU 0 has executed a particular write, one
which is forced by some memory barrier to occur _after_ all the potential
overwrites it's worried about?

> What would be more useful?
>
> 1. Let the architecture decide how they want KCSAN to instrument non-smp
>    barriers, given it's underspecified. This means KCSAN would report
>    different races on different architectures, but keep the noise down.
>
> 2. Assume the weakest possible model, where non-smp barriers just do
>    nothing wrt other CPUs.

I don't think either of those would work out very well.  The problem
isn't how you handle the non-smp barriers; the problem is how you
describe to the memory model the way devices behave.

...

> > > In practice, my guess is no compiler and architecture combination would
> > > allow this today; or is there an arch where it could?
> >
> > Probably not; reordering of reads tends to take place over time
> > scales a lot shorter than lengthy I/O operations.
>
> Which might be an argument to make KCSAN's non-smp barrier
> instrumentation arch-dependent, because some drivers might in fact be
> written with some target architectures and their properties in mind. At
> least it would help keep the noise down, and those architecture that
> want to see such races certainly still could.
>
> Any preferences?

I'm not a good person to ask; I have never used KCSAN.  However...

While some drivers are indeed written for particular architectures or
systems, I doubt that they rely very heavily on the special properties of
their target architectures/systems to avoid races.  Rather, they rely on
the hardware to behave correctly, just as non-arch-specific drivers do.
Furthermore, the kernel tries pretty hard to factor out arch-specific
synchronization mechanisms and related concepts into general-purpose
abstractions (in the way that smp_mb() is generally available but is
defined differently for different architectures, for example).  Drivers
tend to rely on these abstractions rather than on the arch-specific
properties directly.

In short, trying to make KCSAN's handling of device I/O into something
arch-specific doesn't seem (to me) like a particularly advantageous
approach.  Other people are likely to have different opinions.

Alan
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-16 19:21 ` Alan Stern @ 2021-08-16 20:50 ` Paul E. McKenney 2021-08-17 12:14 ` Marco Elver 2021-08-17 12:28 ` Will Deacon 0 siblings, 2 replies; 11+ messages in thread From: Paul E. McKenney @ 2021-08-16 20:50 UTC (permalink / raw) To: Alan Stern Cc: Marco Elver, Boqun Feng, Andrea Parri, Will Deacon, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel On Mon, Aug 16, 2021 at 03:21:09PM -0400, Alan Stern wrote: > On Mon, Aug 16, 2021 at 07:23:51PM +0200, Marco Elver wrote: > > On Mon, Aug 16, 2021 at 10:59AM -0400, Alan Stern wrote: > > [...] > > > > One caveat is the case I'm trying to understand doesn't involve just 2 > > > > CPUs but also a device. And for now, I'm assuming that dma_wmb() is as > > > > strong as smp_wmb() also wrt other CPUs (but my guess is this > > > > assumption is already too strong). > > > > > > I'm not sure that is right. dma_wmb affects the visibility of writes to > > > a DMA buffer from the point of view of the device, not necessarily from > > > the point of view of other CPUs. At least, there doesn't seem to be any > > > claim in memory-barriers.txt that it does so. > > > > Thanks, I thought so. > > > > While I could just not instrument dma_*mb() at all, because KCSAN > > obviously can't instrument what devices do, I wonder if the resulting > > reports are at all interesting. > > > > For example, if I do not make the assumption that dma_wmb==smp_smb, and > > don't instrument dma_*mb() at all, I also get racy UAF reordered writes: > > I could imagine some architecture where dma_wmb() propagates the write > > to devices from CPU 0; but CPU 1 then does the kfree(), reallocates, > > reuses the data, but then gets its data overwritten by CPU 0. > > Access ordering of devices is difficult to describe. 
How do you tell a > memory model (either a theoretical one or one embedded in code like > KCSAN) that a particular interrupt handler routine can't be called until > after a particular write has enabled the device to generate an IRQ? > > In the case you mention, how do you tell the memory model that the code > on CPU 1 can't run until after CPU 0 has executed a particular write, one > which is forced by some memory barrier to occur _after_ all the potential > overwrites its worried about? What Alan said on the difficulty! However, KCSAN has the advantage of not needing to specify the outcomes, which is much of the complexity. For LKMM to do a good job of handling devices, we would need a model of each device(!). > > What would be more useful? > > > > 1. Let the architecture decide how they want KCSAN to instrument non-smp > > barriers, given it's underspecified. This means KCSAN would report > > different races on different architectures, but keep the noise down. > > > > 2. Assume the weakest possible model, where non-smp barriers just do > > nothing wrt other CPUs. > > I don't think either of those would work out very well. The problem > isn't how you handle the non-smp barriers; the problem is how you > describe to the memory model the way devices behave. There are some architecture-independent ordering guarantees for MMIO which go something like this: 0. MMIO readX() and writeX() accesses to the same device are implicitly ordered, whether relaxed or not. 1. Locking partitions non-relaxed MMIO accesses in the manner that you would expect. For example, if CPU 0 does an MMIO write, then releases a lock, and later CPU 1 acquires that same lock and does an MMIO read, CPU 0's MMIO write is guaranteed to happen before CPU 1's MMIO read. PowerPC has to jump through a few hoops to make this happen. Relaxed MMIO accesses such as readb_relaxed() can be reordered with locking primitives on some architectures. 2. 
smp_*() memory barriers are not guaranteed to affect MMIO
	accesses, especially not in kernels built with CONFIG_SMP=n.

3.	The mb() memory barrier is required to order prior MMIO
	accesses against subsequent MMIO accesses.  The wmb() and rmb()
	memory barriers are required to order prior MMIO writes/reads
	against later MMIO writes/reads, respectively.  These memory
	barriers also order normal memory accesses in the same way as
	their smp_*() counterparts.

4.	The mmiowb() memory barrier can be slightly weaker than wmb(),
	as it is in ia64, but I have lost track of the details.

5.	The dma_mb(), dma_rmb(), and dma_wmb() appear to be specific
	to ARMv8.

6.	Non-relaxed MMIO writeX() accesses force ordering of prior
	normal memory writes before any DMA initiated by the writeX().

7.	Non-relaxed MMIO readX() accesses force ordering of later
	normal memory reads after any DMA whose completion is reported
	by the readX().  These readX() accesses are also ordered before
	any subsequent delay loops.

Some more detail is available in memory-barriers.txt and in this LWN
article: https://lwn.net/Articles/698014/

I wish I could promise you that these are both fully up to date, but
it is almost certain that updates are needed.

> ...
>
> > > > In practice, my guess is no compiler and architecture combination would
> > > > allow this today; or is there an arch where it could?
> > >
> > > Probably not; reordering of reads tends to take place over time
> > > scales a lot shorter than lengthy I/O operations.
> >
> > Which might be an argument to make KCSAN's non-smp barrier
> > instrumentation arch-dependent, because some drivers might in fact be
> > written with some target architectures and their properties in mind. At
> > least it would help keep the noise down, and those architecture that
> > want to see such races certainly still could.
> >
> > Any preferences?
>
> I'm not a good person to ask; I have never used KCSAN.  However...
> > While some drivers are indeed written for particular architectures or > systems, I doubt that they rely very heavily on the special properties of > their target architectures/systems to avoid races. Rather, they rely on > the hardware to behave correctly, just as non-arch-specific drivers do. > > Furthermore, the kernel tries pretty hard to factor out arch-specific > synchronization mechanisms and related concepts into general-purpose > abstractions (in the way that smp_mb() is generally available but is > defined differently for different architectures, for example). Drivers > tend to rely on these abstractions rather than on the arch-specific > properties directly. > > In short, trying to make KCSAN's handling of device I/O into something > arch-specific doesn't seem (to me) like a particular advantageous > approach. Other people are likely to have different opinions. No preconceived notions here, at least not on this topic. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 11+ messages in thread
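Rules 6 and 7 from the list above can be illustrated with a schematic driver fragment. This is only a sketch; the register offsets (TX_TAIL, RX_STATUS), flag bits, and surrounding code are invented for illustration, not taken from any real driver:

```c
/* Rule 6: a non-relaxed writeX() orders prior normal-memory writes
 * before any DMA the write kicks off. */
desc->addr = buf_dma;		/* plain writes to a coherent descriptor */
desc->len  = len;
writel(tail, ioaddr + TX_TAIL);	/* device may start DMA; it is guaranteed
				 * to observe the descriptor writes above */

/* Rule 7: a non-relaxed readX() orders later normal-memory reads after
 * any DMA whose completion the read reports. */
status = readl(ioaddr + RX_STATUS);
if (status & RX_DONE)
	process(rx_buf);	/* reads of rx_buf see the DMA'd data */
```

Note that neither rule says anything about ordering the CPU's earlier plain *reads*, which is the specific gap the thread is probing.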
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()?
From: Marco Elver @ 2021-08-17 12:14 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Alan Stern, Boqun Feng, Andrea Parri, Will Deacon, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel

On Mon, Aug 16, 2021 at 01:50PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 16, 2021 at 03:21:09PM -0400, Alan Stern wrote:
[...]
> > Access ordering of devices is difficult to describe. How do you tell a
> > memory model (either a theoretical one or one embedded in code like
> > KCSAN) that a particular interrupt handler routine can't be called until
> > after a particular write has enabled the device to generate an IRQ?
> >
> > In the case you mention, how do you tell the memory model that the code
> > on CPU 1 can't run until after CPU 0 has executed a particular write, one
> > which is forced by some memory barrier to occur _after_ all the potential
> > overwrites its worried about?
>
> What Alan said on the difficulty!
>
> However, KCSAN has the advantage of not needing to specify the outcomes,
> which is much of the complexity.  For LKMM to do a good job of handling
> devices, we would need a model of each device(!).

For full models, like the formal LKMM, I agree it's extremely difficult!

KCSAN has the advantage that it's a "dynamic analysis" tool, relying on
merely instrumenting real code and running on the real HW. The real HW
is still in charge of generating interrupts, and real devices (like that
E1000 device, though in this case virtualized by QEMU) aren't in any way
abstracted or modeled. KCSAN's (and any other sanitizer's) primary goal
is to just _detect_ certain classes of bugs by making these detectable
via instrumentation, but otherwise run real code and HW.

Thus far, for KCSAN this has been trivial because all it does is keep an
eye on reads and writes, and observes if accesses race; and then, per
the rules for data races (it needs to know about plain and marked
accesses), it decides if something is a reportable data race. The real
HW is entirely in charge of when and if something executes concurrently.

One problem with instrumentation, however, is that it adds certain
overheads which make some effects of the hardware very unlikely to
observe. For example, the effects of weak memory. Therefore, I'm
teaching KCSAN a limited set of weak memory effects allowed by the LKMM
by pretending the current CPU reordered an access (currently just
"load/store buffering"). To avoid false positives, however, the tool now
has to know about memory barriers, otherwise it might simulate
reordering too aggressively.

Because KCSAN relies on compiler instrumentation, we are simply limited
to analyzing what is happening on CPUs; devices are invisible, and we
just observe what happens as a result on other CPUs if a device is
involved. The case with the E1000 and dma_wmb() came about because KCSAN
is now able to detect races between 2 CPUs, and dma_wmb() doesn't seem
to say anything about ordering among CPUs.

The main points are:

1. KCSAN doesn't need a model for devices because it's still running on
   real HW with real devices that are in charge of generating
   interrupts.

2. In the case with the E1000 driver, a real device causes CPU 1 to run
   the interrupt, which does a free of memory that might still be
   read/written to if CPU 0 reordered its accesses (simulated by
   KCSAN). That reordering can be inhibited by the right barrier, but we
   haven't found it in the code yet. At least the dma_wmb() isn't
   required to order the writes between the 2 CPUs (AFAIK).

> > > What would be more useful?
> > >
> > > 1. Let the architecture decide how they want KCSAN to instrument non-smp
> > >    barriers, given it's underspecified.
This means KCSAN would report > > > different races on different architectures, but keep the noise down. > > > > > > 2. Assume the weakest possible model, where non-smp barriers just do > > > nothing wrt other CPUs. > > > > I don't think either of those would work out very well. The problem > > isn't how you handle the non-smp barriers; the problem is how you > > describe to the memory model the way devices behave. > > There are some architecture-independent ordering guarantees for MMIO > which go something like this: > > 0. MMIO readX() and writeX() accesses to the same device are > implicitly ordered, whether relaxed or not. > > 1. Locking partitions non-relaxed MMIO accesses in the manner that > you would expect. For example, if CPU 0 does an MMIO write, > then releases a lock, and later CPU 1 acquires that same lock and > does an MMIO read, CPU 0's MMIO write is guaranteed to happen > before CPU 1's MMIO read. PowerPC has to jump through a few > hoops to make this happen. > > Relaxed MMIO accesses such as readb_relaxed() can be reordered > with locking primitives on some architectures. > > 2. smp_*() memory barriers are not guaranteed to affect MMIO > accesses, especially not in kernels built with CONFIG_SMP=n. > > 3. The mb() memory barrier is required to order prior MMIO > accesses against subsequent MMIO accesses. The wmb() and rmb() > memory barriers are required to order prior order prior MMIO > write/reads against later MMIO writes/reads, respectively. > These memory barriers also order normal memory accesses in > the same way as their smp_*() counterparts. > > 4. The mmiowb() memory barrier can be slightly weaker than wmb(), > as it is in ia64, but I have lost track of the details. > > 5. The dma_mb(), dma_rmb(), and dma_wmb() appear to be specific > to ARMv8. > > 6. Non-relaxed MMIO writeX() accesses force ordering of prior > normal memory writes before any DMA initiated by the writeX(). > > 7. 
Non-relaxed MMIO readX() accesses force ordering of later
> normal memory reads after any DMA whose completion is reported
> by the readX(). These readX() accesses are also ordered before
> any subsequent delay loops.
>
> Some more detail is available in memory-barriers.txt and in this LWN
> article: https://lwn.net/Articles/698014/
>
> I wish I could promise you that these are both fully up to date, but
> it is almost certain that updates are needed.

Thanks, that's useful. From what I can tell, most I/O ops and barriers have no effect on other CPUs (except for mb() etc.). For KCSAN, that's all that matters.

[...]

> > > Which might be an argument to make KCSAN's non-smp barrier
> > > instrumentation arch-dependent, because some drivers might in fact be
> > > written with some target architectures and their properties in mind. At
> > > least it would help keep the noise down, and those architectures that
> > > want to see such races certainly still could.
> > >
> > > Any preferences?
> >
> > I'm not a good person to ask; I have never used KCSAN. However...
> >
> > While some drivers are indeed written for particular architectures or
> > systems, I doubt that they rely very heavily on the special properties of
> > their target architectures/systems to avoid races. Rather, they rely on
> > the hardware to behave correctly, just as non-arch-specific drivers do.
> >
> > Furthermore, the kernel tries pretty hard to factor out arch-specific
> > synchronization mechanisms and related concepts into general-purpose
> > abstractions (in the way that smp_mb() is generally available but is
> > defined differently for different architectures, for example). Drivers
> > tend to rely on these abstractions rather than on the arch-specific
> > properties directly.
> >
> > In short, trying to make KCSAN's handling of device I/O into something
> > arch-specific doesn't seem (to me) like a particularly advantageous
> > approach. Other people are likely to have different opinions.
As explained above, KCSAN just instruments C code but still runs on real HW with real devices. All I'm trying to figure out is what the I/O ops and barriers say about making accesses visible to other CPUs, to avoid false positives.

However, from this discussion I'm starting to conclude that the E1000 race might in fact be allowed, although it is very unlikely. The main question I was trying to answer is "should such cases be reported or not?", since KCSAN's goal is not to model the system faithfully, but to detect bugs. Either way is possible, and I don't have a strong preference. I'm leaning towards "no assumptions, report everything" now, because the "access reordering" mode won't be enabled by default.

Thanks,
-- Marco

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-16 20:50 ` Paul E. McKenney 2021-08-17 12:14 ` Marco Elver @ 2021-08-17 12:28 ` Will Deacon 2021-08-17 13:27 ` Marco Elver 2021-08-17 13:53 ` Paul E. McKenney 1 sibling, 2 replies; 11+ messages in thread
From: Will Deacon @ 2021-08-17 12:28 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Alan Stern, Marco Elver, Boqun Feng, Andrea Parri, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel

Just on this bit...

On Mon, Aug 16, 2021 at 01:50:57PM -0700, Paul E. McKenney wrote:
> 5. The dma_mb(), dma_rmb(), and dma_wmb() appear to be specific
> to ARMv8.

These are useful on other architectures too! IIRC, they were added by x86 in the first place. They're designed to be used with dma_alloc_coherent() allocations where you're sharing something like a ring buffer with a device, and they guarantee accesses won't be reordered before they become visible to the device. They _also_ provide the same ordering to other CPUs.

I gave a talk at LPC about some of this, which might help (or might make things worse...):

https://www.youtube.com/watch?v=i6DayghhA8Q

Ignore the bits about mmiowb() as we got rid of that.

Will

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-17 12:28 ` Will Deacon @ 2021-08-17 13:27 ` Marco Elver 2021-08-17 13:53 ` Paul E. McKenney 1 sibling, 0 replies; 11+ messages in thread From: Marco Elver @ 2021-08-17 13:27 UTC (permalink / raw) To: Will Deacon Cc: Paul E. McKenney, Alan Stern, Boqun Feng, Andrea Parri, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel On Tue, 17 Aug 2021 at 14:28, Will Deacon <will@kernel.org> wrote: > Just on this bit... > > On Mon, Aug 16, 2021 at 01:50:57PM -0700, Paul E. McKenney wrote: > > 5. The dma_mb(), dma_rmb(), and dma_wmb() appear to be specific > > to ARMv8. > > These are useful on other architectures too! IIRC, they were added by x86 in > the first place. They're designed to be used with dma_alloc_coherent() > allocations where you're sharing something like a ring buffer with a device > and they guarantee accesses won't be reordered before they become visible > to the device. They _also_ provide the same ordering to other CPUs. Ah, good you pointed it out again. Re-reading memory-barriers.txt and it does also say these provide order for other CPUs... > I gave a talk at LPC about some of this, which might help (or might make > things worse...): > > https://www.youtube.com/watch?v=i6DayghhA8Q Nice, thank you! > Ignore the bits about mmiowb() as we got rid of that. > > Will ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-17 12:28 ` Will Deacon 2021-08-17 13:27 ` Marco Elver @ 2021-08-17 13:53 ` Paul E. McKenney 2021-08-18 11:39 ` Will Deacon 1 sibling, 1 reply; 11+ messages in thread From: Paul E. McKenney @ 2021-08-17 13:53 UTC (permalink / raw) To: Will Deacon Cc: Alan Stern, Marco Elver, Boqun Feng, Andrea Parri, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel On Tue, Aug 17, 2021 at 01:28:16PM +0100, Will Deacon wrote: > Just on this bit... > > On Mon, Aug 16, 2021 at 01:50:57PM -0700, Paul E. McKenney wrote: > > 5. The dma_mb(), dma_rmb(), and dma_wmb() appear to be specific > > to ARMv8. > > These are useful on other architectures too! IIRC, they were added by x86 in > the first place. They're designed to be used with dma_alloc_coherent() > allocations where you're sharing something like a ring buffer with a device > and they guarantee accesses won't be reordered before they become visible > to the device. They _also_ provide the same ordering to other CPUs. > > I gave a talk at LPC about some of this, which might help (or might make > things worse...): > > https://www.youtube.com/watch?v=i6DayghhA8Q The slides are here, correct? Nice summary and examples! https://elinux.org/images/a/a8/Uh-oh-Its-IO-Ordering-Will-Deacon-Arm.pdf And this is all I see for dma_mb(): arch/arm64/include/asm/barrier.h:#define dma_mb() dmb(osh) arch/arm64/include/asm/io.h:#define __iomb() dma_mb() And then for __iomb(): arch/arm64/include/asm/io.h:#define __iomb() dma_mb() drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c: __iomb(); But yes, dma_rmb() and dma_wmb() do look to have a few hundred uses between them, and not just within ARMv8. I gave up too soon, so thank you! > Ignore the bits about mmiowb() as we got rid of that. Should the leftovers in current mainline be replaced by wmb()? Or are patches to that effect on their way in somewhere? 
$ git grep 'mmiowb()'
arch/ia64/include/asm/mmiowb.h:#define mmiowb() ia64_mfa()
arch/ia64/include/asm/spinlock.h: mmiowb();
arch/mips/include/asm/mmiowb.h:#define mmiowb() iobarrier_w()
arch/mips/include/asm/spinlock.h: mmiowb();
arch/mips/kernel/gpio_txx9.c: mmiowb();
arch/mips/kernel/gpio_txx9.c: mmiowb();
arch/mips/kernel/gpio_txx9.c: mmiowb();
arch/mips/kernel/irq_txx9.c: mmiowb();
arch/mips/loongson2ef/common/bonito-irq.c: mmiowb();
arch/mips/loongson2ef/common/bonito-irq.c: mmiowb();
arch/mips/loongson2ef/common/mem.c: mmiowb();
arch/mips/loongson2ef/common/pm.c: mmiowb();
arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
arch/mips/pci/ops-bonito64.c: mmiowb();
arch/mips/pci/ops-loongson2.c: mmiowb();
arch/mips/txx9/generic/irq_tx4939.c: mmiowb();
arch/mips/txx9/generic/setup.c: mmiowb();
arch/mips/txx9/rbtx4927/irq.c: mmiowb();
arch/mips/txx9/rbtx4938/irq.c: mmiowb();
arch/mips/txx9/rbtx4938/irq.c: mmiowb();
arch/mips/txx9/rbtx4938/setup.c: mmiowb();
arch/mips/txx9/rbtx4939/irq.c: mmiowb();
arch/powerpc/include/asm/mmiowb.h:#define mmiowb() mb()
arch/riscv/include/asm/mmiowb.h:#define mmiowb() __asm__ __volatile__ ("fence o,w" : : : "memory");
arch/s390/include/asm/io.h:#define mmiowb() zpci_barrier()
arch/sh/include/asm/mmiowb.h:#define mmiowb() wmb()
arch/sh/include/asm/spinlock-llsc.h: mmiowb();
include/asm-generic/mmiowb.h: * Generic implementation of mmiowb() tracking for spinlocks.
include/asm-generic/mmiowb.h: * 1. Implement mmiowb() (and arch_mmiowb_state() if you're fancy)
include/asm-generic/mmiowb.h: mmiowb();

Thanx, Paul

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-17 13:53 ` Paul E. McKenney @ 2021-08-18 11:39 ` Will Deacon 2021-08-18 23:17 ` Paul E. McKenney 0 siblings, 1 reply; 11+ messages in thread From: Will Deacon @ 2021-08-18 11:39 UTC (permalink / raw) To: Paul E. McKenney Cc: Alan Stern, Marco Elver, Boqun Feng, Andrea Parri, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel Hi Paul. On Tue, Aug 17, 2021 at 06:53:08AM -0700, Paul E. McKenney wrote: > On Tue, Aug 17, 2021 at 01:28:16PM +0100, Will Deacon wrote: > > Just on this bit... > > > > On Mon, Aug 16, 2021 at 01:50:57PM -0700, Paul E. McKenney wrote: > > > 5. The dma_mb(), dma_rmb(), and dma_wmb() appear to be specific > > > to ARMv8. > > > > These are useful on other architectures too! IIRC, they were added by x86 in > > the first place. They're designed to be used with dma_alloc_coherent() > > allocations where you're sharing something like a ring buffer with a device > > and they guarantee accesses won't be reordered before they become visible > > to the device. They _also_ provide the same ordering to other CPUs. > > > > I gave a talk at LPC about some of this, which might help (or might make > > things worse...): > > > > https://www.youtube.com/watch?v=i6DayghhA8Q > > The slides are here, correct? Nice summary and examples! > > https://elinux.org/images/a/a8/Uh-oh-Its-IO-Ordering-Will-Deacon-Arm.pdf Yes, that looks like them. I've also put them up here: https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/slides/elce-2018.pdf (turns out it was ELCE not LPC!) > And this is all I see for dma_mb(): > > arch/arm64/include/asm/barrier.h:#define dma_mb() dmb(osh) > arch/arm64/include/asm/io.h:#define __iomb() dma_mb() > > And then for __iomb(): > > arch/arm64/include/asm/io.h:#define __iomb() dma_mb() > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c: __iomb(); > > But yes, dma_rmb() and dma_wmb() do look to have a few hundred uses > between them, and not just within ARMv8. 
I gave up too soon, so
> thank you!

No problem, and yes, dma_mb() is an arm64-internal thing which we should probably rename.

> > Ignore the bits about mmiowb() as we got rid of that.
>
> Should the leftovers in current mainline be replaced by wmb()? Or are
> patches to that effect on their way in somewhere?

I already got rid of the non-arch usage of mmiowb(), but I wasn't brave enough to change the arch code, as it may well be that they're relying on some specific instruction semantics.

Despite my earlier comment, mmiowb() still exists, but only as part of ARCH_HAS_MMIOWB, where it is used to add additional spinlock ordering so that the rest of the kernel doesn't need to use mmiowb() at all.

So I suppose for these:

> arch/mips/kernel/gpio_txx9.c: mmiowb();
> arch/mips/kernel/gpio_txx9.c: mmiowb();
> arch/mips/kernel/gpio_txx9.c: mmiowb();
> arch/mips/kernel/irq_txx9.c: mmiowb();
> arch/mips/loongson2ef/common/bonito-irq.c: mmiowb();
> arch/mips/loongson2ef/common/bonito-irq.c: mmiowb();
> arch/mips/loongson2ef/common/mem.c: mmiowb();
> arch/mips/loongson2ef/common/pm.c: mmiowb();
> arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
> arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
> arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
> arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
> arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb();
> arch/mips/pci/ops-bonito64.c: mmiowb();
> arch/mips/pci/ops-loongson2.c: mmiowb();
> arch/mips/txx9/generic/irq_tx4939.c: mmiowb();
> arch/mips/txx9/generic/setup.c: mmiowb();
> arch/mips/txx9/rbtx4927/irq.c: mmiowb();
> arch/mips/txx9/rbtx4938/irq.c: mmiowb();
> arch/mips/txx9/rbtx4938/irq.c: mmiowb();
> arch/mips/txx9/rbtx4938/setup.c: mmiowb();
> arch/mips/txx9/rbtx4939/irq.c: mmiowb();

we could replace mmiowb() with iobarrier_w().

Will

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LKMM: Read dependencies of writes ordered by dma_wmb()? 2021-08-18 11:39 ` Will Deacon @ 2021-08-18 23:17 ` Paul E. McKenney 0 siblings, 0 replies; 11+ messages in thread From: Paul E. McKenney @ 2021-08-18 23:17 UTC (permalink / raw) To: Will Deacon Cc: Alan Stern, Marco Elver, Boqun Feng, Andrea Parri, Mark Rutland, Dmitry Vyukov, kasan-dev, linux-kernel On Wed, Aug 18, 2021 at 12:39:36PM +0100, Will Deacon wrote: > Hi Paul. Hello, Will, > On Tue, Aug 17, 2021 at 06:53:08AM -0700, Paul E. McKenney wrote: > > On Tue, Aug 17, 2021 at 01:28:16PM +0100, Will Deacon wrote: [ . . . ] > > > Ignore the bits about mmiowb() as we got rid of that. > > > > Should the leftovers in current mainline be replaced by wmb()? Or are > > patches to that effect on their way in somewhere? > > I already got rid of the non-arch usage of mmiowb(), but I wasn't bravei > enough to change the arch code as it may well be that they're relying on > some specific instruction semantics. > > Despite my earlier comment, mmiowb() still exists, but only as a part of > ARCH_HAS_MMIOWB where it is used to add additional spinlock ordering so > that the rest of the kernel doesn't need to use mmiowb() at all. 
> > So I suppose for these: > > > arch/mips/kernel/gpio_txx9.c: mmiowb(); > > arch/mips/kernel/gpio_txx9.c: mmiowb(); > > arch/mips/kernel/gpio_txx9.c: mmiowb(); > > arch/mips/kernel/irq_txx9.c: mmiowb(); > > arch/mips/loongson2ef/common/bonito-irq.c: mmiowb(); > > arch/mips/loongson2ef/common/bonito-irq.c: mmiowb(); > > arch/mips/loongson2ef/common/mem.c: mmiowb(); > > arch/mips/loongson2ef/common/pm.c: mmiowb(); > > arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb(); > > arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb(); > > arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb(); > > arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb(); > > arch/mips/loongson2ef/lemote-2f/reset.c: mmiowb(); > > arch/mips/pci/ops-bonito64.c: mmiowb(); > > arch/mips/pci/ops-loongson2.c: mmiowb(); > > arch/mips/txx9/generic/irq_tx4939.c: mmiowb(); > > arch/mips/txx9/generic/setup.c: mmiowb(); > > arch/mips/txx9/rbtx4927/irq.c: mmiowb(); > > arch/mips/txx9/rbtx4938/irq.c: mmiowb(); > > arch/mips/txx9/rbtx4938/irq.c: mmiowb(); > > arch/mips/txx9/rbtx4938/setup.c: mmiowb(); > > arch/mips/txx9/rbtx4939/irq.c: mmiowb(); > > we could replace mmiowb() with iobarrier_w(). Not having MIPS hardware at my disposal, I will leave these to those who do. I would suggest adding iobarrier_*() to memory-barriers.txt, but they appear to be specific to MIPS and PowerPC. Thanx, Paul ^ permalink raw reply [flat|nested] 11+ messages in thread