qemu-devel.nongnu.org archive mirror
* PCI memory sync question (kvm,dpdk,e1000,packet stalled)
       [not found] <CAMmAVbWzrYWZBXwKxSd-f5SXmq6qP1ok8abvyKJhp3=REEaMPA@mail.gmail.com>
@ 2019-11-20 17:36 ` ASM
  2019-11-21 14:05   ` Stefan Hajnoczi
  0 siblings, 1 reply; 7+ messages in thread
From: ASM @ 2019-11-20 17:36 UTC (permalink / raw)
  To: qemu-devel

Hi folks!

I am trying to track down a problem with stalled packets (e1000, tap, KVM).
My investigation led to the following:
1. In flatview_write_continue() I see that e1000 writes the value 7
to the STATUS field.
2. The driver in the guest OS reads 7 from the STATUS field and
writes 0 back to it.
3. In flatview_write_continue(), with my debugging edits:
            memcpy(ptr, buf, l);
            new1=ptr[0xc];
            usleep(100);
            new2=ptr[0xc];
            invalidate_and_set_dirty(mr, addr1, l);
            new3=ptr[0xc];
printf("Old: %i, new1: %i, new2: %i, new3: %i\n", old, new1, new2, new3);

I see that the memory at the first read is 7, but after usleep() it is 0.
Do I understand correctly that this should not happen? Or does the RCU
lock allow multiple concurrent writers?

The problem is that QEMU (e1000) writes the value 7, the guest (DPDK
driver) reads 7 and, based on that, writes 0, yet (extremely rarely)
the STATUS value still ends up as 7 afterwards, so packet processing
stalls. This behavior is observed only with KVM; it does not occur
with TCG.

Please help with advice or ideas.

P.S. I filed a bug report earlier, with more detailed information, in
my broken English:
bugs.launchpad.net/qemu/+bug/1853123

---
Best regards,
Leonid Myravjev


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PCI memory sync question (kvm,dpdk,e1000,packet stalled)
  2019-11-20 17:36 ` PCI memory sync question (kvm,dpdk,e1000,packet stalled) ASM
@ 2019-11-21 14:05   ` Stefan Hajnoczi
  2019-11-27 12:39     ` ASM
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2019-11-21 14:05 UTC (permalink / raw)
  To: ASM; +Cc: Paolo Bonzini, qemu-devel, Michael S. Tsirkin


On Wed, Nov 20, 2019 at 08:36:32PM +0300, ASM wrote:
> I am trying to track down a problem with stalled packets (e1000, tap, KVM).
> My investigation led to the following:
> 1. In flatview_write_continue() I see that e1000 writes the value 7
> to the STATUS field.
> 2. The driver in the guest OS reads 7 from the STATUS field and
> writes 0 back to it.
> 3. In flatview_write_continue(), with my debugging edits:
>             memcpy(ptr, buf, l);
>             new1=ptr[0xc];
>             usleep(100);
>             new2=ptr[0xc];
>             invalidate_and_set_dirty(mr, addr1, l);
>             new3=ptr[0xc];
> printf("Old: %i, new1: %i, new2: %i, new3: %i\n", old, new1, new2, new3);
> 
> I see that the memory at the first read is 7, but after usleep() it is 0.
> Do I understand correctly that this should not happen? Or does the RCU
> lock allow multiple concurrent writers?
> 
> The problem is that QEMU (e1000) writes the value 7, the guest (DPDK
> driver) reads 7 and, based on that, writes 0, yet (extremely rarely)
> the STATUS value still ends up as 7 afterwards, so packet processing
> stalls. This behavior is observed only with KVM; it does not occur
> with TCG.
> 
> Please help with advice or ideas.

Hi Leonid,
Could you be seeing weird behavior with KVM due to MMIO write
coalescing?

  static void e1000_mmio_setup(E1000State *d)
  {
      int i;
      const uint32_t excluded_regs[] = {
          E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
          E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
      };

      memory_region_init_io(&d->mmio, OBJECT(d), &e1000_mmio_ops, d,
                            "e1000-mmio", PNPMMIO_SIZE);
      memory_region_add_coalescing(&d->mmio, 0, excluded_regs[0]);
      for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
          memory_region_add_coalescing(&d->mmio, excluded_regs[i] + 4,
                                       excluded_regs[i+1] - excluded_regs[i] - 4);
      memory_region_init_io(&d->io, OBJECT(d), &e1000_io_ops, d, "e1000-io", IOPORT_SIZE);
  }

MMIO write coalescing means that QEMU doesn't see the register writes
immediately.  Instead kvm.ko records them into a ring buffer and QEMU
processes the ring when the next ioctl(KVM_RUN) exit occurs.

See Linux Documentation/virt/kvm/api.txt "4.116
KVM_(UN)REGISTER_COALESCED_MMIO" for more details.

I don't really understand your printf debugging explanation.  It would
help to see the DPDK code and the exact printf() output.

Also, is DPDK accessing the e1000 device from more than 1 vCPU?

Stefan



* Re: PCI memory sync question (kvm,dpdk,e1000,packet stalled)
  2019-11-21 14:05   ` Stefan Hajnoczi
@ 2019-11-27 12:39     ` ASM
  2019-12-19 14:58       ` Stefan Hajnoczi
  0 siblings, 1 reply; 7+ messages in thread
From: ASM @ 2019-11-27 12:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini, dmitry.fleytman, qemu-devel, Michael S. Tsirkin

Stefan, thanks for answering.

When a packet is received, e1000 writes it to memory directly,
without any protection.
The memory address to write to is set by the DPDK driver: the driver
writes the base address of the descriptor ring to RDBA (RDBAH/RDBAL).

It turns out that the MMIO write coalescing (set up in
e1000_mmio_setup) does not, and cannot, protect the ring descriptors.
The area that would need protection can be any region of guest RAM,
and it only becomes known when the driver writes the RDBA registers.
(See the 82574 GbE Controller datasheet, "7.1.8 Receive Descriptor
Queue Structure".)

How can this memory be protected? As I understand it, e1000 would
have to track writes to RDBA and enable memory protection for that
region. But how to do that correctly?

Source code qemu:
hw/net/e1000.c:954 (version master)

 954         base = rx_desc_base(s) + sizeof(desc) * s->mac_reg[RDH];
where rx_desc_base() is derived from the RDBAH/RDBAL registers.
Accesses to that address are not protected in any way.
...
955         pci_dma_read(d, base, &desc, sizeof(desc));
...
957         desc.status |= (vlan_status | E1000_RXD_STAT_DD);
...
990         pci_dma_write(d, base, &desc, sizeof(desc));
->
exec.c:
3111 static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
3112                                            MemTxAttrs attrs,
3113                                            const uint8_t *buf,
3114                                            hwaddr len, hwaddr addr1,
3115                                            hwaddr l, MemoryRegion *mr)
3116 {
...
3123         if (!memory_access_is_direct(mr, true)) {
(false)
3131         } else {
3132             /* RAM case */
3133             ptr = qemu_ram_ptr_length(mr->ram_block, addr1, &l, false);
3134             memcpy(ptr, buf, l);

this memcpy is where I see the weird behavior with KVM

3135             invalidate_and_set_dirty(mr, addr1, l);
3136         }
3137

Source code dpdk(e1000): (version dpdk-stable-17.11.9)
drivers/net/e1000/em_rxtx.c:

699 uint16_t
700 eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
701                 uint16_t nb_pkts)
...
718         rxq = rx_queue;
...
722         rx_id = rxq->rx_tail;
723         rx_ring = rxq->rx_ring;
...
734                 rxdp = &rx_ring[rx_id];
735                 status = rxdp->status;
736                 if (! (status & E1000_RXD_STAT_DD))
737                         break;
...
807                 rxdp->buffer_addr = dma_addr;
808                 rxdp->status = 0;
this status write is where I see the weird behavior with KVM


P.S.
> Also, is DPDK accessing the e1000 device from more than 1 vCPU?
All tests run on a single virtual CPU.

I created a GitHub project that quickly reproduces this error:
https://github.com/BASM/qemu_dpdk_e1000_test

---
Best regards,
Leonid Myravjev

On Thu, 21 Nov 2019 at 17:05, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 08:36:32PM +0300, ASM wrote:
> > I am trying to track down a problem with stalled packets (e1000, tap, KVM).
> > My investigation led to the following:
> > 1. In flatview_write_continue() I see that e1000 writes the value 7
> > to the STATUS field.
> > 2. The driver in the guest OS reads 7 from the STATUS field and
> > writes 0 back to it.
> > 3. In flatview_write_continue(), with my debugging edits:
> >             memcpy(ptr, buf, l);
> >             new1=ptr[0xc];
> >             usleep(100);
> >             new2=ptr[0xc];
> >             invalidate_and_set_dirty(mr, addr1, l);
> >             new3=ptr[0xc];
> > printf("Old: %i, new1: %i, new2: %i, new3: %i\n", old, new1, new2, new3);
> >
> > I see that the memory at the first read is 7, but after usleep() it is 0.
> > Do I understand correctly that this should not happen? Or does the RCU
> > lock allow multiple concurrent writers?
> >
> > The problem is that QEMU (e1000) writes the value 7, the guest (DPDK
> > driver) reads 7 and, based on that, writes 0, yet (extremely rarely)
> > the STATUS value still ends up as 7 afterwards, so packet processing
> > stalls. This behavior is observed only with KVM; it does not occur
> > with TCG.
> >
> > Please help with advice or ideas.
>
> Hi Leonid,
> Could you be seeing weird behavior with KVM due to MMIO write
> coalescing?
>
>   static void e1000_mmio_setup(E1000State *d)
>   {
>       int i;
>       const uint32_t excluded_regs[] = {
>           E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
>           E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
>       };
>
>       memory_region_init_io(&d->mmio, OBJECT(d), &e1000_mmio_ops, d,
>                             "e1000-mmio", PNPMMIO_SIZE);
>       memory_region_add_coalescing(&d->mmio, 0, excluded_regs[0]);
>       for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
>           memory_region_add_coalescing(&d->mmio, excluded_regs[i] + 4,
>                                        excluded_regs[i+1] - excluded_regs[i] - 4);
>       memory_region_init_io(&d->io, OBJECT(d), &e1000_io_ops, d, "e1000-io", IOPORT_SIZE);
>   }
>
> MMIO write coalescing means that QEMU doesn't see the register writes
> immediately.  Instead kvm.ko records them into a ring buffer and QEMU
> processes the ring when the next ioctl(KVM_RUN) exit occurs.
>
> See Linux Documentation/virt/kvm/api.txt "4.116
> KVM_(UN)REGISTER_COALESCED_MMIO" for more details.
>
> I don't really understand your printf debugging explanation.  It would
> help to see the DPDK code and the exact printf() output.
>
> Also, is DPDK accessing the e1000 device from more than 1 vCPU?
>
> Stefan



* Re: PCI memory sync question (kvm,dpdk,e1000,packet stalled)
  2019-11-27 12:39     ` ASM
@ 2019-12-19 14:58       ` Stefan Hajnoczi
  2019-12-30 10:10         ` ASM
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2019-12-19 14:58 UTC (permalink / raw)
  To: ASM; +Cc: Paolo Bonzini, dmitry.fleytman, qemu-devel, Michael S. Tsirkin


On Wed, Nov 27, 2019 at 03:39:03PM +0300, ASM wrote:
> When a packet is received, e1000 writes it to memory directly,
> without any protection.
> The memory address to write to is set by the DPDK driver: the driver
> writes the base address of the descriptor ring to RDBA (RDBAH/RDBAL).
> 
> It turns out that the MMIO write coalescing (set up in
> e1000_mmio_setup) does not, and cannot, protect the ring descriptors.
> The area that would need protection can be any region of guest RAM,
> and it only becomes known when the driver writes the RDBA registers.
> (See the 82574 GbE Controller datasheet, "7.1.8 Receive Descriptor
> Queue Structure".)
> 
> How can this memory be protected? As I understand it, e1000 would
> have to track writes to RDBA and enable memory protection for that
> region. But how to do that correctly?

I misunderstood the issue and you can probably ignore my comments about
coalesced MMIO.  You quoted descriptor DMA code below so coalesced MMIO
shouldn't be relevant since desc->status isn't an MMIO register.

> 
> Source code qemu:
> hw/net/e1000.c:954 (version master)
> 
>  954         base = rx_desc_base(s) + sizeof(desc) * s->mac_reg[RDH];
> where rx_desc_base() is derived from the RDBAH/RDBAL registers.
> Accesses to that address are not protected in any way.
> ...
> 955         pci_dma_read(d, base, &desc, sizeof(desc));
> ...
> 957         desc.status |= (vlan_status | E1000_RXD_STAT_DD);
> ...
> 990         pci_dma_write(d, base, &desc, sizeof(desc));
> ->
> exec.c:
> 3111 static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
> 3112                                            MemTxAttrs attrs,
> 3113                                            const uint8_t *buf,
> 3114                                            hwaddr len, hwaddr addr1,
> 3115                                            hwaddr l, MemoryRegion *mr)
> 3116 {
> ...
> 3123         if (!memory_access_is_direct(mr, true)) {
> (false)
> 3131         } else {
> 3132             /* RAM case */
> 3133             ptr = qemu_ram_ptr_length(mr->ram_block, addr1, &l, false);
> 3134             memcpy(ptr, buf, l);
> 
> this memcpy is where I see the weird behavior with KVM
> 
> 3135             invalidate_and_set_dirty(mr, addr1, l);
> 3136         }
> 3137
> 
> Source code dpdk(e1000): (version dpdk-stable-17.11.9)
> drivers/net/e1000/em_rxtx.c:
> 
> 699 uint16_t
> 700 eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> 701                 uint16_t nb_pkts)
> ...
> 718         rxq = rx_queue;
> ...
> 722         rx_id = rxq->rx_tail;
> 723         rx_ring = rxq->rx_ring;
> ...
> 734                 rxdp = &rx_ring[rx_id];
> 735                 status = rxdp->status;
> 736                 if (! (status & E1000_RXD_STAT_DD))
> 737                         break;
> ...
> 807                 rxdp->buffer_addr = dma_addr;
> 808                 rxdp->status = 0;
> this status write is where I see the weird behavior with KVM

It could be a bug in QEMU's e1000 emulation - maybe it's not doing
things in the correct order and causes a race condition with the DPDK
polling driver - or it could be a bug in the DPDK e1000 driver regarding
the order in which the descriptor ring and RX Head/Tail MMIO registers
are updated.

Did you find the root cause?

> P.S.
> > Also, is DPDK accessing the e1000 device from more than 1 vCPU?
>  All tests run on a single virtual CPU.
> 
> I created a GitHub project that quickly reproduces this error:
> https://github.com/BASM/qemu_dpdk_e1000_test
> 
> ---
> Best regards,
> Leonid Myravjev
> 
> On Thu, 21 Nov 2019 at 17:05, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Wed, Nov 20, 2019 at 08:36:32PM +0300, ASM wrote:
> > > I am trying to track down a problem with stalled packets (e1000, tap, KVM).
> > > My investigation led to the following:
> > > 1. In flatview_write_continue() I see that e1000 writes the value 7
> > > to the STATUS field.
> > > 2. The driver in the guest OS reads 7 from the STATUS field and
> > > writes 0 back to it.
> > > 3. In flatview_write_continue(), with my debugging edits:
> > >             memcpy(ptr, buf, l);
> > >             new1=ptr[0xc];
> > >             usleep(100);
> > >             new2=ptr[0xc];
> > >             invalidate_and_set_dirty(mr, addr1, l);
> > >             new3=ptr[0xc];
> > > printf("Old: %i, new1: %i, new2: %i, new3: %i\n", old, new1, new2, new3);
> > >
> > > I see that the memory at the first read is 7, but after usleep() it is 0.
> > > Do I understand correctly that this should not happen? Or does the RCU
> > > lock allow multiple concurrent writers?
> > >
> > > The problem is that QEMU (e1000) writes the value 7, the guest (DPDK
> > > driver) reads 7 and, based on that, writes 0, yet (extremely rarely)
> > > the STATUS value still ends up as 7 afterwards, so packet processing
> > > stalls. This behavior is observed only with KVM; it does not occur
> > > with TCG.
> > >
> > > Please help with advice or ideas.
> >
> > Hi Leonid,
> > Could you be seeing weird behavior with KVM due to MMIO write
> > coalescing?
> >
> >   static void e1000_mmio_setup(E1000State *d)
> >   {
> >       int i;
> >       const uint32_t excluded_regs[] = {
> >           E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
> >           E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
> >       };
> >
> >       memory_region_init_io(&d->mmio, OBJECT(d), &e1000_mmio_ops, d,
> >                             "e1000-mmio", PNPMMIO_SIZE);
> >       memory_region_add_coalescing(&d->mmio, 0, excluded_regs[0]);
> >       for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
> >           memory_region_add_coalescing(&d->mmio, excluded_regs[i] + 4,
> >                                        excluded_regs[i+1] - excluded_regs[i] - 4);
> >       memory_region_init_io(&d->io, OBJECT(d), &e1000_io_ops, d, "e1000-io", IOPORT_SIZE);
> >   }
> >
> > MMIO write coalescing means that QEMU doesn't see the register writes
> > immediately.  Instead kvm.ko records them into a ring buffer and QEMU
> > processes the ring when the next ioctl(KVM_RUN) exit occurs.
> >
> > See Linux Documentation/virt/kvm/api.txt "4.116
> > KVM_(UN)REGISTER_COALESCED_MMIO" for more details.
> >
> > I don't really understand your printf debugging explanation.  It would
> > help to see the DPDK code and the exact printf() output.
> >
> > Also, is DPDK accessing the e1000 device from more than 1 vCPU?
> >
> > Stefan



* Re: PCI memory sync question (kvm,dpdk,e1000,packet stalled)
  2019-12-19 14:58       ` Stefan Hajnoczi
@ 2019-12-30 10:10         ` ASM
  2020-01-02 11:05           ` Stefan Hajnoczi
  0 siblings, 1 reply; 7+ messages in thread
From: ASM @ 2019-12-30 10:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini, dmitry.fleytman, qemu-devel, Michael S. Tsirkin

> It could be a bug in QEMU's e1000 emulation - maybe it's not doing
> things in the correct order and causes a race condition with the DPDK
> polling driver - or it could be a bug in the DPDK e1000 driver regarding
> the order in which the descriptor ring and RX Head/Tail MMIO registers
> are updated.


What I have understood:
* The DPDK and kernel drivers work with the ring in a similar way:
they do not look at the Head register, they check STATUS.
This is a bit strange, but completely correct driver behavior: if the
driver writes a value to memory, it expects that value to stay
written. The problem is definitely not in the DPDK or kernel driver.
* The problem appears on KVM but does not appear on TCG.
* It looks most like a bug in QEMU's e1000 emulation: the emulation
reads and writes the same memory, at the same time, as the DPDK
driver.

As I understand it, KVM does not tolerate this kind of unsynchronized
access to shared memory. It seems we would need to protect (RCU) all
the STATUS fields of all the buffers. There can be many buffers, and
they can be scattered throughout memory.

>
> Did you find the root cause?

I think so, see above, but I don't understand how to fix it.

For those who are interested in this problem, I made a project that
easily reproduces this error:
https://github.com/BASM/qemu_dpdk_e1000_test

Unfortunately, I don't think I can fix it on my own without help.

---
Best regards,
Leonid Myravjev



* Re: PCI memory sync question (kvm,dpdk,e1000,packet stalled)
  2019-12-30 10:10         ` ASM
@ 2020-01-02 11:05           ` Stefan Hajnoczi
  2022-05-23 12:47             ` Ding Hui
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2020-01-02 11:05 UTC (permalink / raw)
  To: ASM; +Cc: Paolo Bonzini, dmitry.fleytman, qemu-devel, Michael S. Tsirkin


On Mon, Dec 30, 2019 at 01:10:15PM +0300, ASM wrote:
> > It could be a bug in QEMU's e1000 emulation - maybe it's not doing
> > things in the correct order and causes a race condition with the DPDK
> > polling driver - or it could be a bug in the DPDK e1000 driver regarding
> > the order in which the descriptor ring and RX Head/Tail MMIO registers
> > are updated.
> 
> 
> What I have understood:
> * The DPDK and kernel drivers work with the ring in a similar way:
> they do not look at the Head register, they check STATUS.
> This is a bit strange, but completely correct driver behavior: if the
> driver writes a value to memory, it expects that value to stay
> written. The problem is definitely not in the DPDK or kernel driver.
> * The problem appears on KVM but does not appear on TCG.
> * It looks most like a bug in QEMU's e1000 emulation: the emulation
> reads and writes the same memory, at the same time, as the DPDK
> driver.
> 
> As I understand it, KVM does not tolerate this kind of unsynchronized
> access to shared memory. It seems we would need to protect (RCU) all
> the STATUS fields of all the buffers. There can be many buffers, and
> they can be scattered throughout memory.

I don't agree with this reasoning because these descriptor rings are
designed to support simultaneous access by the driver and the device (if
done with care!).  What QEMU and the driver are attempting is typical.

The important thing for safe shared memory access is that both the
driver and the device know who is allowed to write to a memory location
at a point in time.  As long as both the driver and device don't write
to the same location it is possible to safely use the data structure.

The typical pattern that a driver or device uses to publish information
is:

  1. Fill in all fields but don't set the STATUS bit yet.
  2. Memory barrier or other memory ordering directive to ensure that
     fields have been written.  This operation might be implicit.
  3. Set the STATUS bit in a separate operation.

On the read side there should be:

  1. Load the STATUS field and check the bit.  If it's clear then try
     again.
  2. Memory barrier or other memory ordering directive to ensure that
     fields have been written.  This operation might be implicit.
  3. Load all the fields.

Can you explain why the STATUS fields need to be protected?  My
understanding is that the STATUS field is owned by the device at this
point in time.  The device writes to the STATUS field and the driver
polls it - this is safe.

I think the issue is bugs in the code.  The DPDK code looks questionable
(https://git.dpdk.org/dpdk/tree/drivers/net/e1000/em_rxtx.c#n715):

  volatile struct e1000_rx_desc *rxdp;
  ...
  rxdp = &rx_ring[rx_id];
  status = rxdp->status;
  if (! (status & E1000_RXD_STAT_DD))
      break;
  rxd = *rxdp;

Although there is a conditional branch on rxdp->status, the CPU may load
*rxdp before loading rxdp->status.  These pointers are volatile but
that's not enough to prevent the CPU from reordering memory accesses
(the compiler emits regular load instructions without memory ordering
directives).

This is probably not the bug that you're seeing, but it suggests there
are more problems.

Stefan



* Re: PCI memory sync question (kvm,dpdk,e1000,packet stalled)
  2020-01-02 11:05           ` Stefan Hajnoczi
@ 2022-05-23 12:47             ` Ding Hui
  0 siblings, 0 replies; 7+ messages in thread
From: Ding Hui @ 2022-05-23 12:47 UTC (permalink / raw)
  To: Stefan Hajnoczi, ASM
  Cc: qemu-devel, Michael S. Tsirkin, Paolo Bonzini, dmitry.fleytman

Hi ASM

I think I am hitting almost exactly the same issue as ASM (QEMU e1000
with tap, guest running DPDK), in our own environment rather than via
your GitHub project.

It seems that the descriptor (not only the status field; we also
traced buffer_addr and csum) suddenly reverts to an old value after
the guest DPDK sets status to 0.

So I googled the issue and found the bug at
https://bugs.launchpad.net/qemu/+bug/1853123 and this mailing-list
thread.

Can you tell me what your workaround is, and did you find the root
cause?

Thanks a lot

On 2020/1/2 19:05, Stefan Hajnoczi wrote:
> On Mon, Dec 30, 2019 at 01:10:15PM +0300, ASM wrote:
>>> It could be a bug in QEMU's e1000 emulation - maybe it's not doing
>>> things in the correct order and causes a race condition with the DPDK
>>> polling driver - or it could be a bug in the DPDK e1000 driver regarding
>>> the order in which the descriptor ring and RX Head/Tail MMIO registers
>>> are updated.
>>
>>
>> What I have understood:
>> * The DPDK and kernel drivers work with the ring in a similar way:
>> they do not look at the Head register, they check STATUS.
>> This is a bit strange, but completely correct driver behavior: if the
>> driver writes a value to memory, it expects that value to stay
>> written. The problem is definitely not in the DPDK or kernel driver.
>> * The problem appears on KVM but does not appear on TCG.
>> * It looks most like a bug in QEMU's e1000 emulation: the emulation
>> reads and writes the same memory, at the same time, as the DPDK
>> driver.
>>
>> As I understand it, KVM does not tolerate this kind of unsynchronized
>> access to shared memory. It seems we would need to protect (RCU) all
>> the STATUS fields of all the buffers. There can be many buffers, and
>> they can be scattered throughout memory.
>> scattered throughout the memory.
> 
> I don't agree with this reasoning because these descriptor rings are
> designed to support simultaneous access by the driver and the device (if
> done with care!).  What QEMU and the driver are attempting is typical.
> 
> The important thing for safe shared memory access is that both the
> driver and the device know who is allowed to write to a memory location
> at a point in time.  As long as both the driver and device don't write
> to the same location it is possible to safely use the data structure.
> 
> The typical pattern that a driver or device uses to publish information
> is:
> 
>    1. Fill in all fields but don't set the STATUS bit yet.
>    2. Memory barrier or other memory ordering directive to ensure that
>       fields have been written.  This operation might be implicit.
>    3. Set the STATUS bit in a separate operation.
> 
> On the read side there should be:
> 
>    1. Load the STATUS field and check the bit.  If it's clear then try
>       again.
>    2. Memory barrier or other memory ordering directive to ensure that
>       fields have been written.  This operation might be implicit.
>    3. Load all the fields.
> 
> Can you explain why the STATUS fields need to be protected?  My
> understanding is that the STATUS field is owned by the device at this
> point in time.  The device writes to the STATUS field and the driver
> polls it - this is safe.
> 
> I think the issue is bugs in the code.  The DPDK code looks questionable
> (https://git.dpdk.org/dpdk/tree/drivers/net/e1000/em_rxtx.c#n715):
> 
>    volatile struct e1000_rx_desc *rxdp;
>    ...
>    rxdp = &rx_ring[rx_id];
>    status = rxdp->status;
>    if (! (status & E1000_RXD_STAT_DD))
>        break;
>    rxd = *rxdp;
> 
> Although there is a conditional branch on rxdp->status, the CPU may load
> *rxdp before loading rxdp->status.  These pointers are volatile but
> that's not enough to prevent the CPU from reordering memory accesses
> (the compiler emits regular load instructions without memory ordering
> directives).
> 
> This is probably not the bug that you're seeing, but it suggests there
> are more problems.
> 
> Stefan
> 


-- 
Thanks,
- Ding Hui



end of thread, other threads:[~2022-05-23 13:01 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAMmAVbWzrYWZBXwKxSd-f5SXmq6qP1ok8abvyKJhp3=REEaMPA@mail.gmail.com>
2019-11-20 17:36 ` PCI memory sync question (kvm,dpdk,e1000,packet stalled) ASM
2019-11-21 14:05   ` Stefan Hajnoczi
2019-11-27 12:39     ` ASM
2019-12-19 14:58       ` Stefan Hajnoczi
2019-12-30 10:10         ` ASM
2020-01-02 11:05           ` Stefan Hajnoczi
2022-05-23 12:47             ` Ding Hui
