On Mon, Mar 22, 2021 at 04:55:13PM +0100, Eugenio Perez Martin wrote: > On Mon, Mar 22, 2021 at 11:51 AM Stefan Hajnoczi wrote: > > > > On Thu, Mar 11, 2021 at 07:53:53PM +0100, Eugenio Perez Martin wrote: > > > On Fri, Jan 22, 2021 at 7:18 PM Eugenio Perez Martin > > > wrote: > > > > > > > > On Thu, Dec 10, 2020 at 12:55 PM Stefan Hajnoczi wrote: > > > > > > > > > > On Wed, Dec 09, 2020 at 07:41:23PM +0100, Eugenio Perez Martin wrote: > > > > > > On Tue, Dec 8, 2020 at 9:16 AM Stefan Hajnoczi wrote: > > > > > > > On Fri, Nov 20, 2020 at 07:50:51PM +0100, Eugenio Pérez wrote: > > > > > > > > + while (true) { > > > > > > > > + int r; > > > > > > > > + if (virtio_queue_full(vq)) { > > > > > > > > + break; > > > > > > > > + } > > > > > > > > > > > > > > Why is this check necessary? The guest cannot provide more descriptors > > > > > > > than there is ring space. If that happens somehow then it's a driver > > > > > > > error that is already reported in virtqueue_pop() below. > > > > > > > > > > > > > > > > > > > It's just checked because virtqueue_pop prints an error on that case, > > > > > > and there is no way to tell the difference between a regular error and > > > > > > another caused by other causes. Maybe the right thing to do is just to > > > > > > not to print that error? Caller should do the error printing in that > > > > > > case. Should we return an error code? > > > > > > > > > > The reason an error is printed today is because it's a guest error that > > > > > never happens with correct guest drivers. Something is broken and the > > > > > user should know about it. > > > > > > > > > > Why is "virtio_queue_full" (I already forgot what that actually means, > > > > > it's not clear whether this is referring to avail elements or used > > > > > elements) a condition that should be silently ignored in shadow vqs? > > > > > > > > > > > > > TL;DR: It can be changed to a check of the number of available > > > > descriptors in shadow vq, instead of returning as a regular operation. > > > > However, I think that making it a special return of virtqueue_pop > > > > could help in devices that run to completion, avoiding having to > > > > duplicate the count logic in them. > > > > > > > > The function virtio_queue_empty checks if the vq has all descriptors > > > > available, so the device has no more work to do until the driver makes > > > > another descriptor available. I can see how it can be a bad name > > > > choice, but virtio_queue_full means the opposite: device has pop() > > > > every descriptor available, and it has not returned any, so the driver > > > > cannot progress until the device marks some descriptors as used. > > > > > > > > As I understand, if vq->in_use >vq->num would mean we have a bug in > > > > the device vq code, not in the driver. virtio_queue_full could even be > > > > changed to "assert(vq->inuse <= vq->vring.num); return vq->inuse == > > > > vq->vring.num", as long as vq->in_use is operated right. > > > > > > > > If we hit vq->in_use == vq->num in virtqueue_pop it means the device > > > > tried to pop() one more buffer after having all of them available and > > > > pop'ed. This would be invalid if the device is counting right the > > > > number of in_use descriptors, but then we are duplicating that logic > > > > in the device and the vq. > > > > Devices call virtqueue_pop() until it returns NULL. They don't need to > > count virtqueue buffers explicitly. It returns NULL when vq->num > > virtqueue buffers have already been popped (either because > > virtio_queue_empty() is true or because an invalid driver state is > > detected by checking vq->num in virtqueue_pop()). > > If I understood you right, the virtio_queue_full addresses the reverse > problem: it controls when the virtqueue is out of buffers to make > available for the device because the latter has not consumed any, not > when the driver does not offer more buffers to the device because it > has no more data to offer. > > I find it easier to explain with the virtio-net rx queue (and I think > it's the easier way to trigger this issue). You are describing it's > regular behavior: The guest fills it (let's say 100%), and the device > picks buffers one by one: > > virtio_net_receive_rcu: > while (offset < size) { > elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement)); The lines before this loop return early when the virtqueue does not have sufficient buffer space: if (!virtio_net_has_buffers(q, size + n->guest_hdr_len - n->host_hdr_len)) { return 0; } When entering this loop we know that we can pop the buffers needed to fill one rx packet. > if (!elem) { > virtio_error("unexpected empty queue"); > } > /* [1] */ > /* fill elem with rx packet */ > virtqueue_fill(virtqueue, elem); > ... > virtqueue_flush(q->rx_vq, i); > } > > Every device as far as I know does this buffer by buffer, there is > just processing code in [1], and it never tries to pop more than one > buffers/chain of buffers at the same time. In the case of a queue > empty (no more available buffers), we hit an error, because there are > no more buffers to write. It's an error because we already checked that the virtqueue has buffer space. This should never happen. > In other devices (or tx queue), empty > buffers means there is no more work to do, not an error. > > In the case of shadow virtqueue, we cannot limit the number of exposed > rx buffers to 1 buffer/chain of buffers in [1], since it will affect > batching. We have the opposite problem: All devices (but rx queue) > want to queue "as empty as possible", or "to mark all available > buffers empty". Net rx queue is ok as long as it has a buffer/buffer > chain big enough to write to, but it will fetch them on demand, so > "queue full" (as in all buffers are available) is not a problem for > the device. > > However, the part of the shadow virtqueue that forwards the available > buffer seeks the opposite: It wants as many buffers as possible to be > available. That means that there is no [1] code that fills/read & > flush/detach the buffer immediately: Shadow virtqueue wants to make > available as many buffers as possible, but the device may not use them > until it has more data available. To the extreme (virtio-net rx queue > full), shadow virtqueue may make available all buffers, so in a > while(true) loop, it will try to make them available until it hits > that all the buffers are already available (vq->in_use == vq->num). > > The solution is to check the number of buffers already available > before calling virtio_queue_pop(). We could duplicate in_use in shadow > virtqueue, of course, but everything we need is already done in > VirtQueue code, so I think to reuse it is a better solution. Another > solution could be to treat vq->in_use == vq->num as an special return > code with no printed error in virtqueue_pop, but to expose if the > queue is full (as vq->in_use == vq->num) sounds less invasive to me. > > > > > > > In shadow vq this situation happens with the correct guest network > > > > driver, since the rx queue is filled for the device to write. Network > > > > device in qemu fetch descriptors on demand, but shadow vq fetch all > > > > available in batching. If the driver just happens to fill the queue of > > > > available descriptors, the log will raise, so we need to check in > > > > handle_sw_lm_vq before calling pop(). Of course the shadow vq can > > > > duplicate guest_vq->in_use instead of checking virtio_queue_full, but > > > > then it needs to check two things for every virtqueue_pop() [1]. > > > > I don't understand this scenario. It sounds like you are saying the > > guest and shadow rx vq are not in sync so there is a case where > > vq->in_use > vq->num is triggered? > > Sorry if I explain it bad, what I meant is that there is a case where > SVQ (as device code) will call virtqueue_pop() when vq->in_use == > vq->num. virtio_queue_full maintains the check as >=, I think it > should be safe to even to code virtio_queue_full to: > > assert(vq->in_use > vq->num); > return vq->inuse == vq->num; > > Please let me know if this is not clear enough. I don't get it. When virtqueue_split_pop() has popped all requests virtio_queue_empty_rcu() should return true and we shouldn't reach if (vq->inuse >= vq->vring.num). The guest driver cannot submit more available buffers at this point. I only checked split rings, not packed rings. Can you point to the SVQ code which has this problem? It may be easier to re-read the code than try to describe it in an email. Stefan