On Tue, Oct 04, 2016 at 12:24:02PM +0000, gong lei wrote: > On 2016/10/4 17:05, Stefan Hajnoczi wrote: > > On Tue, Oct 04, 2016 at 02:53:25AM +0000, gong lei wrote: > >> Hi Stefan, > >> > >> Thanks for your comments. > >> > >>> On Wed, Sep 28, 2016 at 05:08:24PM +0800, Gonglei wrote: > >>>> /+For scatter/gather list support, a buffer can be represented by / > >>>> /virtio_crypto_iovec structure./ > >>>> /+/ > >>>> /+The structure is defined as follows:/ > >>>> /+/ > >>>> /+\begin{lstlisting}/ > >>>> /+struct virtio_crypto_iovec {/ > >>>> /+ /* Guest physical address *// > >>>> /+ le64 addr;/ > >>>> /+ /* Length of guest physical address *// > >>>> /+ le32 len;/ > >>>> /+/* This marks a buffer as continuing via the next field *// > >>>> /+#define VIRTIO_CRYPTO_IOVEC_F_NEXT 1/ > >>>> /+ /* The flags as indicated above VIRTIO_CRYPTO_IOVEC_F_*. *// > >>>> /+ le32 flags;/ > >>>> /+ /* Pointer to next struct virtio_crypto_iovec if flags & NEXT *// > >>>> /+ le64 next_iovec;/ > >>>> /+};/ > >>>> /+\end{lstlisting}/ > >>> The vring already provides scatter-gather I/O. It is usually not > >>> necessary to define scatter-gather I/O at the device level. Addresses > >>> can be translated by the virtio transport (PCI, MMIO, CCW). For example > >>> PCI bus addresses with IO-MMU. Therefore it's messy to use guest > >>> physical addresses in the device specification. > >>> > >>>> /+Each data request uses virtio_crypto_hash_data_req structure to > >>>> store / > >>>> /information/ > >>>> /+used to run the HASH operations. The request only occupies one entry/ > >>>> /+in the Vring Descriptor Table in the virtio crypto device's dataq, > >>>> which / > >>>> /improves/ > >>>> /+the throughput of data transmitted for the HASH service, so that the > >>> virtio / > >>>> /crypto/ > >>>> /+device can be better accelerated./ > >>> Indirect vrings also only require one entry in the descriptor table. I > >>> don't understand why you are reinventing scatter-gather I/O. > >>> > >>> Am I missing something? > >> Yes, I knew indirect vring. But for virtio-crypto device' request, each > >> crypto request include > >> many buffers. Take algorithm chain' request as an examle, the driver > >> need to transmit source > >> data, dstination data, initializaion vector, additional authentication > >> data, digest result > >> data etc. to the device. In those data, the source data and destionation > >> data etc. may be composed > >> by scatter-gather I/O. That's the background. > >> > >> In virtio-crypto spec, we use a structure to store all those contents so > >> that we can put all those data > >> into one entry in the Descriptor Table, otherwise the effect of > >> acceleration is not good. We > >> thought other methods to inprove the performance as well, such as > >> increasing the virng size > >> of dataq, but the maxinum is 1024 at present, and it maybe influence the > >> latency if the vring > >> size is too big. > >> > >> For the indirect ving in existing Virtio spec, is only used for one > >> buffer, which can't cover > >> our situation. > > Indirect vring uses 1 descriptor in the vring descriptor table, but that > > descriptor points to a separate table that can have many descriptors. > > You are not limited to just 1 scatter-gather element. > > > > Also, I've noticed that the structs in the spec are mixing > > device-readable and device-writable fields in structs. This is not > > allowed since virtio requires that all device-readable data comes before > > all all device-writable data. > The spec 2.4.5 says something related, but not forced. > > A device MUST NOT write to a device-readable buffer, and a device SHOULD > NOT read a device-writable > buffer (it MAY do so for debugging or diagnostic purposes). Please see: 3.2.1.1 Placing Buffers Into The Descriptor Table A buffer consists of zero or more read-only physically-contiguous elements followed by zero or more physically-contiguous write-only elements (it must have at least one element). You must lay out the request (read-only) and response (write-only) parts in order. It's not possible to have read-only elements after a write-only element. > > I think you'll be able to use indirect vring with the same performance > > as customer scatter-gather I/O once you think through the layout of > > device-readable and device-writable data. > > > > In order to lay out requests correctly the device-readable headers > > struct must contain the length of all variable-sized fields. > > > > For example, the header would have an iv_len field and the device will > > expect that number of bytes after the header struct: > > > > Device-readable: > > [header] > > [iv] > > [input data] > > > > Device-writeable: > > [output data] > > [return/error values] > > > > One more thing to keep in mind is that according to the VIRTIO > > specification the device must not rely on the framing (or boundaries of > > scatter-gather elements). In the example above it means that the [iv] > > could either be a single vring_desc or it could be multiple descs, or it > > could share vring_descs with [header] and/or [input data]. > > > > In other words, the device emulation code must not assume elem->in[0] is > > [header], elem->in[1] is [iv], and elem->in[2] is [input data]. It has > > to process without giving special meaning to scatter-gather buffer size. > > > > Stefan > > Yes, I knew that and I did that last year, but I didn't get the good > performance unfortunately. > Because we need to handle multiple buffers respectively on one request, > which add the overhead > of software layer. It's obviously for vhost-user cryptodev backend in DPDK. > > Maybe I didn't express my meaning clearly. Let's assume that a have 6 > buffers > in one crypto request: > > BufA: output data, a single buffer > BufB: output data, a scatter-gather I/O > BufC: output data, a single buffer > BufD: in data, a scatter-gather I/O > BufE: in data, a single buffer > BufF: in data, a single buffer > > We need following steps to translate the request to the device: > > 1. add_outbuf() for BufA, which will occupy one entry in the descryptor > table > 2.add_outbuf() for BufB, which will occupy one entry in the descryptor > table if using > indirect vring, or multiple entries if not using indirecting vring > 3. add_outbuf() for BufC, which will occupy one entry in the descryptor > table > 4. add_inbuf() for BufD, which will occupy one entry in the descryptor > table if using > indirect vring, or multiple entries if not using indirecting vring > 5. add_inbuf() for BufE, which will occupy one entry in the descryptor table > 6. add_outbuf() for BufF, which will occupy one entry in the descryptor > table > > It means that one request will occupy 6 entries at least. Am I right? No. If these 6 buffers belong to 1 request then you only need a single indirect descriptor in the vring. The indirect descriptor table will have: [header] [BufA] [BufB1] [BufB2] [BufB...] [BufBN] [BufC] [BufD1] [BufD2] [BufD...] [BufDN] [BufE] [BufF] [footer] Stefan