* RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
@ 2012-12-18 14:31 Konrad Rzeszutek Wilk
  2012-12-18 14:49 ` Jan Beulich
  2013-01-18 16:00 ` Roger Pau Monné
  0 siblings, 2 replies; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-12-18 14:31 UTC (permalink / raw)
  To: xen-devel, martin.petersen, felipe.franciosi, matthew, axboe

[-- Attachment #1: Type: text/plain, Size: 12656 bytes --]

Hey,

I am including some folks that are not always on Xen-devel to see if they have
some extra ideas or can correct my misunderstandings.

This is very much RFC - so there are bound to be some bugs.
The original is here
https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit
in case one wishes to modify/provide comment on that one.


There are outstanding issues we have now with the block protocol:
Note: I am assuming a 64-bit guest/host - as the sizes of the structures
change on 32-bit. I am also attaching the header for the blkif ring
as of today.

A) The segment count is limited to 11 pages per request. It means we can at most
squeeze 44kB into one request. The ring can hold 32 (next power of two below 36)
requests, meaning we can have at most ~1.4MB of I/O outstanding.
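
For reference, the arithmetic behind those numbers (assuming 4KB pages):

    11 segments/request * 4096 bytes  =   45056 bytes  (~44kB per request)
    32 requests/ring    * 45056 bytes = 1441792 bytes  (~1.4MB outstanding I/O)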

B). The producer and consumer indexes are on the same cache line. On present hardware
that means the reader and writer will compete for the same cache line, causing a
ping-pong between sockets.

C). The requests and responses are on the same ring. This again causes
cache-line ping-pong between sockets as ownership of the cache line shifts
between them.

D). Cache alignment. Currently the protocol is 16-bit aligned. This is awkward
as the requests and responses sometimes fit within a cache line and sometimes
straddle it.

E). Interrupt mitigation. We currently do a kick whenever we are
done “processing” the ring. There are better ways to do this - we
could use existing network interrupt-mitigation techniques to make the
code poll when there is a lot of data.

F). Latency. The request format limits us to 44kB per request - which means
that a 1MB chunk of data - which on contemporary devices would need only one
I/O request - gets split into multiple ‘requests’, inadvertently delaying the
processing of said block.

G) Future extensions. DIF/DIX for integrity. There might
be others in the future and it would be good to leave space for extra
flags TBD.

H). Separate the response and request rings. The current
implementation has one thread for one block ring. There is no reason why
there could not be two threads - one for responses and one for requests -
especially if they are scheduled on different CPUs. Furthermore this
could also be split into multiple queues - so two queues (response and request)
on each vCPU.

I). We waste a lot of space on the ring - as we use the
ring for both requests and responses. The response structure needs to
occupy the same amount of space as the request structure (112 bytes). If
the request structure is expanded to be able to fit more segments (say
the ‘struct blkif_sring_entry’ is expanded to ~1500 bytes) that still
requires us to have a matching-size response structure. We do not need
to use that much space for one response. Having a separate response ring
would simplify the structures.

J). 32 bit vs 64 bit. Right now the size
of the request structure is 112 bytes under 64-bit guest and 102 bytes
under 32-bit guest. It is confusing and furthermore requires the host
to do extra accounting and processing.

The crude drawing displays the memory that the ring occupies in offsets of
64 bytes (one cache line). Of course future CPUs could have different cache
lines (say 32 bytes?) - which would skew this drawing. A 32-bit ring is
a bit different as the ‘struct blkif_sring_entry’ is 102 bytes.


Problem A) has two proposed solutions (look at
http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
details). One, proposed by Justin from Spectralogic, is to negotiate
the number of segments. This means that the ‘struct blkif_sring_entry’
becomes variable-sized. It can expand from 112 bytes (covering 11 pages of
data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple
extension: the segment array in the request grows from 11 entries to a
negotiated, variable size.


The math is as follows.


        struct blkif_request_segment {
                uint32_t grant;                  /* 4 bytes */
                uint8_t  first_sect, last_sect;  /* 1 + 1 = 6 bytes total */
        };
(6 bytes for each segment) - the above structure is in an array of size
11 in the request. The ‘struct blkif_sring_entry’ is 112 bytes. The
change is to expand the array - in this example we would tack on 245 extra
‘struct blkif_request_segment’ entries: 245*6 + 112 = 1582. If we were to
use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages).
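
Roughly, the negotiated variant might look like this (a sketch only - the
structure name and the ‘nr_segments’ field are made up for illustration,
they are not part of any agreed protocol):

struct blkif_request_var {
        uint8_t        operation;     /* BLKIF_OP_*                           */
        uint8_t        nr_segments;   /* number of entries in seg[], up to    */
                                      /* the value negotiated over XenBus     */
        blkif_vdev_t   handle;
        uint64_t       id;            /* private guest value, echoed in resp  */
        blkif_sector_t sector_number;
        struct blkif_request_segment seg[];  /* 11..N entries, N negotiated   */
};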


The other solution (from Intel - Ronghui) was to create one extra
ring that only holds ‘struct blkif_request_segment’ entries. The
‘struct blkif_request’ would be changed to carry an index into said
‘segment ring’. There is only one segment ring. This means that the
size of the initial ring stays the same. Each request would point
into the segment ring and enumerate how many of the indexes it wants to
use. The limit is of course the size of the segment ring. If one assumes a
one-page segment ring this means the whole request ring can cover ~4MB. The
math is as follows:


The first request uses half of the segment ring - so index 0 up
to 341 (out of 682). Each entry in the segment ring is a ‘struct
blkif_request_segment’ so it occupies 6 bytes. The other requests on
the ring (so there are 35 left) can use either the remaining 341 indexes
of the segment ring or use the old style request. The old style request
can address up to 44kB. For example:


 sring[0]->[uses 0->341 indexes in the segment ring] = 342*4096 = 1400832
 sring[1]->[uses the old style request] = 11*4096 = 45056
 sring[2]->[uses 342->682 indexes in the segment ring] = 1392640
 sring[3..32] -> [uses the old style request] = 29*4096*11 = 1306624


Total: 4145152 bytes. Naturally this could be extended with a bigger
segment ring to cover more.





The problem with this extension is that each entry uses 6 bytes and ends up
straddling a cache line. Using 8 bytes would fix the cache straddling, but
would mean fitting only 512 segments per page.
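
To make the Intel/Ronghui idea concrete, a sketch of what the two pieces
might look like (names are invented for illustration; the real layout would
have to be negotiated):

/* Entry in the separate, shared ‘segment ring’ - padded to 8 bytes so it
 * never straddles a cache line (512 entries per page). */
struct blkif_segring_entry {
        grant_ref_t gref;
        uint8_t     first_sect, last_sect;
        uint16_t    _pad;
};

/* The request on the main ring then only references a range of entries. */
struct blkif_request_segring {
        uint8_t        operation;     /* BLKIF_OP_*                           */
        blkif_vdev_t   handle;
        uint64_t       id;            /* private guest value, echoed in resp  */
        blkif_sector_t sector_number;
        uint32_t       seg_index;     /* first index used in the segment ring */
        uint32_t       nr_segments;   /* how many consecutive entries used    */
};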


There is yet another mechanism that could be employed - and it borrows
from the VirtIO protocol: the ‘indirect descriptors’. This is
very similar to what Intel suggests, but with a twist.


We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT) and keep the ‘struct
blkif_sring’ as is (each entry can be up to 112 bytes if needed - so the
old style request would still fit). The new request would look like:


/* So 64 bytes under 64-bit. If necessary, the array (seg) can be
 * expanded to fit 11 segments as the old style request did. */
struct blkif_request_indirect {
        uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE)   */ /* 1 */
        blkif_vdev_t   handle;       /* only for read/write requests         */ /* 2 */
#ifdef CONFIG_X86_64
        uint32_t       _pad1;        /* offsetof(blkif_request, u.rw.id) == 8 */
#endif
        uint64_t       id;           /* private guest value, echoed in resp  */
        grant_ref_t    gref;         /* reference to indirect buffer frame, if used */
        struct blkif_request_segment_aligned seg[4]; /* each is 8 bytes      */
} __attribute__((__packed__));


struct blkif_request {
        uint8_t        operation;    /* BLKIF_OP_???                         */
        union {
                struct blkif_request_rw rw;
                struct blkif_request_indirect indirect;
                /* ... other request types ... */
        } u;
} __attribute__((__packed__));




The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
etc operation would now be in indirect.op. The indirect.gref points to
a page that is filled with:


struct blkif_request_indirect_entry {
        blkif_sector_t sector_number;
        struct blkif_request_segment seg;
} __attribute__((__packed__));
/* 16 bytes, so we can fit 256 of these structures in a page. */


This means that with the existing 36 slots in the ring (single page)
we can cover: 32 slots * (256 * 4096 bytes per blkif_request_indirect)
~= 32MB. If we don’t want to use indirect descriptors we can still address
up to 4 pages per request (as the structure has enough space to contain four
segments and will still be cache-aligned).
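
To make the flow concrete, here is a rough backend-side sketch of walking one
such indirect request. The helpers and the entry count are placeholders - a
real implementation would use the usual grant-mapping machinery and carry the
number of used entries in the request itself:

/* Pseudo-code sketch only. */
static void process_indirect_request(struct blkif_request *req)
{
        struct blkif_request_indirect_entry *entries;
        unsigned int i, nr_entries = 256;    /* one page of 16-byte entries */

        /* Map the page that the frontend filled with indirect entries. */
        entries = map_indirect_page(req->u.indirect.gref);

        for (i = 0; i < nr_entries; i++) {
                /* Each entry carries its own sector number and segment, so
                 * the backend can assemble one large I/O out of the lot.  */
                queue_segment(entries[i].sector_number,
                              entries[i].seg.gref,
                              entries[i].seg.first_sect,
                              entries[i].seg.last_sect);
        }

        unmap_indirect_page(entries);
}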




B). Both the producer and consumer (req_* and rsp_*) values are in the same
cache line. This means that we end up with the same cache line being
modified by two different guests. Depending on the architecture and
placement of the guests this could be bad - as each logical CPU would
try to write and read from the same cache line. A mechanism where
the req_* and rsp_* values are separated onto different cache lines
could be used. The correct cache-line size and alignment could
be negotiated via XenBus - in case future technologies start using 128
bytes of cache or such. Or the producer and consumer indexes could live in
separate rings. Meaning we have a ‘request ring’ and a ‘response
ring’ - and only ‘req_prod’ and ‘req_event’ are modified in
the ‘request ring’. The opposite (rsp_*) are only modified in the
‘response ring’.
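
One possible way to picture that split (purely illustrative - the exact sizes
and names would have to be negotiated via XenBus):

/* Request ring page: only the frontend writes the producer index, only the
 * backend writes the event threshold - and they sit on separate cache lines. */
struct blkif_request_ring {
        uint32_t req_prod;
        uint8_t  _pad1[60];            /* pad to a 64-byte cache line */
        uint32_t req_event;
        uint8_t  _pad2[60];
        struct blkif_request ring[];   /* requests only               */
};

/* Response ring page: the mirror image, owned by the backend. */
struct blkif_response_ring {
        uint32_t rsp_prod;
        uint8_t  _pad1[60];
        uint32_t rsp_event;
        uint8_t  _pad2[60];
        struct blkif_response ring[];  /* responses only              */
};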


C). Similar to the B) problem but with a bigger payload. Each
‘blkif_sring_entry’ occupies 112 bytes, which does not lend itself
to a nice cache-line size. If the indirect descriptors are to be used
for everything we could ‘slim down’ the blkif_request/response to
be at most 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST
to 5, as that would slim the largest of the structures down to 64 bytes.
Naturally this means negotiating the new size of the structure via XenBus.
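
If we do slim the structures down, a compile-time guard keeps them honest
(just a sketch of the usual negative-array-size trick):

/* Fails to compile if the request ever grows past one 64-byte cache line. */
typedef char blkif_request_fits_one_cacheline
        [sizeof(struct blkif_request) <= 64 ? 1 : -1];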


D). The first picture shows the problem. We are now aligning everything
on the wrong cache lines. Worse, in ⅓ of the cases we straddle
three cache lines. We could negotiate a proper alignment for each
request/response structure.


E). The network stack has shown that going into a polling mode does improve
performance. The current mechanism of kicking the guest and/or block
backend is not always clear.  [TODO: Konrad to explain it in details]


F). For big I/Os that the backend devices could handle in one go, the current
block protocol ends up doing extra work by splitting the I/O into smaller
chunks and then reassembling them. With the solutions outlined in A) this can
be fixed. This is easily seen with 1MB I/Os. Since each request can
only handle 44kB that means we have to split a 1MB I/O into 24 requests
(24 * 4096 * 11 = 1081344). Then the backend ends up sending them on in
sector sizes - which with contemporary devices (such as SSDs) ends up with
more processing. The SSDs are comfortable handling 128kB or larger I/Os
in one go.


G). DIF/DIX. This is a protocol to carry extra ‘checksum’ information
for each I/O. The I/O can be a sector size, a page size or a full I/O size
(most popular is 1MB). DIF/DIX needs 8 bytes of information for
each I/O. It would be worth considering putting/reserving that amount of
space in each request/response. Putting in extra flags for future
extensions would also be worth it - however the author is not aware of any
right now.
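
As a sketch of what reserving that space could look like (all field names
here are invented for illustration):

struct blkif_request_extra {
        uint32_t flags;          /* future extension flags, zero for now  */
        uint32_t _reserved;
        uint64_t integrity;      /* e.g. 8 bytes of DIF/DIX data per I/O  */
};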


H). Separate response/request rings. Potentially even multi-queue, per-vCPU
queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was
flawed. There is no similar concept in the storage world where the
operating system can put a foot down and say: “everything before this
has to be on the disk.” There are lighter versions of this - called
‘FUA’ and ‘FLUSH’. Depending on the internal implementation
of the storage they are either ignored or do the right thing. The
filesystems determine the viability of these flags and change their writing
tactics accordingly. From a protocol level, this means that the
WRITE/READ/SYNC requests can be intermixed - the storage by itself
determines the order of the operations. The filesystem is the one that
determines whether a WRITE should come with a FLUSH to preserve some form
of atomicity. This means we do not have to preserve an order of operations
- so we can have multiple queues for requests and responses. This has been
shown in the network world to improve performance considerably.


I). Wastage from having responses and requests on the same ring. Currently each
response MUST occupy the same amount of space that the request occupies - as
the ring can hold both responses and requests. Separating the request and
response rings would remove the wastage.


J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring
entries is different if the guest is in 32-bit or 64-bit mode. Cleaning
this up to be the same size would save considerable accounting that the
host has to do (extra memcpy for each response/request).

[-- Attachment #2: blkif_sring_struct.png --]
[-- Type: image/png, Size: 31610 bytes --]

[-- Attachment #3: blkif_segment_struct.png --]
[-- Type: image/png, Size: 19027 bytes --]

[-- Attachment #4: blkif.h --]
[-- Type: text/plain, Size: 8279 bytes --]

/******************************************************************************
 * blkif.h
 *
 * Unified block-device I/O interface for Xen guest OSes.
 *
 * Copyright (c) 2003-2004, Keir Fraser
 */

#ifndef __XEN_PUBLIC_IO_BLKIF_H__
#define __XEN_PUBLIC_IO_BLKIF_H__

#include <xen/interface/io/ring.h>
#include <xen/interface/grant_table.h>

/*
 * Front->back notifications: When enqueuing a new request, sending a
 * notification can be made conditional on req_event (i.e., the generic
 * hold-off mechanism provided by the ring macros). Backends must set
 * req_event appropriately (e.g., using RING_FINAL_CHECK_FOR_REQUESTS()).
 *
 * Back->front notifications: When enqueuing a new response, sending a
 * notification can be made conditional on rsp_event (i.e., the generic
 * hold-off mechanism provided by the ring macros). Frontends must set
 * rsp_event appropriately (e.g., using RING_FINAL_CHECK_FOR_RESPONSES()).
 */

typedef uint16_t blkif_vdev_t;
typedef uint64_t blkif_sector_t;

/*
 * REQUEST CODES.
 */
#define BLKIF_OP_READ              0
#define BLKIF_OP_WRITE             1
/*
 * Recognised only if "feature-barrier" is present in backend xenbus info.
 * The "feature_barrier" node contains a boolean indicating whether barrier
 * requests are likely to succeed or fail. Either way, a barrier request
 * may fail at any time with BLKIF_RSP_EOPNOTSUPP if it is unsupported by
 * the underlying block-device hardware. The boolean simply indicates whether
 * or not it is worthwhile for the frontend to attempt barrier requests.
 * If a backend does not recognise BLKIF_OP_WRITE_BARRIER, it should *not*
 * create the "feature-barrier" node!
 */
#define BLKIF_OP_WRITE_BARRIER     2

/*
 * Recognised if "feature-flush-cache" is present in backend xenbus
 * info.  A flush will ask the underlying storage hardware to flush its
 * non-volatile caches as appropriate.  The "feature-flush-cache" node
 * contains a boolean indicating whether flush requests are likely to
 * succeed or fail. Either way, a flush request may fail at any time
 * with BLKIF_RSP_EOPNOTSUPP if it is unsupported by the underlying
 * block-device hardware. The boolean simply indicates whether or not it
 * is worthwhile for the frontend to attempt flushes.  If a backend does
 * not recognise BLKIF_OP_WRITE_FLUSH_CACHE, it should *not* create the
 * "feature-flush-cache" node!
 */
#define BLKIF_OP_FLUSH_DISKCACHE   3

/*
 * Recognised only if "feature-discard" is present in backend xenbus info.
 * The "feature-discard" node contains a boolean indicating whether trim
 * (ATA) or unmap (SCSI) - conveniently called discard - requests are likely
 * to succeed or fail. Either way, a discard request
 * may fail at any time with BLKIF_RSP_EOPNOTSUPP if it is unsupported by
 * the underlying block-device hardware. The boolean simply indicates whether
 * or not it is worthwhile for the frontend to attempt discard requests.
 * If a backend does not recognise BLKIF_OP_DISCARD, it should *not*
 * create the "feature-discard" node!
 *
 * Discard operation is a request for the underlying block device to mark
 * extents to be erased. However, discard does not guarantee that the blocks
 * will be erased from the device - it is just a hint to the device
 * controller that these blocks are no longer in use. What the device
 * controller does with that information is left to the controller.
 * Discard operations are passed with sector_number as the
 * sector index to begin discard operations at and nr_sectors as the number of
 * sectors to be discarded. The specified sectors should be discarded if the
 * underlying block device supports trim (ATA) or unmap (SCSI) operations,
 * or a BLKIF_RSP_EOPNOTSUPP  should be returned.
 * More information about trim/unmap operations at:
 * http://t13.org/Documents/UploadedDocuments/docs2008/
 *     e07154r6-Data_Set_Management_Proposal_for_ATA-ACS2.doc
 * http://www.seagate.com/staticfiles/support/disc/manuals/
 *     Interface%20manuals/100293068c.pdf
 * The backend can optionally provide three extra XenBus attributes to
 * further optimize the discard functionality:
 * 'discard-alignment' - Devices that support discard functionality may
 * internally allocate space in units that are bigger than the exported
 * logical block size. The discard-alignment parameter indicates how many bytes
 * the beginning of the partition is offset from the internal allocation unit's
 * natural alignment.
 * 'discard-granularity'  - Devices that support discard functionality may
 * internally allocate space using units that are bigger than the logical block
 * size. The discard-granularity parameter indicates the size of the internal
 * allocation unit in bytes if reported by the device. Otherwise the
 * discard-granularity will be set to match the device's physical block size.
 * 'discard-secure' - All copies of the discarded sectors (potentially created
 * by garbage collection) must also be erased.  To use this feature, the flag
 * BLKIF_DISCARD_SECURE must be set in the blkif_request_trim.
 */
#define BLKIF_OP_DISCARD           5

/*
 * Maximum scatter/gather segments per request.
 * This is carefully chosen so that sizeof(struct blkif_ring) <= PAGE_SIZE.
 * NB. This could be 12 if the ring indexes weren't stored in the same page.
 */
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11

struct blkif_request_rw {
	uint8_t        nr_segments;  /* number of segments                   */
	blkif_vdev_t   handle;       /* only for read/write requests         */
#ifdef CONFIG_X86_64
	uint32_t       _pad1;	     /* offsetof(blkif_request,u.rw.id) == 8 */
#endif
	uint64_t       id;           /* private guest value, echoed in resp  */
	blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
	struct blkif_request_segment {
		grant_ref_t gref;        /* reference to I/O buffer frame        */
		/* @first_sect: first sector in frame to transfer (inclusive).   */
		/* @last_sect: last sector in frame to transfer (inclusive).     */
		uint8_t     first_sect, last_sect;
	} seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} __attribute__((__packed__));

struct blkif_request_discard {
	uint8_t        flag;         /* BLKIF_DISCARD_SECURE or zero.        */
#define BLKIF_DISCARD_SECURE (1<<0)  /* ignored if discard-secure=0          */
	blkif_vdev_t   _pad1;        /* only for read/write requests         */
#ifdef CONFIG_X86_64
	uint32_t       _pad2;        /* offsetof(blkif_req..,u.discard.id)==8*/
#endif
	uint64_t       id;           /* private guest value, echoed in resp  */
	blkif_sector_t sector_number;
	uint64_t       nr_sectors;
	uint8_t        _pad3;
} __attribute__((__packed__));

struct blkif_request {
	uint8_t        operation;    /* BLKIF_OP_???                         */
	union {
		struct blkif_request_rw rw;
		struct blkif_request_discard discard;
	} u;
} __attribute__((__packed__));

struct blkif_response {
	uint64_t        id;              /* copied from request */
	uint8_t         operation;       /* copied from request */
	int16_t         status;          /* BLKIF_RSP_???       */
};

/*
 * STATUS RETURN CODES.
 */
 /* Operation not supported (only happens on barrier writes). */
#define BLKIF_RSP_EOPNOTSUPP  -2
 /* Operation failed for some unspecified reason (-EIO). */
#define BLKIF_RSP_ERROR       -1
 /* Operation completed successfully. */
#define BLKIF_RSP_OKAY         0

/*
 * Generate blkif ring structures and types.
 */

DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response);

#define VDISK_CDROM        0x1
#define VDISK_REMOVABLE    0x2
#define VDISK_READONLY     0x4

/* Xen-defined major numbers for virtual disks, they look strangely
 * familiar */
#define XEN_IDE0_MAJOR	3
#define XEN_IDE1_MAJOR	22
#define XEN_SCSI_DISK0_MAJOR	8
#define XEN_SCSI_DISK1_MAJOR	65
#define XEN_SCSI_DISK2_MAJOR	66
#define XEN_SCSI_DISK3_MAJOR	67
#define XEN_SCSI_DISK4_MAJOR	68
#define XEN_SCSI_DISK5_MAJOR	69
#define XEN_SCSI_DISK6_MAJOR	70
#define XEN_SCSI_DISK7_MAJOR	71
#define XEN_SCSI_DISK8_MAJOR	128
#define XEN_SCSI_DISK9_MAJOR	129
#define XEN_SCSI_DISK10_MAJOR	130
#define XEN_SCSI_DISK11_MAJOR	131
#define XEN_SCSI_DISK12_MAJOR	132
#define XEN_SCSI_DISK13_MAJOR	133
#define XEN_SCSI_DISK14_MAJOR	134
#define XEN_SCSI_DISK15_MAJOR	135

#endif /* __XEN_PUBLIC_IO_BLKIF_H__ */



* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2012-12-18 14:31 RFC v1: Xen block protocol overhaul - problem statement (with pictures!) Konrad Rzeszutek Wilk
@ 2012-12-18 14:49 ` Jan Beulich
  2013-01-18 16:00 ` Roger Pau Monné
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2012-12-18 14:49 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, matthew, felipe.franciosi, martin.petersen, xen-devel

>>> On 18.12.12 at 15:31, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> The A) has two solutions to this (look at
> http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
> details). One proposed by Justin from Spectralogic has to negotiate
> the segment size. This means that the ‘struct blkif_sring_entry’
> is now a variable size. It can expand from 112 bytes (cover 11 pages of
> data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple
> extension by just making the array in the request expand from 11 to a
> variable size negotiated.

Iirc this extension still limits the number of segments per request
to 255 (as the total number must be specified in the request,
which only has an 8-bit field for that purpose).

That's one of the reasons why for vSCSI I didn't go that route,
but rather allowed the frontend to submit segment information
prior to the actual request (with the trailing segment pieces, if
any, being in the request itself).

Jan



* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2012-12-18 14:31 RFC v1: Xen block protocol overhaul - problem statement (with pictures!) Konrad Rzeszutek Wilk
  2012-12-18 14:49 ` Jan Beulich
@ 2013-01-18 16:00 ` Roger Pau Monné
  2013-01-18 18:20   ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 19+ messages in thread
From: Roger Pau Monné @ 2013-01-18 16:00 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew

On 18/12/12 15:31, Konrad Rzeszutek Wilk wrote:
> Hey,
> 
> I am including some folks that are not always on Xen-devel to see if they have
> some extra ideas or can correct my misunderstandings.
> 
> This is very much RFC - so there is bound to be some bugs.
> The original is here
> https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit
> in case one wishes to modify/provide comment on that one.

Hello Konrad,

Thanks for the problem description and possible solutions, I'm
going to take this on (provided that you are OK with it).

> 
> There are outstanding issues we have now with the block protocol:
> Note: I am assuming 64-bit guest/host - as the size’s of the structures
> change on 32-bit. I am also attaching the header for the blkif ring
> as of today.
> 
> A) Segment size is limited to 11 pages. It means we can at most squeeze
> in 44kB per request. The ring can hold 32 (next power of two below 36)
> requests, meaning we can do 1.4M of outstanding requests.

I've been looking at the different options here and I think the most
suitable one would be using a mechanism similar to VirtIO "indirect
descriptors"; it is the one that allows for more in-flight requests
while being quite simple to implement.

> B). Producer and consumer index is on the same cache line. In present hardware
> that means the reader and writer will compete for the same cacheline causing a
> ping-pong between sockets.
> 
> C). The requests and responses are on the same ring. This again causes the
> ping-pong between sockets as the ownership of the cache line will shift between
> sockets.
> 
> D). Cache alignment. Currently the protocol is 16-bit aligned. This is awkward
> as the request and responses sometimes fit within a cacheline or sometimes
> straddle them.
> 
> E). Interrupt mitigation. We are currently doing a kick whenever we are
> done “processing” the ring. There are better ways to do this - and
> we could use existing network interrupt mitigation techniques to make the
> code poll when there is a lot of data.
> 
> F). Latency. The processing of the request limits us to only do 44kB - which means
> that a 1MB chunk of data - which on contemporary devices would only use I/O request
> - would be split up in multiple ‘requests’ inadvertently delaying the processing of
> said block.
> 
> G) Future extensions. DIF/DIX for integrity. There might
> be other in the future and it would be good to leave space for extra
> flags TBD.
> 
> H). Separate the response and request rings. The current
> implementation has one thread for one block ring. There is no reason why
> there could not be two threads - one for responses and one for requests -
> and especially if they are scheduled on different CPUs. Furthermore this
> could also be split in multi-queues - so two queues (response and request)
> on each vCPU.
> 
> I). We waste a lot of space on the ring - as we use the
> ring for both requests and responses. The response structure needs to
> occupy the same amount of space as the request structure (112 bytes). If
> the request structure is expanded to be able to fit more segments (say
> the ‘struct blkif_sring_entry is expanded to ~1500 bytes) that still
> requires us to have a matching size response structure. We do not need
> to use that much space for one response. Having a separate response ring
> would simplify the structures.
> 
> J). 32 bit vs 64 bit. Right now the size
> of the request structure is 112 bytes under 64-bit guest and 102 bytes
> under 32-bit guest. It is confusing and furthermore requires the host
> to do extra accounting and processing.
> 
> The crude drawing displays memory that the ring occupies in offset of
> 64 bytes (cache line). Of course future CPUs could have different cache
> lines (say 32 bytes?)- which would skew this drawing. A 32-bit ring is
> a bit different as the ‘struct blkif_sring_entry’ is of 102 bytes.
> 
> 
> The A) has two solutions to this (look at
> http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
> details). One proposed by Justin from Spectralogic has to negotiate
> the segment size. This means that the ‘struct blkif_sring_entry’
> is now a variable size. It can expand from 112 bytes (cover 11 pages of
> data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple
> extension by just making the array in the request expand from 11 to a
> variable size negotiated.
> 
> 
> The math is as follow.
> 
> 
>         struct blkif_request_segment {
>                 uint32_t grant;                  /* 4 bytes */
>                 uint8_t  first_sect, last_sect;  /* 1 + 1 = 6 bytes total */
>         };
> (6 bytes for each segment) - the above structure is in an array of size
> 11 in the request. The ‘struct blkif_sring_entry’ is 112 bytes. The
> change is to expand the array - in this example we would tack on 245 extra
> ‘struct blkif_request_segment’ - 245*6 + 112 = 1582. If we were to
> use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages).
> 
> 
> The other solution (from Intel - Ronghui) was to create one extra
> ring that only has the ‘struct blkif_request_segment’ in them. The
> ‘struct blkif_request’ would be changed to have an index in said
> ‘segment ring’. There is only one segment ring. This means that the
> size of the initial ring is still the same. The requests would point
> to the segment and enumerate out how many of the indexes it wants to
> use. The limit is of course the size of the segment. If one assumes a
> one-page segment this means we can in one request cover ~4MB. The math
> is as follow:
> 
> 
> First request uses the half of the segment ring - so index 0 up
> to 341 (out of 682). Each entry in the segment ring is a ‘struct
> blkif_request_segment’ so it occupies 6 bytes. The other requests on
> the ring (so there are 35 left) can use either the remaining 341 indexes
> of the sgement ring or use the old style request. The old style request
> can address use up to 44kB. For example:
> 
> 
>  sring[0]->[uses 0->341 indexes in the segment ring] = 342*4096 = 1400832
>  sring[1]->[use sthe old style request] = 11*4096 = 45056
>  sring[2]->[uses 342->682 indexes in the segment ring] = 1392640
>  sring[3..32] -> [uses the old style request] = 29*4096*11 = 1306624
> 
> 
> Total: 4145152 bytes Naturally this could be extended to have a bigger
> segment ring to cover more.
> 
> 
> 
> 
> 
> The problem with this extension is that we use 6 bytes and end up
> straddling a cache line. Using 8 bytes will fix the cache straddling. This
> would mean fitting only 512 segments per page.
> 
> 
> There is yet another mechanism that could be employed  - and it borrows
> from VirtIO protocol. And that is the ‘indirect descriptors’. This
> very similar to what Intel suggests, but with a twist.
> 
> 
> We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT), and the ‘struct
> blkif_sring’ (each entry can be up to 112 bytes if needed - so the
> old style request would fit). It would look like:
> 
> 
> /* So 64 bytes under 64-bit. If necessary, the array (seg) can be
>  * expanded to fit 11 segments as the old style request did. */
> struct blkif_request_indirect {
>         uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE)   */ /* 1 */
>         blkif_vdev_t   handle;       /* only for read/write requests         */ /* 2 */
> #ifdef CONFIG_X86_64
>         uint32_t       _pad1;        /* offsetof(blkif_request, u.rw.id) == 8 */
> #endif
>         uint64_t       id;           /* private guest value, echoed in resp  */
>         grant_ref_t    gref;         /* reference to indirect buffer frame, if used */
>         struct blkif_request_segment_aligned seg[4]; /* each is 8 bytes      */
> } __attribute__((__packed__));
> 
> 
> struct blkif_request {
>         uint8_t        operation;    /* BLKIF_OP_???                         */
>         union {
>                 struct blkif_request_rw rw;
>                 struct blkif_request_indirect indirect;
>                 /* ... other request types ... */
>         } u;
> } __attribute__((__packed__));
> 

Do you think it would be suitable to change the size of blkif_request? I
was thinking that if we introduce indirect requests we could reduce the
maximum number of segments per request to 2, so blkif_request is
reduced to 32 bytes and we can fit 126 requests (64 due to rounding)
in a single ring.

Everything that needs more than two segments will switch to using the 
new indirect operation. Also, to make use of the extra space in 
blkif_request_indirect I would remove the extra segments and instead 
pass more grant references to frames containing 
blkif_request_indirect_entry entries.

#define BLKIF_MAX_SEGMENTS_PER_REQUEST 2
#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 4

struct blkif_request_segment {
    grant_ref_t gref;        /* reference to I/O buffer frame        */
    /* @first_sect: first sector in frame to transfer (inclusive).   */
    /* @last_sect: last sector in frame to transfer (inclusive).     */
    uint8_t     first_sect, last_sect;
} __attribute__((__packed__));

struct blkif_request_indirect_entry {
    blkif_sector_t  sector_number;
    struct          blkif_request_segment seg;
    uint8_t         _pad[2];
} __attribute__((__packed__));

struct blkif_request_rw {
    uint8_t        nr_segments;  /* number of segments                   */
    blkif_vdev_t   handle;       /* only for read/write requests         */
    uint64_t       id;           /* private guest value, echoed in resp  */
    blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
    struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} __attribute__((__packed__));

struct blkif_request_indirect {
    uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE)   */
    uint16_t       nr_indirect;  /* number of indirect entries in gref frames */
    blkif_vdev_t   handle;       /* only for read/write requests         */
    uint64_t       id;           /* private guest value, echoed in resp  */
    /* reference to indirect buffer frame */
    grant_ref_t    gref[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
} __attribute__((__packed__));

struct blkif_request_discard {
    uint8_t        flag;         /* BLKIF_DISCARD_SECURE or zero.        */
#define BLKIF_DISCARD_SECURE (1<<0)  /* ignored if discard-secure=0          */
    uint64_t       id;           /* private guest value, echoed in resp  */
    blkif_sector_t sector_number;
    uint64_t       nr_sectors;
} __attribute__((__packed__));

struct blkif_request {
    uint8_t        operation;    /* BLKIF_OP_???                         */
    union {
        struct blkif_request_rw rw;
        struct blkif_request_indirect indirect;
        struct blkif_request_discard discard;
    } u;
} __attribute__((__packed__));

I've removed all the extra paddings, because this protocol is no longer 
compatible with the previous one (we modify 
BLKIF_MAX_SEGMENTS_PER_REQUEST), and packed the structures. I'm 
wondering if using __attribute__ in a public header is allowed, since 
this is a GNU extension (AFAIK supported by gcc and clang at least).

The main motivation behind the modification of the size of the
request is that I've realized that in your proposal, you were
adding 4 extra segments to the indirect request, and I was wondering
if I could pack more grant frames containing indirect entries in a
single request. By using the old size of blkif_request we would be able to
send up to 9 grant frames filled with indirect entries, which gives us
4096 * 256 * 9 = 9MB of outstanding IO in each request. I think that
this is too high, and will introduce a lot of latency to the protocol.
That's the main reason I've decided to reduce the size of blkif_request
in order to be able to fit more of them in a ring, instead of making
each of the requests bigger.

Using this implementation each request has a size of 32 bytes (to
prevent cache straddling), so we could fit 126 requests in a ring, but
due to the rounding, we end up with 64. Also each request can contain
4 grant pages, each one filled with 256 blkif_request_indirect_entry.
The math for the maximum data in a request is: 4096 * 256 * 4 = 4MB,
and in the whole ring 4 * 64 = 256MB worth of outstanding IO.
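
Spelling out those numbers (assuming 4KB pages and a 64-byte ring header):

    (4096 - 64) / 32 bytes      = 126 requests per page (rounded down to 64)
    4096 / 16 bytes             = 256 indirect entries per grant frame
    4 frames * 256 * 4096 bytes = 4MB addressable per request
    64 requests * 4MB           = 256MB of outstanding IO per ring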
 
> 
> 
> The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
> etc operation would now be in indirect.op. The indirect.gref points to
> a page that is filled with:
> 
> 
> struct blkif_request_indirect_entry {
>         blkif_sector_t sector_number;
>         struct blkif_request_segment seg;
> } __attribute__((__packed__));
> //16 bytes, so we can fit in a page 256 of these structures.
> 
> 
> This means that with the existing 36 slots in the ring (single page)
> we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
> ~= 32M. If we don’t want to use indirect descriptor we can still use
> up to 4 pages of the request (as it has enough space to contain four
> segments and the structure will still be cache-aligned).
> 
> 
> 
> 
> B). Both the producer (req_* and rsp_*) values are in the same
> cache-line. This means that we end up with the same cacheline being
> modified by two different guests. Depending on the architecture and
> placement of the guest this could be bad - as each logical CPU would
> try to write and read from the same cache-line. A mechanism where
> the req_* and rsp_ values are separated and on a different cache line
> could be used. The value of the correct cache-line and alignment could
> be negotiated via XenBus - in case future technologies start using 128
> bytes for cache or such. Or the the producer and consumer indexes are in
> separate rings. Meaning we have a ‘request ring’ and a ‘response
> ring’ - and only the ‘req_prod’, ‘req_event’ are modified in
> the ‘request ring’. The opposite (resp_*) are only modified in the
> ‘response ring’.

Using different rings for requests and responses sounds like a good idea;
this will reduce the number of variables on the ring, leaving only
"produced" and "event" (we no longer need the rsp_* and req_* pairs).

Since we will no longer have two different domains writing to the same
cache line I guess this will get rid of the cache ping-pong problem.

Also, while we are there, we could try to use the whole page instead of
rounding down to the closest power of two; this would allow us to go from 64
requests to 126 (if using the indirect descriptors proposal explained
above).
 
> C). Similar to B) problem but with a bigger payload. Each
> ‘blkif_sring_entry’ occupies 112 bytes which does not lend itself
> to a nice cache line size. If the indirect descriptors are to be used
> for everything we could ‘slim-down’ the blkif_request/response to
> be up-to 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST
> to 5 as that would slim the largest of the structures to 64-bytes.
> Naturally this means negotiating a new size of the structure via XenBus.
> 
> 
> D). The first picture shows the problem. We now aligning everything
> on the wrong cachelines. Worst in ⅓ of the cases we straddle
> three cache-lines. We could negotiate a proper alignment for each
> request/response structure.
> 
> 
> E). The network stack has showed that going in a polling mode does improve
> performance. The current mechanism of kicking the guest and or block
> backend is not always clear.  [TODO: Konrad to explain it in details]
> 
> 
> F). The current block protocol for big I/Os that the backend devices can
> handle ends up doing extra work by splitting the I/O in smaller chunks,
> and then reassembling them. With the solutions outlined in A) this can
> be fixed. This is easily seen with 1MB I/Os. Since each request can
> only handle 44kB that means we have to split a 1MB I/O in 24 requests
> (23 * 4096 * 11 = 1081344). Then the backend ends up sending them in
> sector-sizes- which with contemporary devices (such as SSD) ends up with
> more processing. The SSDs are comfortable handling 128kB or higher I/Os
> in one go.
> 
> 
> G). DIF/DIX. This a protocol to carry extra ‘checksum’ information
> for each I/O. The I/O can be a sector size, page-size or an I/O size
> (most popular are 1MB). The DIF/DIX needs 8 bytes of information for
> each I/O. It would be worth considering putting/reserving that amount of
> space in each request/response. Also putting in extra flags for future
> extensions would be worth it - however the author is not aware of any
> right now.
> 
> 
> H). Separate response/request. Potentially even multi-queue per-VCPU
> queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was
> flawed. There is no similar concept in the storage world were the
> operating system can put a food down and say: “everything before this
> has to be on the disk.” There are ligther versions of this - called
> ‘FUA’ and ‘FLUSH’. Depending on the internal implementation
> of the storage they are either ignored or do the right thing. The
> filesystems determine the viability of these flags and change writing
> tactics depending on this. From a protocol level, this means that the
> WRITE/READ/SYNC requests can be intermixed - the storage by itself
> determines the order of the operation. The filesystem is the one that
> determines whether the WRITE should be with a FLUSH to preserve some form
> of atomicity. This means we do not have to preserve an order of operations
> - so we can have multiple queues for request and responses. This has
> show in the network world to improve performance considerably.
> 
> 
> I). Wastage of response/request on the same ring. Currently each response
> MUST occupy the same amount of space that the request occupies - as the
> ring can have both responses and requests. Separating the request and
> response ring would remove the wastage.
> 
> 
> J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring
> entries is different if the guest is in 32-bit or 64-bit mode. Cleaning
> this up to be the same size would save considerable accounting that the
> host has to do (extra memcpy for each response/request).

So far I think most of the problems could be solved by using indirect
descriptors, aligning requests to a cache-sensible value (to prevent
straddling) and using separate rings for requests and responses. This
would be my plan for solving these issues:

1. Implement an LRU mechanism in blkback for the persistent grants list
(a rough sketch follows after this list). If we are going to increase the
amount of in-flight IO we need to be able to keep the list of persistently
mapped grants to something sensible, as we are no longer able to
persistently map all the grants that could be used. In my proposal that
would be 64 * 4 for each grant frame that contains indirect entries, plus
64 * 4 * 256 for all the possible entries in the indirect frames:
64 * 4 + 64 * 4 * 256 = 65792.

2. Implement indirect descriptors.

3. Implement new ring protocol, using two separate rings for requests
and replies.
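
A minimal sketch of what the LRU in point 1 could look like in blkback (the
limit, names and eviction policy below are illustrative only, using the
standard Linux list helpers):

#include <linux/list.h>
#include <linux/jiffies.h>
#include <xen/grant_table.h>

struct persistent_gnt {
        grant_ref_t       gref;
        void             *vaddr;        /* kernel mapping of the grant */
        unsigned long     last_used;    /* jiffies at last use         */
        struct list_head  lru;
};

static LIST_HEAD(persistent_lru);
static unsigned int persistent_count, persistent_limit = 1024;

static void persistent_gnt_used(struct persistent_gnt *gnt)
{
        gnt->last_used = jiffies;
        list_move_tail(&gnt->lru, &persistent_lru);  /* most recent at tail */
}

static void persistent_gnt_shrink(void)
{
        while (persistent_count > persistent_limit) {
                struct persistent_gnt *cold =
                        list_first_entry(&persistent_lru,
                                         struct persistent_gnt, lru);
                list_del(&cold->lru);
                /* unmap and free 'cold' here */
                persistent_count--;
        }
}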



* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-18 16:00 ` Roger Pau Monné
@ 2013-01-18 18:20   ` Konrad Rzeszutek Wilk
  2013-01-19 12:44     ` Roger Pau Monné
  2013-01-21 12:37     ` Ian Campbell
  0 siblings, 2 replies; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-18 18:20 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew

On Fri, Jan 18, 2013 at 05:00:10PM +0100, Roger Pau Monné wrote:
> On 18/12/12 15:31, Konrad Rzeszutek Wilk wrote:
> > Hey,
> > 
> > I am including some folks that are not always on Xen-devel to see if they have
> > some extra ideas or can correct my misunderstandings.
> > 
> > This is very much RFC - so there is bound to be some bugs.
> > The original is here
> > https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit
> > in case one wishes to modify/provide comment on that one.
> 
> Hello Konrad,
> 
> Thanks for the problem description and possible solutions, I'm 
> going to take on this (provided that you are OK with it).

The whole thing!? You are quite ambitious here!

> 
> > 
> > There are outstanding issues we have now with the block protocol:
> > Note: I am assuming 64-bit guest/host - as the size’s of the structures
> > change on 32-bit. I am also attaching the header for the blkif ring
> > as of today.
> > 
> > A) Segment size is limited to 11 pages. It means we can at most squeeze
> > in 44kB per request. The ring can hold 32 (next power of two below 36)
> > requests, meaning we can do 1.4M of outstanding requests.
> 
> I've been looking at the different options here and I think the most
> suitable one would be using a mechanism similar to VirtIO "indirect
> descriptors", it is the one that allows for more in-flight requests
> while being quite simple to implement.

<nods> I think you are right - plus it looks to be similar to how the more
modern hardware does requests (here I am thinking of NVMe and its command
descriptor - where it can point to a page full of DMA addresses + LBAs and say
"Go READ").

> 
> > B). Producer and consumer index is on the same cache line. In present hardware
> > that means the reader and writer will compete for the same cacheline causing a
> > ping-pong between sockets.
> > 
> > C). The requests and responses are on the same ring. This again causes the
> > ping-pong between sockets as the ownership of the cache line will shift between
> > sockets.
> > 
> > D). Cache alignment. Currently the protocol is 16-bit aligned. This is awkward
> > as the request and responses sometimes fit within a cacheline or sometimes
> > straddle them.
> > 
> > E). Interrupt mitigation. We are currently doing a kick whenever we are
> > done “processing” the ring. There are better ways to do this - and
> > we could use existing network interrupt mitigation techniques to make the
> > code poll when there is a lot of data.
> > 
> > F). Latency. The processing of the request limits us to only do 44kB - which means
> > that a 1MB chunk of data - which on contemporary devices would only use I/O request
> > - would be split up in multiple ‘requests’ inadvertently delaying the processing of
> > said block.
> > 
> > G) Future extensions. DIF/DIX for integrity. There might
> > be other in the future and it would be good to leave space for extra
> > flags TBD.
> > 
> > H). Separate the response and request rings. The current
> > implementation has one thread for one block ring. There is no reason why
> > there could not be two threads - one for responses and one for requests -
> > and especially if they are scheduled on different CPUs. Furthermore this
> > could also be split in multi-queues - so two queues (response and request)
> > on each vCPU.
> > 
> > I). We waste a lot of space on the ring - as we use the
> > ring for both requests and responses. The response structure needs to
> > occupy the same amount of space as the request structure (112 bytes). If
> > the request structure is expanded to be able to fit more segments (say
> > the ‘struct blkif_sring_entry is expanded to ~1500 bytes) that still
> > requires us to have a matching size response structure. We do not need
> > to use that much space for one response. Having a separate response ring
> > would simplify the structures.
> > 
> > J). 32 bit vs 64 bit. Right now the size
> > of the request structure is 112 bytes under 64-bit guest and 102 bytes
> > under 32-bit guest. It is confusing and furthermore requires the host
> > to do extra accounting and processing.
> > 
> > The crude drawing displays memory that the ring occupies in offset of
> > 64 bytes (cache line). Of course future CPUs could have different cache
> > lines (say 32 bytes?)- which would skew this drawing. A 32-bit ring is
> > a bit different as the ‘struct blkif_sring_entry’ is of 102 bytes.
> > 
> > 
> > The A) has two solutions to this (look at
> > http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
> > details). One proposed by Justin from Spectralogic has to negotiate
> > the segment size. This means that the ‘struct blkif_sring_entry’
> > is now a variable size. It can expand from 112 bytes (cover 11 pages of
> > data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple
> > extension by just making the array in the request expand from 11 to a
> > variable size negotiated.
> > 
> > 
> > The math is as follow.
> > 
> > 
> >         struct blkif_request_segment {
> >                 uint32_t grant;                  /* 4 bytes */
> >                 uint8_t  first_sect, last_sect;  /* 1 + 1 = 6 bytes total */
> >         };
> > (6 bytes for each segment) - the above structure is in an array of size
> > 11 in the request. The ‘struct blkif_sring_entry’ is 112 bytes. The
> > change is to expand the array - in this example we would tack on 245 extra
> > ‘struct blkif_request_segment’ - 245*6 + 112 = 1582. If we were to
> > use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages).
> > 
> > 
> > The other solution (from Intel - Ronghui) was to create one extra
> > ring that only has the ‘struct blkif_request_segment’ in them. The
> > ‘struct blkif_request’ would be changed to have an index in said
> > ‘segment ring’. There is only one segment ring. This means that the
> > size of the initial ring is still the same. The requests would point
> > to the segment and enumerate out how many of the indexes it wants to
> > use. The limit is of course the size of the segment. If one assumes a
> > one-page segment this means we can in one request cover ~4MB. The math
> > is as follow:
> > 
> > 
> > First request uses the half of the segment ring - so index 0 up
> > to 341 (out of 682). Each entry in the segment ring is a ‘struct
> > blkif_request_segment’ so it occupies 6 bytes. The other requests on
> > the ring (so there are 35 left) can use either the remaining 341 indexes
> > of the sgement ring or use the old style request. The old style request
> > can address use up to 44kB. For example:
> > 
> > 
> >  sring[0]->[uses 0->341 indexes in the segment ring] = 342*4096 = 1400832
> >  sring[1]->[use sthe old style request] = 11*4096 = 45056
> >  sring[2]->[uses 342->682 indexes in the segment ring] = 1392640
> >  sring[3..32] -> [uses the old style request] = 29*4096*11 = 1306624
> > 
> > 
> > Total: 4145152 bytes Naturally this could be extended to have a bigger
> > segment ring to cover more.
> > 
> > 
> > 
> > 
> > 
> > The problem with this extension is that we use 6 bytes and end up
> > straddling a cache line. Using 8 bytes will fix the cache straddling. This
> > would mean fitting only 512 segments per page.
> > 
> > 
> > There is yet another mechanism that could be employed  - and it borrows
> > from VirtIO protocol. And that is the ‘indirect descriptors’. This
> > very similar to what Intel suggests, but with a twist.
> > 
> > 
> > We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT), and the ‘struct
> > blkif_sring’ (each entry can be up to 112 bytes if needed - so the
> > old style request would fit). It would look like:
> > 
> > 
> > /* So 64 bytes under 64-bit. If necessary, the array (seg) can be
> >  * expanded to fit 11 segments as the old style request did. */
> > struct blkif_request_indirect {
> >         uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE)   */ /* 1 */
> >         blkif_vdev_t   handle;       /* only for read/write requests         */ /* 2 */
> > #ifdef CONFIG_X86_64
> >         uint32_t       _pad1;        /* offsetof(blkif_request, u.rw.id) == 8 */
> > #endif
> >         uint64_t       id;           /* private guest value, echoed in resp  */
> >         grant_ref_t    gref;         /* reference to indirect buffer frame, if used */
> >         struct blkif_request_segment_aligned seg[4]; /* each is 8 bytes      */
> > } __attribute__((__packed__));
> > 
> > 
> > struct blkif_request {
> >         uint8_t        operation;    /* BLKIF_OP_???                         */
> >         union {
> >                 struct blkif_request_rw rw;
> >                 struct blkif_request_indirect indirect;
> >                 /* ... other request types ... */
> >         } u;
> > } __attribute__((__packed__));
> > 
> 
> Do you think it will be suitable to change the size of blkif_request, I 
> was thinking that if we introduce indirect request we could reduce the 
> maximum number of segments per request to 2, so blkif_request is 
> reduced to 32bytes and we can fit 126 requests (64 due to rounding) 
> in a single ring. 

Altering it to be 32 bytes means we are still crossing two cache lines.

This is where I thought of putting 'struct blkif_request_rw' on a diet
and just making it have 4 of the grants - in case the request is
just 16K. And then the size of the structure (if I got my math right)
would be 64 bytes.

Naturally this means we need to negotiate a 'feature-request-size'
where v1 says that the request is 64 bytes long. v2 might
be 128 bytes or whatever else we would need to fit within a cacheline
on future CPUs. We could pad 'struct request_v1' with 64 bytes
in case the cacheline is 128 bytes and our request is 64 bytes.
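
A rough sketch of how the frontend side of that negotiation might look (the
'feature-request-size' key and the fallback value are assumptions of this
sketch, nothing is agreed yet):

#include <xen/xenbus.h>

static unsigned int blkfront_get_request_size(struct xenbus_device *dev)
{
        unsigned int size;

        /* If the backend does not advertise the key, stay with the
         * classic 112-byte (64-bit) request layout. */
        if (xenbus_scanf(XBT_NIL, dev->otherend,
                         "feature-request-size", "%u", &size) != 1)
                size = 112;

        return size;
}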


> 
> Everything that needs more than two segments will switch to using the 
> new indirect operation. Also, to make use of the extra space in 
> blkif_request_indirect I would remove the extra segments and instead 
> pass more grant references to frames containing 
> blkif_request_indirect_entry entries.

Right. The complications arise when we need to support this new format
and the old format and have to switch over when we migrate to old
backends.

> 
> #define BLKIF_MAX_SEGMENTS_PER_REQUEST 2
> #define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 4
> 
> struct blkif_request_segment {
>     grant_ref_t gref;        /* reference to I/O buffer frame        */
>     /* @first_sect: first sector in frame to transfer (inclusive).   */
>     /* @last_sect: last sector in frame to transfer (inclusive).     */
>     uint8_t     first_sect, last_sect;
> } __attribute__((__packed__));
> 
> struct blkif_request_indirect_entry {
>     blkif_sector_t  sector_number;
>     struct          blkif_request_segment seg;
>     uint8_t         _pad[2];
> } __attribute__((__packed__));
> 
> struct blkif_request_rw {
>     uint8_t        nr_segments;  /* number of segments                   */
>     blkif_vdev_t   handle;       /* only for read/write requests         */
>     uint64_t       id;           /* private guest value, echoed in resp  */
>     blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
>     struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> } __attribute__((__packed__));
> 
> struct blkif_request_indirect {
>     uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE    */
>     uint16_t       nr_indirect;  /* number of indirect entries in gref frames */
>     blkif_vdev_t   handle;       /* only for read/write requests         */
>     uint64_t       id;           /* private guest value, echoed in resp  */
>     /* reference to indirect buffer frame */
>     grant_ref_t    gref[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
> } __attribute__((__packed__));
> 
> struct blkif_request_discard {
>     uint8_t        flag;         /* BLKIF_DISCARD_SECURE or zero.        */
> #define BLKIF_DISCARD_SECURE (1<<0)  /* ignored if discard-secure=0          */
>     uint64_t       id;           /* private guest value, echoed in resp  */
>     blkif_sector_t sector_number;
>     uint64_t       nr_sectors;
> } __attribute__((__packed__));
> 
> struct blkif_request {
>     uint8_t        operation;    /* BLKIF_OP_???                         */
>     union {
>         struct blkif_request_rw rw;
>         struct blkif_request_indirect indirect;
>         struct blkif_request_discard discard;
>     } u;
> } __attribute__((__packed__));
> 
> I've removed all the extra paddings, because this protocol is no longer 
> compatible with the previous one (we modify 
> BLKIF_MAX_SEGMENTS_PER_REQUEST), and packed the structures. I'm 
> wondering if using __attribute__ in a public header is allowed, since 
> this is a GNU extension (AFAIK supported by gcc and clang at least).

I don't know. But I think we can bypass the need for it if we calculate
the size of each entry right and insert padding whenever necessary.

> 
> The main motivation behind the modification of the size of the 
> request is because I've realized that in your proposal, you where 
> adding 4 extra segments to the indirect request, and I was wondering 
> if I could pack more grant frames containing indirect entries in a 
> single request. By using the old size of blkf_request we will be able to
> send up to 9 grant frames filled with indirect entries, which gives us
> 4096 * 256 * 9 = 9MB of outstanding IO in each request. I think that
> this is too high, and will introduce a lot of latency to the protocol.
> That's the main reason I've decided to reduce the size of blkif_request
> in order to be able to fit more of them in a ring, instead of making
> each of the requests bigger.

OK. That is OK as long as we make sure to _not_ have two requests
straddling the cache-line. Or maybe that is not a problem.

But the other nice benefit of your idea - where the request is only
32 bytes - is that we now have an extra 32 bytes where we can insert such
things as DIF/DIX or "other" new features - and that way the request
would be of size 64 bytes.

> 
> Using this implementation each request has a size of 32bytes (to 
> prevent cache straddle), so we could fit 126 requests in a ring, but 
> due to the rounding, we end up with 64. Also each request can contain 
> 4 grant pages, each one filled with 256 blkif_request_indirect_entry. 
> The math for the maximum data in a request is: 4096 * 256 * 4 = 4MB, 
> and in the whole ring 4 * 64 = 256MB worth of outstanding IO.
>  

Which has a nice power of two ring to it. (Ah the puns!)

I like the idea of putting the request on a diet - but too much
could cause us to miss the opportunity to insert other flags on it.
If I recall correctly, the DIF/DIX only need 8 bytes of data.
If we make the assumption that:
	I/O request = one ring entry

and the "one ring entry" can use the '4' grants if we just have a
16KB I/O request, but if it is more than that - we use the indirect page
and can stuff 1MB of data in there.
That leaves the extra 32 bytes of space for such things as 'DIF/DIX'. This
also means we could unify the 'struct request' with the 'discard' operation
and it could utilize the 32 bytes of extra unused payload data.
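
To make that concrete, a 64-byte entry could look roughly like this (only
a sketch - the field names and layout are made up, and I am assuming the
8-byte 'blkif_request_segment_aligned' from the original proposal):

struct blkif_request_v2 {                        /* hypothetical, 64 bytes     */
    uint8_t        operation;                    /* BLKIF_OP_*                 */
    uint8_t        nr_segments;                  /* 0..4 inline segments       */
    blkif_vdev_t   handle;
    uint32_t       flags;                        /* room for future extensions */
    uint64_t       id;                           /* echoed in the response     */
    blkif_sector_t sector_number;
    uint64_t       dif_dix;                      /* 8 bytes reserved, see G)   */
    struct blkif_request_segment_aligned seg[4]; /* 4 * 8 bytes = 32           */
};

A discard could then reuse the seg[]/dif_dix area for its sector count,
which is the unification mentioned above.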

> > 
> > 
> > The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
> > etc operation would now be in indirect.op. The indirect.gref points to
> > a page that is filled with:
> > 
> > 
> > struct blkif_request_indirect_entry {
> >         blkif_sector_t sector_number;
> >         struct blkif_request_segment seg;
> > } __attribute__((__packed__));
> > //16 bytes, so we can fit in a page 256 of these structures.
> > 
> > 
> > This means that with the existing 36 slots in the ring (single page)
> > we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
> > ~= 32M. If we don’t want to use indirect descriptor we can still use
> > up to 4 pages of the request (as it has enough space to contain four
> > segments and the structure will still be cache-aligned).
> > 
> > 
> > 
> > 
> > B). Both the producer (req_* and rsp_*) values are in the same
> > cache-line. This means that we end up with the same cacheline being
> > modified by two different guests. Depending on the architecture and
> > placement of the guest this could be bad - as each logical CPU would
> > try to write and read from the same cache-line. A mechanism where
> > the req_* and rsp_ values are separated and on a different cache line
> > could be used. The value of the correct cache-line and alignment could
> > be negotiated via XenBus - in case future technologies start using 128
> > bytes for cache or such. Or the the producer and consumer indexes are in
> > separate rings. Meaning we have a ‘request ring’ and a ‘response
> > ring’ - and only the ‘req_prod’, ‘req_event’ are modified in
> > the ‘request ring’. The opposite (resp_*) are only modified in the
> > ‘response ring’.
> 
> Using different rings for requests and responses sounds like a good idea,
> this will reduce the number of variables on the ring, leaving only
> "produced" and "event" (we no longer need rsp_ and req_).

<nods>
> 
> Since we will no longer have two different domains writing to the same
> cache line I guess this will get rid of the cache ping-pong problem.

<nods> That is my thinking.
> 
> Also while there we could try to use all the page, instead of rounding
> down to the closer power of two, this will allow us to go from 64
> requests to 126 (if using the indirect descriptors proposal explained
> above).

Right.
Let's call this 'feature-separate-req-and-rsp'?

>  
> > C). Similar to B) problem but with a bigger payload. Each
> > ‘blkif_sring_entry’ occupies 112 bytes which does not lend itself
> > to a nice cache line size. If the indirect descriptors are to be used
> > for everything we could ‘slim-down’ the blkif_request/response to
> > be up-to 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST
> > to 5 as that would slim the largest of the structures to 64-bytes.
> > Naturally this means negotiating a new size of the structure via XenBus.

Here I am using 5, and I think you were using 4 - and you mentioned you
got it down to 32-bytes. I should double-check the 'sizeof' then to
see what it would be like.
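
A couple of build-time asserts in blkfront/blkback would catch it if the
math is off - sketch only, using the struct names and sizes from your
proposal:

BUILD_BUG_ON(sizeof(struct blkif_request) != 32);
BUILD_BUG_ON(sizeof(struct blkif_request_indirect_entry) != 16);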

> > 
> > 
> > D). The first picture shows the problem. We now aligning everything
> > on the wrong cachelines. Worst in ⅓ of the cases we straddle
> > three cache-lines. We could negotiate a proper alignment for each
> > request/response structure.
> > 
> > 
> > E). The network stack has showed that going in a polling mode does improve
> > performance. The current mechanism of kicking the guest and or block
> > backend is not always clear.  [TODO: Konrad to explain it in details]

Oh, I never did explain this - but I think the patches that Daniel came
up with actually fix a part of it. They make the kick-the-other guest
only happen when the backend has processed all of the requests and
cannot find anything else to do. Previously it was more of 'done one
request, lets kick the backend.'.

But going forward, this 'kick-the-other-guest' could be further modulated.
If we have a full ring and the frontend keeps on adding entries we (backend)
should never get an interrupt. As a matter of fact the frontend should be just
polling all the time and process them as fast as possible.

Or we could add tweaks in the backend to adopt an 'adaptive interrupt
coalescing' mechanism. This way we only kick the frontend if a certain
threshold of I/Os have been processed or if we are slowing down on the
threshold. Again, the idea is that - if we have the rings full we should
not send 'kicks' at all and just keep on processing.

Note: I *believe* that the backend does this - keeps on processing and
won't kick if the ring is full. But I am not 100% sure on the frontend side.
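
In blkback terms the 'kick only once we have drained the ring' part would
look roughly like this (a sketch only - the function and field names
approximate what blkback has today, this is not the actual code):

static void push_and_maybe_kick(struct xen_blkif *blkif)
{
    int notify;

    /* consume everything the frontend has queued before considering a kick */
    while (do_block_io_op(blkif))
        ;

    RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blkif->blk_rings.common, notify);
    if (notify)
        notify_remote_via_irq(blkif->irq);
}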

> > 
> > 
> > F). The current block protocol for big I/Os that the backend devices can
> > handle ends up doing extra work by splitting the I/O in smaller chunks,
> > and then reassembling them. With the solutions outlined in A) this can
> > be fixed. This is easily seen with 1MB I/Os. Since each request can
> > only handle 44kB that means we have to split a 1MB I/O in 24 requests
> > (23 * 4096 * 11 = 1081344). Then the backend ends up sending them in
> > sector-sizes- which with contemporary devices (such as SSD) ends up with
> > more processing. The SSDs are comfortable handling 128kB or higher I/Os
> > in one go.
> > 
> > 
> > G). DIF/DIX. This a protocol to carry extra ‘checksum’ information
> > for each I/O. The I/O can be a sector size, page-size or an I/O size
> > (most popular are 1MB). The DIF/DIX needs 8 bytes of information for
> > each I/O. It would be worth considering putting/reserving that amount of
> > space in each request/response. Also putting in extra flags for future
> > extensions would be worth it - however the author is not aware of any
> > right now.
> > 
> > 
> > H). Separate response/request. Potentially even multi-queue per-VCPU
> > queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was
> > flawed. There is no similar concept in the storage world were the
> > operating system can put a food down and say: “everything before this
> > has to be on the disk.” There are ligther versions of this - called
> > ‘FUA’ and ‘FLUSH’. Depending on the internal implementation
> > of the storage they are either ignored or do the right thing. The
> > filesystems determine the viability of these flags and change writing
> > tactics depending on this. From a protocol level, this means that the
> > WRITE/READ/SYNC requests can be intermixed - the storage by itself
> > determines the order of the operation. The filesystem is the one that
> > determines whether the WRITE should be with a FLUSH to preserve some form
> > of atomicity. This means we do not have to preserve an order of operations
> > - so we can have multiple queues for request and responses. This has
> > show in the network world to improve performance considerably.
> > 
> > 
> > I). Wastage of response/request on the same ring. Currently each response
> > MUST occupy the same amount of space that the request occupies - as the
> > ring can have both responses and requests. Separating the request and
> > response ring would remove the wastage.
> > 
> > 
> > J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring
> > entries is different if the guest is in 32-bit or 64-bit mode. Cleaning
> > this up to be the same size would save considerable accounting that the
> > host has to do (extra memcpy for each response/request).
> 
> So forth I think most of the problems could be solved by using indirect
> descriptors, aligning requests to a cache sensible value (to prevent
> straddling) and using separate rings for requests and responses. This 
> would be my plan for solving this issues:

That would take care of the bulk of the big issues.

> 
> 1. Implement LRU mechanism in blkback for the persistent grants list. If
> we are going to increment the number of in-flight IO we need to be able to
> keep the list of persistently mapped grants to something sensible, and
> we are no longer able to persistently map all the grants that could be 
> used. In my proposal that would be 64 * 4 for each grant frame that 

Can you explain the foundations behind the '64' and '4'? I think I know
the 4 (BLKIF_..SEGMENTS) but I am not sure about '64'.

> contains indirect entries, plus 64 * 4 * 256, for all the possible
> entries in the indirect frame 64 * 4 + 64 * 4 * 256 = 65792.

That would cover ~256MB of data per guest. Wow. 
> 
> 2. Implement indirect descriptors.
> 
> 3. Implement new ring protocol, using two separate rings for requests
> and replies.

Don't we want to negotiate each of these proposals as a separate feature? So:
	separate-req-and-rsp
	slim-version-of-req
	indirect-descriptor
	multi-page-ring
	alignment-64-or-other-value
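
Each of these could then be advertised in the usual way from the backend's
xenstore directory - a sketch only, the node names are simply the strings
above plus the 'feature-request-size' idea:

err = xenbus_printf(XBT_NIL, dev->nodename,
                    "feature-separate-req-and-rsp", "%d", 1);
if (!err)
    err = xenbus_printf(XBT_NIL, dev->nodename,
                        "feature-indirect-descriptor", "%d", 1);
if (!err)
    err = xenbus_printf(XBT_NIL, dev->nodename,
                        "feature-request-size", "%u", 64);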



* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-18 18:20   ` Konrad Rzeszutek Wilk
@ 2013-01-19 12:44     ` Roger Pau Monné
  2013-01-22 19:46       ` Konrad Rzeszutek Wilk
  2013-01-21 12:37     ` Ian Campbell
  1 sibling, 1 reply; 19+ messages in thread
From: Roger Pau Monné @ 2013-01-19 12:44 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew

On 18/01/13 19:20, Konrad Rzeszutek Wilk wrote:
> On Fri, Jan 18, 2013 at 05:00:10PM +0100, Roger Pau Monné wrote:
>> On 18/12/12 15:31, Konrad Rzeszutek Wilk wrote:
>>> Hey,
>>>
>>> I am including some folks that are not always on Xen-devel to see if they have
>>> some extra ideas or can correct my misunderstandings.
>>>
>>> This is very much RFC - so there is bound to be some bugs.
>>> The original is here
>>> https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit
>>> in case one wishes to modify/provide comment on that one.
>>
>> Hello Konrad,
>>
>> Thanks for the problem description and possible solutions, I'm
>> going to take on this (provided that you are OK with it).
> 
> The whole thing!? You are quite ambitious here!

I will try to fix as much as possible, one thing at a time - don't expect
a series fixing all the issues overnight :)

>>
>>>
>>> There are outstanding issues we have now with the block protocol:
>>> Note: I am assuming 64-bit guest/host - as the size’s of the structures
>>> change on 32-bit. I am also attaching the header for the blkif ring
>>> as of today.
>>>
>>> A) Segment size is limited to 11 pages. It means we can at most squeeze
>>> in 44kB per request. The ring can hold 32 (next power of two below 36)
>>> requests, meaning we can do 1.4M of outstanding requests.
>>
>> I've been looking at the different options here and I think the most
>> suitable one would be using a mechanism similar to VirtIO "indirect
>> descriptors", it is the one that allows for more in-flight requests
>> while being quite simple to implement.
> 
> <nods> I think you are right - plus it looks to be similar to how the more
> modern hardware does requests (here I am thinking of NVMe and its command
> descriptor - where it can point to a page full of DMA addresses + LBAs and say
> "Go READ").
> 
>>
>>> B). Producer and consumer index is on the same cache line. In present hardware
>>> that means the reader and writer will compete for the same cacheline causing a
>>> ping-pong between sockets.
>>>
>>> C). The requests and responses are on the same ring. This again causes the
>>> ping-pong between sockets as the ownership of the cache line will shift between
>>> sockets.
>>>
>>> D). Cache alignment. Currently the protocol is 16-bit aligned. This is awkward
>>> as the request and responses sometimes fit within a cacheline or sometimes
>>> straddle them.
>>>
>>> E). Interrupt mitigation. We are currently doing a kick whenever we are
>>> done “processing” the ring. There are better ways to do this - and
>>> we could use existing network interrupt mitigation techniques to make the
>>> code poll when there is a lot of data.
>>>
>>> F). Latency. The processing of the request limits us to only do 44kB - which means
>>> that a 1MB chunk of data - which on contemporary devices would only use I/O request
>>> - would be split up in multiple ‘requests’ inadvertently delaying the processing of
>>> said block.
>>>
>>> G) Future extensions. DIF/DIX for integrity. There might
>>> be other in the future and it would be good to leave space for extra
>>> flags TBD.
>>>
>>> H). Separate the response and request rings. The current
>>> implementation has one thread for one block ring. There is no reason why
>>> there could not be two threads - one for responses and one for requests -
>>> and especially if they are scheduled on different CPUs. Furthermore this
>>> could also be split in multi-queues - so two queues (response and request)
>>> on each vCPU.
>>>
>>> I). We waste a lot of space on the ring - as we use the
>>> ring for both requests and responses. The response structure needs to
>>> occupy the same amount of space as the request structure (112 bytes). If
>>> the request structure is expanded to be able to fit more segments (say
>>> the ‘struct blkif_sring_entry is expanded to ~1500 bytes) that still
>>> requires us to have a matching size response structure. We do not need
>>> to use that much space for one response. Having a separate response ring
>>> would simplify the structures.
>>>
>>> J). 32 bit vs 64 bit. Right now the size
>>> of the request structure is 112 bytes under 64-bit guest and 102 bytes
>>> under 32-bit guest. It is confusing and furthermore requires the host
>>> to do extra accounting and processing.
>>>
>>> The crude drawing displays memory that the ring occupies in offset of
>>> 64 bytes (cache line). Of course future CPUs could have different cache
>>> lines (say 32 bytes?)- which would skew this drawing. A 32-bit ring is
>>> a bit different as the ‘struct blkif_sring_entry’ is of 102 bytes.
>>>
>>>
>>> The A) has two solutions to this (look at
>>> http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
>>> details). One proposed by Justin from Spectralogic has to negotiate
>>> the segment size. This means that the ‘struct blkif_sring_entry’
>>> is now a variable size. It can expand from 112 bytes (cover 11 pages of
>>> data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple
>>> extension by just making the array in the request expand from 11 to a
>>> variable size negotiated.
>>>
>>>
>>> The math is as follow.
>>>
>>>
>>>         struct blkif_request_segment {
>>>                 uint32 grant;                         // 4 bytes uint8_t
>>>                 first_sect, last_sect;// 1, 1 = 6 bytes
>>>         }
>>> (6 bytes for each segment) - the above structure is in an array of size
>>> 11 in the request. The ‘struct blkif_sring_entry’ is 112 bytes. The
>>> change is to expand the array - in this example we would tack on 245 extra
>>> ‘struct blkif_request_segment’ - 245*6 + 112 = 1582. If we were to
>>> use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages).
>>>
>>>
>>> The other solution (from Intel - Ronghui) was to create one extra
>>> ring that only has the ‘struct blkif_request_segment’ in them. The
>>> ‘struct blkif_request’ would be changed to have an index in said
>>> ‘segment ring’. There is only one segment ring. This means that the
>>> size of the initial ring is still the same. The requests would point
>>> to the segment and enumerate out how many of the indexes it wants to
>>> use. The limit is of course the size of the segment. If one assumes a
>>> one-page segment this means we can in one request cover ~4MB. The math
>>> is as follow:
>>>
>>>
>>> First request uses the half of the segment ring - so index 0 up
>>> to 341 (out of 682). Each entry in the segment ring is a ‘struct
>>> blkif_request_segment’ so it occupies 6 bytes. The other requests on
>>> the ring (so there are 35 left) can use either the remaining 341 indexes
>>> of the sgement ring or use the old style request. The old style request
>>> can address use up to 44kB. For example:
>>>
>>>
>>>  sring[0]->[uses 0->341 indexes in the segment ring] = 342*4096 = 1400832
>>>  sring[1]->[use sthe old style request] = 11*4096 = 45056
>>>  sring[2]->[uses 342->682 indexes in the segment ring] = 1392640
>>>  sring[3..32] -> [uses the old style request] = 29*4096*11 = 1306624
>>>
>>>
>>> Total: 4145152 bytes Naturally this could be extended to have a bigger
>>> segment ring to cover more.
>>>
>>>
>>>
>>>
>>>
>>> The problem with this extension is that we use 6 bytes and end up
>>> straddling a cache line. Using 8 bytes will fix the cache straddling. This
>>> would mean fitting only 512 segments per page.
>>>
>>>
>>> There is yet another mechanism that could be employed  - and it borrows
>>> from VirtIO protocol. And that is the ‘indirect descriptors’. This
>>> very similar to what Intel suggests, but with a twist.
>>>
>>>
>>> We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT), and the ‘struct
>>> blkif_sring’ (each entry can be up to 112 bytes if needed - so the
>>> old style request would fit). It would look like:
>>>
>>>
>>> /* so 64 bytes under 64-bit. If necessary, the array (seg) can be
>>> expanded to fit 11 segments as the old style request did */ struct
>>> blkif_request_indirect {
>>>         uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE    */
>>> // 1 blkif_vdev_t   handle;       /* only for read/write requests         */ // 2
>>> #ifdef CONFIG_X86_64
>>>         uint32_t       _pad1;             /* offsetof(blkif_request,u.rw.id) == 8 */ // 2
>>> #endif
>>>         uint64_t       id;           /* private guest value, echoed in resp  */
>>>         grant_ref_t    gref;        /* reference to indirect buffer frame  if used*/
>>>             struct blkif_request_segment_aligned seg[4]; // each is 8 bytes
>>>
>>> } __attribute__((__packed__));
>>>
>>>
>>> struct blkif_request {
>>>         uint8_t        operation;    /* BLKIF_OP_???  */
>>>         union {
>>>                 struct blkif_request_rw rw;
>>>                 struct blkif_request_indirect
>>>                 indirect; … other ..
>>>         } u;
>>> } __attribute__((__packed__));
>>>
>>
>> Do you think it will be suitable to change the size of blkif_request, I
>> was thinking that if we introduce indirect request we could reduce the
>> maximum number of segments per request to 2, so blkif_request is
>> reduced to 32bytes and we can fit 126 requests (64 due to rounding)
>> in a single ring.
> 
> Altering it to be 32-bytes means we are still crossing two cachelines.

Are you sure? Using 32-byte requests on a 64-byte cache line means we have
two requests in the same cache line, but it's impossible for a request to
use two different cache lines (which I guess is the main problem):

                64bytes
    <------------------------------>
    +------------------------------+
    | ring header + pad            |
    +------------------------------+
       32bytes
    <--------------->
    +---------------+--------------+
    | req[0]        | req[1]       |
    +---------------+--------------+
    | req[2]        | req[3]       |
    +---------------+--------------+

> 
> This is where I thought of putting 'struct blkif_request_rw' on a diet
> and just making it have 4 of the grants - in case the request is
> just 16K. And then the size of the structure (if I got my math right)
> would be 64-bytes.

Using 64bytes still leaves us with only 32 requests on the ring (unless
we change the ring protocol to support a ring size different than a
power of two).

> Naturally this means we need to negotiate a 'feature-request-size'
> where v1 says that the request is of 64-bytes length. v2 might
> be 128bytes or whatever else we would need to fit within a cacheline
> in future CPUs. We could pad the 'struct request_v1' with 64-bytes
> in case the cacheline is 128bytes and our request is 64-bytes.

I would have to benchmark whether having two requests on the same cache line
or only a single one has any kind of impact on performance. As I
understand it, the only problem with this approach might happen when
blkback is writing a reply and blkfront is also writing a request in the
same cache line (which will be solved by using two different rings, one
for replies and one for requests).

So I think a compromise solution is to use 32-byte requests, which would
always be cache aligned, and to use two separate rings to solve the
ping-pong problem of blkfront/blkback writing to the same cache line.
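
Putting those two things together, the request side could then look
something like this (struct name and exact layout invented, just to
picture it):

struct blkif_req_sring {
    RING_IDX prod;                  /* producer index for requests          */
    RING_IDX event;                 /* notification threshold               */
    uint8_t  _pad[56];              /* keep the indexes on their own line   */
    struct blkif_request req[126];  /* 126 * 32 bytes = 4032                */
};                                  /* exactly one 4096-byte page           */

with a mirror-image blkif_rsp_sring for the response direction.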

> 
>>
>> Everything that needs more than two segments will switch to using the
>> new indirect operation. Also, to make use of the extra space in
>> blkif_request_indirect I would remove the extra segments and instead
>> pass more grant references to frames containing
>> blkif_request_indirect_entry entries.
> 
> Right. The complications arise when we need to support this new format
> and the old format and have to switch over when we migrate to old
> backends.
> 
>>
>> #define BLKIF_MAX_SEGMENTS_PER_REQUEST 2
>> #define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 4
>>
>> struct blkif_request_segment {
>>     grant_ref_t gref;        /* reference to I/O buffer frame        */
>>     /* @first_sect: first sector in frame to transfer (inclusive).   */
>>     /* @last_sect: last sector in frame to transfer (inclusive).     */
>>     uint8_t     first_sect, last_sect;
>> } __attribute__((__packed__));
>>
>> struct blkif_request_indirect_entry {
>>     blkif_sector_t  sector_number;
>>     struct          blkif_request_segment seg;
>>     uint8_t         _pad[2];
>> } __attribute__((__packed__));
>>
>> struct blkif_request_rw {
>>     uint8_t        nr_segments;  /* number of segments                   */
>>     blkif_vdev_t   handle;       /* only for read/write requests         */
>>     uint64_t       id;           /* private guest value, echoed in resp  */
>>     blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
>>     struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> } __attribute__((__packed__));
>>
>> struct blkif_request_indirect {
>>     uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE    */
>>     uint16_t       nr_indirect;  /* number of indirect entries in gref frames */
>>     blkif_vdev_t   handle;       /* only for read/write requests         */
>>     uint64_t       id;           /* private guest value, echoed in resp  */
>>     /* reference to indirect buffer frame */
>>     grant_ref_t    gref[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
>> } __attribute__((__packed__));
>>
>> struct blkif_request_discard {
>>     uint8_t        flag;         /* BLKIF_DISCARD_SECURE or zero.        */
>> #define BLKIF_DISCARD_SECURE (1<<0)  /* ignored if discard-secure=0          */
>>     uint64_t       id;           /* private guest value, echoed in resp  */
>>     blkif_sector_t sector_number;
>>     uint64_t       nr_sectors;
>> } __attribute__((__packed__));
>>
>> struct blkif_request {
>>     uint8_t        operation;    /* BLKIF_OP_???                         */
>>     union {
>>         struct blkif_request_rw rw;
>>         struct blkif_request_indirect indirect;
>>         struct blkif_request_discard discard;
>>     } u;
>> } __attribute__((__packed__));
>>
>> I've removed all the extra paddings, because this protocol is no longer
>> compatible with the previous one (we modify
>> BLKIF_MAX_SEGMENTS_PER_REQUEST), and packed the structures. I'm
>> wondering if using __attribute__ in a public header is allowed, since
>> this is a GNU extension (AFAIK supported by gcc and clang at least).
> 
> I don't know. But I think we can bypass the need for it if we calculate
> the size of each entry right and insert padding whenever neccessary.

Yes, I would like to avoid having different sizes depending on the
architecture.

> 
>>
>> The main motivation behind the modification of the size of the
>> request is because I've realized that in your proposal, you where
>> adding 4 extra segments to the indirect request, and I was wondering
>> if I could pack more grant frames containing indirect entries in a
>> single request. By using the old size of blkf_request we will be able to
>> send up to 9 grant frames filled with indirect entries, which gives us
>> 4096 * 256 * 9 = 9MB of outstanding IO in each request. I think that
>> this is too high, and will introduce a lot of latency to the protocol.
>> That's the main reason I've decided to reduce the size of blkif_request
>> in order to be able to fit more of them in a ring, instead of making
>> each of the requests bigger.
> 
> OK. That is OK as long as we make sure to _not_ have two requests
> straddling the cache-line. Or maybe that is not a problem.
> 
> But the other nice benefit of your idea - where it is only 32-bytes
> - is that we have now extra 32-bytes where we can insert such
> things as DIF/DIX or "other" new features. And that way have the
> request be of size 64 bytes.

So you prefer to keep the ring size at 32, instead of increasing it to
64, and use the remaining 32 bytes in the request to store
request-related data?

> 
>>
>> Using this implementation each request has a size of 32bytes (to
>> prevent cache straddle), so we could fit 126 requests in a ring, but
>> due to the rounding, we end up with 64. Also each request can contain
>> 4 grant pages, each one filled with 256 blkif_request_indirect_entry.
>> The math for the maximum data in a request is: 4096 * 256 * 4 = 4MB,
>> and in the whole ring 4 * 64 = 256MB worth of outstanding IO.
>>
> 
> Which has a nice power of two ring to it. (Ah the puns!)
> 
> I like the idea of putting the request on a diet - but too much
> could cause us to miss the opportunity to insert other flags on it.
> If I recall correctly, the DIF/DIX only need 8 bytes of data.
> If we make the assumption that:
>         I/O request = one ring entry

So we only need to reserve 8 bytes for DIF/DIX, even if the request
contains a variable amount of data? (I mean, block requests can at a
minimum contain 4096 bytes, or much more)

> and the  "one ring entry" can use the the '4' grants if we just have a
> 16KB I/O request, but if it is more than that - we use the indirect page

Well, in my proposal I've limited the number of segments of a "rw"
request to 2, so it's only 8K; anything bigger has to use indirect
descriptors, which can fit 4M of data (because I'm passing 4 grant
frames full of "blkif_request_indirect_entry" entries).

> and can stuff 1MB of data in there.
> The extra 32-bytes of space for such things as 'DIF/DIX'. This also
> means we could unify the 'struct request' with the 'discard' operation
> and it could utilize the 32-bytes of extra unused payload data.
> 
>>>
>>>
>>> The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
>>> etc operation would now be in indirect.op. The indirect.gref points to
>>> a page that is filled with:
>>>
>>>
>>> struct blkif_request_indirect_entry {
>>>         blkif_sector_t sector_number;
>>>         struct blkif_request_segment seg;
>>> } __attribute__((__packed__));
>>> //16 bytes, so we can fit in a page 256 of these structures.
>>>
>>>
>>> This means that with the existing 36 slots in the ring (single page)
>>> we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
>>> ~= 32M. If we don’t want to use indirect descriptor we can still use
>>> up to 4 pages of the request (as it has enough space to contain four
>>> segments and the structure will still be cache-aligned).
>>>
>>>
>>>
>>>
>>> B). Both the producer (req_* and rsp_*) values are in the same
>>> cache-line. This means that we end up with the same cacheline being
>>> modified by two different guests. Depending on the architecture and
>>> placement of the guest this could be bad - as each logical CPU would
>>> try to write and read from the same cache-line. A mechanism where
>>> the req_* and rsp_ values are separated and on a different cache line
>>> could be used. The value of the correct cache-line and alignment could
>>> be negotiated via XenBus - in case future technologies start using 128
>>> bytes for cache or such. Or the the producer and consumer indexes are in
>>> separate rings. Meaning we have a ‘request ring’ and a ‘response
>>> ring’ - and only the ‘req_prod’, ‘req_event’ are modified in
>>> the ‘request ring’. The opposite (resp_*) are only modified in the
>>> ‘response ring’.
>>
>> Using different rings for requests and responses sounds like a good idea,
>> this will reduce the number of variables on the ring, leaving only
>> "produced" and "event" (we no longer need rsp_ and req_).
> 
> <nods>
>>
>> Since we will no longer have two different domains writing to the same
>> cache line I guess this will get rid of the cache ping-pong problem.
> 
> <nods> That is my thinking.
>>
>> Also while there we could try to use all the page, instead of rounding
>> down to the closer power of two, this will allow us to go from 64
>> requests to 126 (if using the indirect descriptors proposal explained
>> above).
> 
> Right.
> Lets call this 'feature-seperate-req-and-rsp' ?

Yes, and blkfront will also need to pass two different grant frames at
init, because we need to map two rings.
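
Something like this in blkfront's setup path (just a sketch - the
xenstore node names are placeholders for whatever we agree on):

err = xenbus_printf(xbt, dev->nodename, "req-ring-ref", "%u",
                    info->req_ring_ref);
if (!err)
    err = xenbus_printf(xbt, dev->nodename, "rsp-ring-ref", "%u",
                        info->rsp_ring_ref);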

> 
>>
>>> C). Similar to B) problem but with a bigger payload. Each
>>> ‘blkif_sring_entry’ occupies 112 bytes which does not lend itself
>>> to a nice cache line size. If the indirect descriptors are to be used
>>> for everything we could ‘slim-down’ the blkif_request/response to
>>> be up-to 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST
>>> to 5 as that would slim the largest of the structures to 64-bytes.
>>> Naturally this means negotiating a new size of the structure via XenBus.
> 
> Here I am using 5, and I think you were using 4 - and you mentioned you
> got it down to 32-bytes. I should double-check the 'sizeof' then to
> see what it would be like.

No, I was limiting it to 2 segments in the "rw" request, the 4 comes
because I'm passing 4 grant frames in the "blkif_request_indirect",
instead of just one (see above, "gref" is now an array in my proposal).

> 
>>>
>>>
>>> D). The first picture shows the problem. We now aligning everything
>>> on the wrong cachelines. Worst in ⅓ of the cases we straddle
>>> three cache-lines. We could negotiate a proper alignment for each
>>> request/response structure.
>>>
>>>
>>> E). The network stack has showed that going in a polling mode does improve
>>> performance. The current mechanism of kicking the guest and or block
>>> backend is not always clear.  [TODO: Konrad to explain it in details]
> 
> Oh, I never did explain this - but I think the patches that Daniel came
> up with actually fix a part of it. They make the kick-the-other guest
> only happen when the backend has processed all of the requests and
> cannot find anything else to do. Previously it was more of 'done one
> request, lets kick the backend.'.

Yes, that's what I'm seeing in the current code.

> But going forward, this 'kick-the-other-guest' could be further modulated.
> If we have a full ring and the frontend keeps on adding entries we (backend)
> should never get an interrupt. As of matter of fact the frontend should be just
> polling all the time and process them as fast as possible.

Yes, we could switch to polling on both ends and only "kick" when the
ring is full (which should not usually happen); this will probably
reduce CPU utilization, but I doubt it's going to increase the throughput.

> Or we could add tweaks in the backend to adopt an 'adaptive interrupt
> coelescing' mechanism. This way we only kick the frontend if a certain
> threshold of I/Os have been processed or if we are slowing down on the
> threshold. Again, the idea is that - if we have the rings full we should
> not send 'kicks' at all and just keep on processing.
> 
> Note: I *believe* that the backend does this - keeps on processing and
> won't kick if the ring is full. But I am not 100% sure on the frontend side.

Changing this is not really difficult (in my opinion, but I might be
wrong), so I think we could try several approaches and see which one
gives better performance (or reduces CPU utilization).

Anyway, I would focus now on the indirect descriptors and separate the
rings, and leave this work for afterwards (because what we could
probably see now might not be true once the other improvements are done).

>>>
>>>
>>> F). The current block protocol for big I/Os that the backend devices can
>>> handle ends up doing extra work by splitting the I/O in smaller chunks,
>>> and then reassembling them. With the solutions outlined in A) this can
>>> be fixed. This is easily seen with 1MB I/Os. Since each request can
>>> only handle 44kB that means we have to split a 1MB I/O in 24 requests
>>> (23 * 4096 * 11 = 1081344). Then the backend ends up sending them in
>>> sector-sizes- which with contemporary devices (such as SSD) ends up with
>>> more processing. The SSDs are comfortable handling 128kB or higher I/Os
>>> in one go.
>>>
>>>
>>> G). DIF/DIX. This a protocol to carry extra ‘checksum’ information
>>> for each I/O. The I/O can be a sector size, page-size or an I/O size
>>> (most popular are 1MB). The DIF/DIX needs 8 bytes of information for
>>> each I/O. It would be worth considering putting/reserving that amount of
>>> space in each request/response. Also putting in extra flags for future
>>> extensions would be worth it - however the author is not aware of any
>>> right now.
>>>
>>>
>>> H). Separate response/request. Potentially even multi-queue per-VCPU
>>> queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was
>>> flawed. There is no similar concept in the storage world were the
>>> operating system can put a food down and say: “everything before this
>>> has to be on the disk.” There are ligther versions of this - called
>>> ‘FUA’ and ‘FLUSH’. Depending on the internal implementation
>>> of the storage they are either ignored or do the right thing. The
>>> filesystems determine the viability of these flags and change writing
>>> tactics depending on this. From a protocol level, this means that the
>>> WRITE/READ/SYNC requests can be intermixed - the storage by itself
>>> determines the order of the operation. The filesystem is the one that
>>> determines whether the WRITE should be with a FLUSH to preserve some form
>>> of atomicity. This means we do not have to preserve an order of operations
>>> - so we can have multiple queues for request and responses. This has
>>> show in the network world to improve performance considerably.
>>>
>>>
>>> I). Wastage of response/request on the same ring. Currently each response
>>> MUST occupy the same amount of space that the request occupies - as the
>>> ring can have both responses and requests. Separating the request and
>>> response ring would remove the wastage.
>>>
>>>
>>> J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring
>>> entries is different if the guest is in 32-bit or 64-bit mode. Cleaning
>>> this up to be the same size would save considerable accounting that the
>>> host has to do (extra memcpy for each response/request).
>>
>> So forth I think most of the problems could be solved by using indirect
>> descriptors, aligning requests to a cache sensible value (to prevent
>> straddling) and using separate rings for requests and responses. This
>> would be my plan for solving this issues:
> 
> That would take care of the bulk of the big issues.
> 
>>
>> 1. Implement LRU mechanism in blkback for the persistent grants list. If
>> we are going to increment the number of in-flight IO we need to be able to
>> keep the list of persistently mapped grants to something sensible, and
>> we are no longer able to persistently map all the grants that could be
>> used. In my proposal that would be 64 * 4 for each grant frame that
> 
> Can you explain the foundations behind the '64' and '4'? I think I know
> the 4 (BLKIF_..SEGMENTS) but I am not sure about '64'.

The 4 comes because in my proposal I'm passing 4 indirect grant frames
filled with blkif_request_indirect_entry entries (so gref in
blkif_request_indirect is now an array).

The 64 comes because I've reduced the size of blkif_request to 32bytes,
so now we can fit 64 blkif_requests in a ring.

Putting it all together:

4 (indirect grant frames) * 256 (entries in each indirect grant frame) *
4096 (PAGE_SIZE) = 4M IO in a single request

64 (32bytes entries in a ring) * 4M (max IO in each entry) = 256M worth
of IO in a 4096bytes ring.

>> contains indirect entries, plus 64 * 4 * 256, for all the possible
>> entries in the indirect frame 64 * 4 + 64 * 4 * 256 = 65792.
> 
> That would cover ~256MB of data per guest. Wow.
>>
>> 2. Implement indirect descriptors.
>>
>> 3. Implement new ring protocol, using two separate rings for requests
>> and replies.
> 
> Don't want to negotiate each of these proposal as a seperate feature? So:
>         seperate-req-and-rsp
>         slim-version-of-req
>         indiret-descriptor
>         multi-page-ring
>         aligment-64-or-other-value

Yes, but I will try to couple them in a way that makes sense. For
example, reducing the size of blkif_request without using indirect
descriptors doesn't make much sense, so I would couple
indirect-descriptor with slim-version-of-req. This leaves us with the
following new extensions:

indirect-descriptors (which includes a slim version of blkif_request)
separate-req-and-rsp

These would be the first two extensions we should try to implement, from my
POV, because I think they are going to deliver quite a performance boost.



* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-18 18:20   ` Konrad Rzeszutek Wilk
  2013-01-19 12:44     ` Roger Pau Monné
@ 2013-01-21 12:37     ` Ian Campbell
  2013-01-22 19:25       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2013-01-21 12:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Fri, 2013-01-18 at 18:20 +0000, Konrad Rzeszutek Wilk wrote:
> 
> > > E). The network stack has showed that going in a polling mode does improve
> > > performance. The current mechanism of kicking the guest and or block
> > > backend is not always clear.  [TODO: Konrad to explain it in details]
> 
> Oh, I never did explain this - but I think the patches that Daniel came
> up with actually fix a part of it. They make the kick-the-other guest
> only happen when the backend has processed all of the requests and
> cannot find anything else to do. Previously it was more of 'done one
> request, lets kick the backend.'.

blkback uses RING_PUSH_RESPONSES_AND_CHECK_NOTIFY so doesn't it get some
amount of evtchn mitigation for free?

> But going forward, this 'kick-the-other-guest' could be further modulated.
> If we have a full ring and the frontend keeps on adding entries we (backend)
> should never get an interrupt. As of matter of fact the frontend should be just
> polling all the time and process them as fast as possible.

I think the existing req_event/rsp_event ring fields should enable this
already, assuming the front- and back-ends are using them right.

Ian.


* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-21 12:37     ` Ian Campbell
@ 2013-01-22 19:25       ` Konrad Rzeszutek Wilk
  2013-01-23  9:24         ` Ian Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-22 19:25 UTC (permalink / raw)
  To: Ian Campbell
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Mon, Jan 21, 2013 at 12:37:18PM +0000, Ian Campbell wrote:
> On Fri, 2013-01-18 at 18:20 +0000, Konrad Rzeszutek Wilk wrote:
> > 
> > > > E). The network stack has showed that going in a polling mode does improve
> > > > performance. The current mechanism of kicking the guest and or block
> > > > backend is not always clear.  [TODO: Konrad to explain it in details]
> > 
> > Oh, I never did explain this - but I think the patches that Daniel came
> > up with actually fix a part of it. They make the kick-the-other guest
> > only happen when the backend has processed all of the requests and
> > cannot find anything else to do. Previously it was more of 'done one
> > request, lets kick the backend.'.
> 
> blkback uses RING_PUSH_RESPONSES_AND_CHECK_NOTIFY so doesn't it get some
> amount of evthcn mitigation for free?

So there are two paths here - a) the kick the frontend gives the backend,
and b) the kick the backend gives the frontend.

The a) case is fairly straightforward. We process all of the ring entries, and
every time we have finished with a request we re-read the producer. So if the
frontend keeps us busy we will keep on processing.

The b) case is the one that is trigger happy. Every time a request is completed (so
say 44kB of data has finally been read/written) we kick the frontend. In the
networking world there are mechanisms to configure the hardware so that it only
kicks the OS (so the frontend in our case) when it has processed 8, 16, or 64 packets
(or some other value). Depending on the latency this can be bad or good. If the
backend is using a very slow disk we would probably want the frontend to be
kicked every time a response has been completed.

But if we have a very fast SSD, we might want to batch those kicks up so
that the frontend does not get kicked that often. I don't know what the impact
of this 'one request = one kick' policy is, but we could make it a bit more
adaptive - so that it starts scaling down the kicks as it gets more responses,
and if there are fewer responses it notches the number of kicks back up.
I think this is called adaptive interrupt moderation (or interrupt coalescing).
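
As a strawman, the 'adaptive' part could be as simple as this (the struct
and field names are completely made up, just to illustrate the idea):

#define MAX_KICK_BATCH 64

struct kick_coalesce {
    unsigned int threshold;      /* responses accumulated per kick, 1..64 */
};

static void update_kick_threshold(struct kick_coalesce *c,
                                  unsigned int completed)
{
    if (completed >= c->threshold && c->threshold < MAX_KICK_BATCH)
        c->threshold <<= 1;      /* lots of traffic: batch more per kick  */
    else if (completed < c->threshold / 2 && c->threshold > 1)
        c->threshold >>= 1;      /* traffic dropping: kick more eagerly   */
}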

> 
> > But going forward, this 'kick-the-other-guest' could be further modulated.
> > If we have a full ring and the frontend keeps on adding entries we (backend)
> > should never get an interrupt. As of matter of fact the frontend should be just
> > polling all the time and process them as fast as possible.
> 
> I think the existing req_event/rsp_event ring fields should enable this
> already, assuming the front- and back-ends are using them right.
> 
> Ian.
> 


* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-19 12:44     ` Roger Pau Monné
@ 2013-01-22 19:46       ` Konrad Rzeszutek Wilk
  2013-01-23  9:53         ` Ian Campbell
  2013-02-20 21:31         ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-22 19:46 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew

On Sat, Jan 19, 2013 at 01:44:53PM +0100, Roger Pau Monné wrote:
> On 18/01/13 19:20, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jan 18, 2013 at 05:00:10PM +0100, Roger Pau Monné wrote:
> >> On 18/12/12 15:31, Konrad Rzeszutek Wilk wrote:
> >>> Hey,
> >>>
> >>> I am including some folks that are not always on Xen-devel to see if they have
> >>> some extra ideas or can correct my misunderstandings.
> >>>
> >>> This is very much RFC - so there is bound to be some bugs.
> >>> The original is here
> >>> https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit
> >>> in case one wishes to modify/provide comment on that one.
> >>
> >> Hello Konrad,
> >>
> >> Thanks for the problem description and possible solutions, I'm
> >> going to take on this (provided that you are OK with it).
> > 
> > The whole thing!? You are quite ambitious here!
> 
> I will try to fix as much as possible, one thing at a time, don't expect
> a series from one day to another fixing all the issues :)

Heh.
> 
> >>
> >>>
> >>> There are outstanding issues we have now with the block protocol:
> >>> Note: I am assuming 64-bit guest/host - as the size’s of the structures
> >>> change on 32-bit. I am also attaching the header for the blkif ring
> >>> as of today.
> >>>
> >>> A) Segment size is limited to 11 pages. It means we can at most squeeze
> >>> in 44kB per request. The ring can hold 32 (next power of two below 36)
> >>> requests, meaning we can do 1.4M of outstanding requests.
> >>
> >> I've been looking at the different options here and I think the most
> >> suitable one would be using a mechanism similar to VirtIO "indirect
> >> descriptors", it is the one that allows for more in-flight requests
> >> while being quite simple to implement.
> > 
> > <nods> I think you are right - plus it looks to be similar to how the more
> > modern hardware does requests (here I am thinking of NVMe and its command
> > descriptor - where it can point to a page full of DMA addresses + LBAs and say
> > "Go READ").
> > 
> >>
> >>> B). Producer and consumer index is on the same cache line. In present hardware
> >>> that means the reader and writer will compete for the same cacheline causing a
> >>> ping-pong between sockets.
> >>>
> >>> C). The requests and responses are on the same ring. This again causes the
> >>> ping-pong between sockets as the ownership of the cache line will shift between
> >>> sockets.
> >>>
> >>> D). Cache alignment. Currently the protocol is 16-bit aligned. This is awkward
> >>> as the request and responses sometimes fit within a cacheline or sometimes
> >>> straddle them.
> >>>
> >>> E). Interrupt mitigation. We are currently doing a kick whenever we are
> >>> done “processing” the ring. There are better ways to do this - and
> >>> we could use existing network interrupt mitigation techniques to make the
> >>> code poll when there is a lot of data.
> >>>
> >>> F). Latency. The processing of the request limits us to only do 44kB - which means
> >>> that a 1MB chunk of data - which on contemporary devices would only use I/O request
> >>> - would be split up in multiple ‘requests’ inadvertently delaying the processing of
> >>> said block.
> >>>
> >>> G) Future extensions. DIF/DIX for integrity. There might
> >>> be other in the future and it would be good to leave space for extra
> >>> flags TBD.
> >>>
> >>> H). Separate the response and request rings. The current
> >>> implementation has one thread for one block ring. There is no reason why
> >>> there could not be two threads - one for responses and one for requests -
> >>> and especially if they are scheduled on different CPUs. Furthermore this
> >>> could also be split in multi-queues - so two queues (response and request)
> >>> on each vCPU.
> >>>
> >>> I). We waste a lot of space on the ring - as we use the
> >>> ring for both requests and responses. The response structure needs to
> >>> occupy the same amount of space as the request structure (112 bytes). If
> >>> the request structure is expanded to be able to fit more segments (say
> >>> the ‘struct blkif_sring_entry is expanded to ~1500 bytes) that still
> >>> requires us to have a matching size response structure. We do not need
> >>> to use that much space for one response. Having a separate response ring
> >>> would simplify the structures.
> >>>
> >>> J). 32 bit vs 64 bit. Right now the size
> >>> of the request structure is 112 bytes under 64-bit guest and 102 bytes
> >>> under 32-bit guest. It is confusing and furthermore requires the host
> >>> to do extra accounting and processing.
> >>>
> >>> The crude drawing displays memory that the ring occupies in offset of
> >>> 64 bytes (cache line). Of course future CPUs could have different cache
> >>> lines (say 32 bytes?)- which would skew this drawing. A 32-bit ring is
> >>> a bit different as the ‘struct blkif_sring_entry’ is of 102 bytes.
> >>>
> >>>
> >>> The A) has two solutions to this (look at
> >>> http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
> >>> details). One proposed by Justin from Spectralogic has to negotiate
> >>> the segment size. This means that the ‘struct blkif_sring_entry’
> >>> is now a variable size. It can expand from 112 bytes (cover 11 pages of
> >>> data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple
> >>> extension by just making the array in the request expand from 11 to a
> >>> variable size negotiated.
> >>>
> >>>
> >>> The math is as follow.
> >>>
> >>>
> >>>         struct blkif_request_segment {
> >>>                 uint32 grant;                         // 4 bytes uint8_t
> >>>                 first_sect, last_sect;// 1, 1 = 6 bytes
> >>>         }
> >>> (6 bytes for each segment) - the above structure is in an array of size
> >>> 11 in the request. The ‘struct blkif_sring_entry’ is 112 bytes. The
> >>> change is to expand the array - in this example we would tack on 245 extra
> >>> ‘struct blkif_request_segment’ - 245*6 + 112 = 1582. If we were to
> >>> use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages).
> >>>
> >>>
> >>> The other solution (from Intel - Ronghui) was to create one extra
> >>> ring that only has the ‘struct blkif_request_segment’ in them. The
> >>> ‘struct blkif_request’ would be changed to have an index in said
> >>> ‘segment ring’. There is only one segment ring. This means that the
> >>> size of the initial ring is still the same. The requests would point
> >>> to the segment and enumerate out how many of the indexes it wants to
> >>> use. The limit is of course the size of the segment. If one assumes a
> >>> one-page segment this means we can in one request cover ~4MB. The math
> >>> is as follow:
> >>>
> >>>
> >>> First request uses the half of the segment ring - so index 0 up
> >>> to 341 (out of 682). Each entry in the segment ring is a ‘struct
> >>> blkif_request_segment’ so it occupies 6 bytes. The other requests on
> >>> the ring (so there are 35 left) can use either the remaining 341 indexes
> >>> of the sgement ring or use the old style request. The old style request
> >>> can address use up to 44kB. For example:
> >>>
> >>>
> >>>  sring[0]->[uses 0->341 indexes in the segment ring] = 342*4096 = 1400832
> >>>  sring[1]->[use sthe old style request] = 11*4096 = 45056
> >>>  sring[2]->[uses 342->682 indexes in the segment ring] = 1392640
> >>>  sring[3..32] -> [uses the old style request] = 29*4096*11 = 1306624
> >>>
> >>>
> >>> Total: 4145152 bytes Naturally this could be extended to have a bigger
> >>> segment ring to cover more.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> The problem with this extension is that we use 6 bytes and end up
> >>> straddling a cache line. Using 8 bytes will fix the cache straddling. This
> >>> would mean fitting only 512 segments per page.
> >>>
> >>>
> >>> There is yet another mechanism that could be employed  - and it borrows
> >>> from VirtIO protocol. And that is the ‘indirect descriptors’. This
> >>> very similar to what Intel suggests, but with a twist.
> >>>
> >>>
> >>> We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT), and the ‘struct
> >>> blkif_sring’ (each entry can be up to 112 bytes if needed - so the
> >>> old style request would fit). It would look like:
> >>>
> >>>
> >>> /* so 64 bytes under 64-bit. If necessary, the array (seg) can be
> >>> expanded to fit 11 segments as the old style request did */ struct
> >>> blkif_request_indirect {
> >>>         uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE    */
> >>> // 1 blkif_vdev_t   handle;       /* only for read/write requests         */ // 2
> >>> #ifdef CONFIG_X86_64
> >>>         uint32_t       _pad1;             /* offsetof(blkif_request,u.rw.id) == 8 */ // 2
> >>> #endif
> >>>         uint64_t       id;           /* private guest value, echoed in resp  */
> >>>         grant_ref_t    gref;        /* reference to indirect buffer frame  if used*/
> >>>             struct blkif_request_segment_aligned seg[4]; // each is 8 bytes
> >>>
> >>> } __attribute__((__packed__));
> >>>
> >>>
> >>> struct blkif_request {
> >>>         uint8_t        operation;    /* BLKIF_OP_???  */
> >>>         union {
> >>>                 struct blkif_request_rw rw;
> >>>                 struct blkif_request_indirect
> >>>                 indirect; … other ..
> >>>         } u;
> >>> } __attribute__((__packed__));
> >>>
> >>
> >> Do you think it will be suitable to change the size of blkif_request, I
> >> was thinking that if we introduce indirect request we could reduce the
> >> maximum number of segments per request to 2, so blkif_request is
> >> reduced to 32bytes and we can fit 126 requests (64 due to rounding)
> >> in a single ring.
> > 
> > Altering it to be 32-bytes means we are still crossing two cachelines.
> 
> Are you sure? using 32bytes on a 64bytes cache line means we have two
> requests in the same cache line, but it's impossible that a request uses
> two different cache lines (which I guess it's the main problem):

Correct. In your case we have one cacheline shared by two requests.

This means we will have to be extra careful in the backend (and frontend)
to call 'wmb()' after we have filled two entries. Otherwise we end up doing
something like this:

	fill out the req[0]
	a) wmb()   => writes the full 64-bytes cacheline out.
	fill out the req[1]
	b) wmb()	<== has to throw out the cacheline and re-write the new
		    one.

With a 64-bytes one we do not have to worry about that.
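
In other words the frontend would want to batch the fill and only push
once per pair - roughly like this, a sketch using the standard ring
macros (the push does the single wmb() for us):

int notify;
struct blkif_request *r0, *r1;

r0 = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
r1 = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt + 1);
/* ... fill *r0 and *r1 ... */
info->ring.req_prod_pvt += 2;

RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify); /* one barrier for the pair */
if (notify)
    notify_remote_via_irq(info->irq);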

> 
>                 64bytes
>     <------------------------------>
>     +------------------------------+
>     | ring header + pad            |
>     +------------------------------+
>        32bytes
>     <--------------->
>     +---------------+--------------+
>     | req[0]        | req[1]       |
>     +---------------|--------------+
>     | req[2]        | req[3]       |
>     +---------------+--------------+
> 
> > 
> > This is where I thought of putting 'struct blkif_request_rw' on a diet
> > and just making it have 4 of the grants - in case the request is
> > just 16K. And then the size of the structure (if I got my math right)
> > would be 64-bytes.
> 
> Using 64bytes still leaves us with only 32 requests on the ring (unless
> we change the ring protocol to support a ring size different than a
> power of two).

More I believe. The first 64 bytes are for the ring header + pad. Each
request/response is 64 bytes, so 4032 / 64 = 63.

So 63 requests. We should also support a bigger ring size if the
backend has, say, an ultra-fast SSD - so that we can fill an incredible amount
of requests and get responses back almost as fast as we fill them.
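
The arithmetic for a bigger ring is simple enough (a sketch, assuming
64-byte slots, a 64-byte header and a multi-page-ring negotiation):

nr_entries = (nr_pages * 4096 - 64) / 64;  /* 1 page -> 63, 2 -> 127, 4 -> 255 */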

The neat thing about the split request/response is that the response
ring can be of a smaller size. But then we run into the cache issue
I mentioned earlier.

> 
> > Naturally this means we need to negotiate a 'feature-request-size'
> > where v1 says that the request is of 64-bytes length. v2 might
> > be 128bytes or whatever else we would need to fit within a cacheline
> > in future CPUs. We could pad the 'struct request_v1' with 64-bytes
> > in case the cacheline is 128bytes and our request is 64-bytes.
> 
> I would have to benchmark if having two requests on the same cache line
> or only a single one has any kind of impact on performance, as I
> understand it the only problem of this approach might happen when
> blkback is writing a reply and blkfront is also writing a request in the
> same cache line (which will be solved by using two different rings, one
> for replies and one for request).

Right. The issue with sharing two requests on the same cache-line is
a different beast.

> 
> So I think a compromise solution is to use 32bytes requests, that would
> always be cache aligned and to solve the ping-pong problem between
> blkfront/blkback writing on the same cache line two rings should be used.

For that problem of ping-pong that is certainly the case.

> 
> > 
> >>
> >> Everything that needs more than two segments will switch to using the
> >> new indirect operation. Also, to make use of the extra space in
> >> blkif_request_indirect I would remove the extra segments and instead
> >> pass more grant references to frames containing
> >> blkif_request_indirect_entry entries.
> > 
> > Right. The complications arise when we need to support this new format
> > and the old format and have to switch over when we migrate to old
> > backends.
> > 
> >>
> >> #define BLKIF_MAX_SEGMENTS_PER_REQUEST 2
> >> #define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 4
> >>
> >> struct blkif_request_segment {
> >>     grant_ref_t gref;        /* reference to I/O buffer frame        */
> >>     /* @first_sect: first sector in frame to transfer (inclusive).   */
> >>     /* @last_sect: last sector in frame to transfer (inclusive).     */
> >>     uint8_t     first_sect, last_sect;
> >> } __attribute__((__packed__));
> >>
> >> struct blkif_request_indirect_entry {
> >>     blkif_sector_t  sector_number;
> >>     struct          blkif_request_segment seg;
> >>     uint8_t         _pad[2];
> >> } __attribute__((__packed__));
> >>
> >> struct blkif_request_rw {
> >>     uint8_t        nr_segments;  /* number of segments                   */
> >>     blkif_vdev_t   handle;       /* only for read/write requests         */
> >>     uint64_t       id;           /* private guest value, echoed in resp  */
> >>     blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
> >>     struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> } __attribute__((__packed__));
> >>
> >> struct blkif_request_indirect {
> >>     uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE)   */
> >>     uint16_t       nr_indirect;  /* number of indirect entries in gref frames */
> >>     blkif_vdev_t   handle;       /* only for read/write requests         */
> >>     uint64_t       id;           /* private guest value, echoed in resp  */
> >>     /* reference to indirect buffer frame */
> >>     grant_ref_t    gref[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
> >> } __attribute__((__packed__));
> >>
> >> struct blkif_request_discard {
> >>     uint8_t        flag;         /* BLKIF_DISCARD_SECURE or zero.        */
> >> #define BLKIF_DISCARD_SECURE (1<<0)  /* ignored if discard-secure=0          */
> >>     uint64_t       id;           /* private guest value, echoed in resp  */
> >>     blkif_sector_t sector_number;
> >>     uint64_t       nr_sectors;
> >> } __attribute__((__packed__));
> >>
> >> struct blkif_request {
> >>     uint8_t        operation;    /* BLKIF_OP_???                         */
> >>     union {
> >>         struct blkif_request_rw rw;
> >>         struct blkif_request_indirect indirect;
> >>         struct blkif_request_discard discard;
> >>     } u;
> >> } __attribute__((__packed__));
> >>
> >> I've removed all the extra paddings, because this protocol is no longer
> >> compatible with the previous one (we modify
> >> BLKIF_MAX_SEGMENTS_PER_REQUEST), and packed the structures. I'm
> >> wondering if using __attribute__ in a public header is allowed, since
> >> this is a GNU extension (AFAIK supported by gcc and clang at least).
> > 
> > I don't know. But I think we can bypass the need for it if we calculate
> > the size of each entry right and insert padding whenever neccessary.
> 
> Yes, I would like to avoid having different sizes depending on the
> architecture.

Good.
> 
> > 
> >>
> >> The main motivation behind the modification of the size of the
> >> request is because I've realized that in your proposal, you where
> >> adding 4 extra segments to the indirect request, and I was wondering
> >> if I could pack more grant frames containing indirect entries in a
> >> single request. By using the old size of blkf_request we will be able to
> >> send up to 9 grant frames filled with indirect entries, which gives us
> >> 4096 * 256 * 9 = 9MB of outstanding IO in each request. I think that
> >> this is too high, and will introduce a lot of latency to the protocol.
> >> That's the main reason I've decided to reduce the size of blkif_request
> >> in order to be able to fit more of them in a ring, instead of making
> >> each of the requests bigger.
> > 
> > OK. That is OK as long as we make sure to _not_ have two requests
> > straddling the cache-line. Or maybe that is not a problem.
> > 
> > But the other nice benefit of your idea - where it is only 32-bytes
> > - is that we have now extra 32-bytes where we can insert such
> > things as DIF/DIX or "other" new features. And that way have the
> > request be of size 64 bytes.
> 
> So you prefer to keep the ring size to 32, instead of increasing it to
> 64, and use the remaining 32bytes in the request to store request
> related data?

Correct - and increase the ring size to two pages or more to compensate
for that.
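
A rough sketch of how the frontend side of such a negotiation could look.
Hedged: 'feature-request-size' and 'feature-ring-pages' are just names
floated in this discussion, not an existing protocol, and the fallback
values are illustrative:

#include <xen/xenbus.h>

/* Illustrative only - the keys and defaults are not part of any
 * existing blkif protocol. */
static void negotiate_ring_params(struct xenbus_device *dev,
                                  unsigned int *req_size,
                                  unsigned int *ring_pages)
{
        if (xenbus_scanf(XBT_NIL, dev->otherend,
                         "feature-request-size", "%u", req_size) != 1)
                *req_size = 112;        /* classic request size */

        if (xenbus_scanf(XBT_NIL, dev->otherend,
                         "feature-ring-pages", "%u", ring_pages) != 1)
                *ring_pages = 1;        /* classic single-page ring */
}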
> 
> > 
> >>
> >> Using this implementation each request has a size of 32bytes (to
> >> prevent cache straddle), so we could fit 126 requests in a ring, but
> >> due to the rounding, we end up with 64. Also each request can contain
> >> 4 grant pages, each one filled with 256 blkif_request_indirect_entry.
> >> The math for the maximum data in a request is: 4096 * 256 * 4 = 4MB,
> >> and in the whole ring 4 * 64 = 256MB worth of outstanding IO.
> >>
> > 
> > Which has a nice power of two ring to it. (Ah the puns!)
> > 
> > I like the idea of putting the request on a diet - but too much
> > could cause us to miss the opportunity to insert other flags on it.
> > If I recall correctly, the DIF/DIX only need 8 bytes of data.
> > If we make the assumption that:
> >         I/O request = one ring entry
> 
> So we only need to reserve 8bytes for each DIF/IDX, even if the request
> contains a variable number of data? (I mean, block requests can at a
> minimum contain 4096bytes, or much more)

I need to double-check with Martin (CC-ed here). But my recollection
is that it is just attached to the 'bio'. So whether the BIO is 4K or 1MB -
it would only have one DIF/DIX data type.

Hmm, but then we operate on the 'struct request' so that might not
be the case..
> 
> > and the  "one ring entry" can use the the '4' grants if we just have a
> > 16KB I/O request, but if it is more than that - we use the indirect page
> 
> Well, on my purpose I've limited the number of segments of a "rw"
> requests to 2, so it's only 8K, anything bigger has to use indirect
> descriptors, which can fit 4M of data (because I'm passing 4 grant
> frames full of "blkif_request_indirect_entry" entries).

<nods>
> 
> > and can stuff 1MB of data in there.
> > The extra 32-bytes of space for such things as 'DIF/DIX'. This also
> > means we could unify the 'struct request' with the 'discard' operation
> > and it could utilize the 32-bytes of extra unused payload data.
> > 
> >>>
> >>>
> >>> The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
> >>> etc operation would now be in indirect.op. The indirect.gref points to
> >>> a page that is filled with:
> >>>
> >>>
> >>> struct blkif_request_indirect_entry {
> >>>         blkif_sector_t sector_number;
> >>>         struct blkif_request_segment seg;
> >>> } __attribute__((__packed__));
> >>> //16 bytes, so we can fit in a page 256 of these structures.
> >>>
> >>>
> >>> This means that with the existing 36 slots in the ring (single page)
> >>> we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
> >>> ~= 32M. If we don’t want to use indirect descriptor we can still use
> >>> up to 4 pages of the request (as it has enough space to contain four
> >>> segments and the structure will still be cache-aligned).
> >>>
> >>>
> >>>
> >>>
> >>> B). Both the producer (req_* and rsp_*) values are in the same
> >>> cache-line. This means that we end up with the same cacheline being
> >>> modified by two different guests. Depending on the architecture and
> >>> placement of the guest this could be bad - as each logical CPU would
> >>> try to write and read from the same cache-line. A mechanism where
> >>> the req_* and rsp_ values are separated and on a different cache line
> >>> could be used. The value of the correct cache-line and alignment could
> >>> be negotiated via XenBus - in case future technologies start using 128
> >>> bytes for cache or such. Or the the producer and consumer indexes are in
> >>> separate rings. Meaning we have a ‘request ring’ and a ‘response
> >>> ring’ - and only the ‘req_prod’, ‘req_event’ are modified in
> >>> the ‘request ring’. The opposite (resp_*) are only modified in the
> >>> ‘response ring’.
> >>
> >> Using different rings for requests and responses sounds like a good idea,
> >> this will reduce the number of variables on the ring, leaving only
> >> "produced" and "event" (we no longer need rsp_ and req_).
> > 
> > <nods>
> >>
> >> Since we will no longer have two different domains writing to the same
> >> cache line I guess this will get rid of the cache ping-pong problem.
> > 
> > <nods> That is my thinking.
> >>
> >> Also while there we could try to use all the page, instead of rounding
> >> down to the closer power of two, this will allow us to go from 64
> >> requests to 126 (if using the indirect descriptors proposal explained
> >> above).
> > 
> > Right.
> > Lets call this 'feature-seperate-req-and-rsp' ?
> 
> Yes, and blkfront will also need to pass two different grant frames at
> init, because we need to map two rings.

<nods>
> 
> > 
> >>
> >>> C). Similar to B) problem but with a bigger payload. Each
> >>> ‘blkif_sring_entry’ occupies 112 bytes which does not lend itself
> >>> to a nice cache line size. If the indirect descriptors are to be used
> >>> for everything we could ‘slim-down’ the blkif_request/response to
> >>> be up-to 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST
> >>> to 5 as that would slim the largest of the structures to 64-bytes.
> >>> Naturally this means negotiating a new size of the structure via XenBus.
> > 
> > Here I am using 5, and I think you were using 4 - and you mentioned you
> > got it down to 32-bytes. I should double-check the 'sizeof' then to
> > see what it would be like.
> 
> No, I was limiting it to 2 segments in the "rw" request, the 4 comes
> because I'm passing 4 grant frames in the "blkif_request_indirect",
> instead of just one (see above, "gref" is now an array in my proposal).

<nods>
> 
> > 
> >>>
> >>>
> >>> D). The first picture shows the problem. We now aligning everything
> >>> on the wrong cachelines. Worst in ⅓ of the cases we straddle
> >>> three cache-lines. We could negotiate a proper alignment for each
> >>> request/response structure.
> >>>
> >>>
> >>> E). The network stack has showed that going in a polling mode does improve
> >>> performance. The current mechanism of kicking the guest and or block
> >>> backend is not always clear.  [TODO: Konrad to explain it in details]
> > 
> > Oh, I never did explain this - but I think the patches that Daniel came
> > up with actually fix a part of it. They make the kick-the-other guest
> > only happen when the backend has processed all of the requests and
> > cannot find anything else to do. Previously it was more of 'done one
> > request, lets kick the backend.'.
> 
> Yes, that's what I'm seeing in the current code.
> 
> > But going forward, this 'kick-the-other-guest' could be further modulated.
> > If we have a full ring and the frontend keeps on adding entries we (backend)
> > should never get an interrupt. As of matter of fact the frontend should be just
> > polling all the time and process them as fast as possible.
> 
> Yes, we could switch to polling in both ends and only "kick" when the
> ring is full (which should not happen usually), this will probably
> reduce CPU utilization, but I doubt it's going to increase the throughput.

It might help on the frontend as it won't be interrupted so often. But then
with indirect requests the number of kicks per I/O is going to drop as well.
Sounds like this is something that can be done in the future at a lower
priority.

> 
> > Or we could add tweaks in the backend to adopt an 'adaptive interrupt
> > coelescing' mechanism. This way we only kick the frontend if a certain
> > threshold of I/Os have been processed or if we are slowing down on the
> > threshold. Again, the idea is that - if we have the rings full we should
> > not send 'kicks' at all and just keep on processing.
> > 
> > Note: I *believe* that the backend does this - keeps on processing and
> > won't kick if the ring is full. But I am not 100% sure on the frontend side.
> 
> Changing this is not really difficult (in my opinion, but I've might be
> wrong), so I think we could try several approaches and see which one
> gives better performance (or reduces CPU utilization).
> 
> Anyway, I would focus now on the indirect descriptors and separate the
> rings, and leave this work for afterwards (because what we could
> probably see now might not be true once the other improvements are done).

<nods>
> 
> >>>
> >>>
> >>> F). The current block protocol for big I/Os that the backend devices can
> >>> handle ends up doing extra work by splitting the I/O in smaller chunks,
> >>> and then reassembling them. With the solutions outlined in A) this can
> >>> be fixed. This is easily seen with 1MB I/Os. Since each request can
> >>> only handle 44kB that means we have to split a 1MB I/O in 24 requests
> >>> (23 * 4096 * 11 = 1081344). Then the backend ends up sending them in
> >>> sector-sizes- which with contemporary devices (such as SSD) ends up with
> >>> more processing. The SSDs are comfortable handling 128kB or higher I/Os
> >>> in one go.
> >>>
> >>>
> >>> G). DIF/DIX. This a protocol to carry extra ‘checksum’ information
> >>> for each I/O. The I/O can be a sector size, page-size or an I/O size
> >>> (most popular are 1MB). The DIF/DIX needs 8 bytes of information for
> >>> each I/O. It would be worth considering putting/reserving that amount of
> >>> space in each request/response. Also putting in extra flags for future
> >>> extensions would be worth it - however the author is not aware of any
> >>> right now.
> >>>
> >>>
> >>> H). Separate response/request. Potentially even multi-queue per-VCPU
> >>> queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was
> >>> flawed. There is no similar concept in the storage world were the
> >>> operating system can put a food down and say: “everything before this
> >>> has to be on the disk.” There are ligther versions of this - called
> >>> ‘FUA’ and ‘FLUSH’. Depending on the internal implementation
> >>> of the storage they are either ignored or do the right thing. The
> >>> filesystems determine the viability of these flags and change writing
> >>> tactics depending on this. From a protocol level, this means that the
> >>> WRITE/READ/SYNC requests can be intermixed - the storage by itself
> >>> determines the order of the operation. The filesystem is the one that
> >>> determines whether the WRITE should be with a FLUSH to preserve some form
> >>> of atomicity. This means we do not have to preserve an order of operations
> >>> - so we can have multiple queues for request and responses. This has
> >>> show in the network world to improve performance considerably.
> >>>
> >>>
> >>> I). Wastage of response/request on the same ring. Currently each response
> >>> MUST occupy the same amount of space that the request occupies - as the
> >>> ring can have both responses and requests. Separating the request and
> >>> response ring would remove the wastage.
> >>>
> >>>
> >>> J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring
> >>> entries is different if the guest is in 32-bit or 64-bit mode. Cleaning
> >>> this up to be the same size would save considerable accounting that the
> >>> host has to do (extra memcpy for each response/request).
> >>
> >> So forth I think most of the problems could be solved by using indirect
> >> descriptors, aligning requests to a cache sensible value (to prevent
> >> straddling) and using separate rings for requests and responses. This
> >> would be my plan for solving this issues:
> > 
> > That would take care of the bulk of the big issues.
> > 
> >>
> >> 1. Implement LRU mechanism in blkback for the persistent grants list. If
> >> we are going to increment the number of in-flight IO we need to be able to
> >> keep the list of persistently mapped grants to something sensible, and
> >> we are no longer able to persistently map all the grants that could be
> >> used. In my proposal that would be 64 * 4 for each grant frame that
> > 
> > Can you explain the foundations behind the '64' and '4'? I think I know
> > the 4 (BLKIF_..SEGMENTS) but I am not sure about '64'.
> 
> The 4 comes because in my proposal I'm passing 4 indirect grant frames
> filled with blkif_request_segment (so gref in blkif_request_indirect is
> now an array).
> 
> The 64 comes because I've reduced the size of blkif_request to 32bytes,
> so now we can fit 64 blkif_requests in a ring.

You could fit more of them with a 32-byte request. You have 64 bytes for
the ring header and then 126 32-byte slots free.

> 
> Putting it all together:
> 
> 4 (indirect grant frames) * 256 (entries in each indirect grant frame) *
> 4096 (PAGE_SIZE) = 4M IO in a single request
> 
> 64 (32bytes entries in a ring) * 4M (max IO in each entry) = 256M worth
> of IO in a 4096bytes ring.

s/64/126/ I believe?

> 
> >> contains indirect entries, plus 64 * 4 * 256, for all the possible
> >> entries in the indirect frame 64 * 4 + 64 * 4 * 256 = 65792.
> > 
> > That would cover ~256MB of data per guest. Wow.
> >>
> >> 2. Implement indirect descriptors.
> >>
> >> 3. Implement new ring protocol, using two separate rings for requests
> >> and replies.
> > 
> > Don't want to negotiate each of these proposal as a seperate feature? So:
> >         seperate-req-and-rsp
> >         slim-version-of-req
> >         indiret-descriptor
> >         multi-page-ring
> >         aligment-64-or-other-value
> 
> Yes, but I will try to couple them in a way that makes sense, for
> example reducing the size of blkif_request without using indirect
> descriptors doesn't make much sense, so I would couple
> indirect-descriptor with slim-version-of-req, this will leave us with
> the following new extensions:
> 
> indirect-descriptors (which includes a slim version of blkif_request)
> separate-req-and-rsp
> 
> This will be the two first extensions we should try to implement from my
> POV, because I think they are going to deliver quite a performance boost.

<nods>
> 
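
Going back to item 1 of the plan quoted above (the LRU for the persistently
mapped grants), here is a minimal userspace sketch of the idea. The names,
the limit and the helpers are made up for illustration - this is not
blkback code:

#define MAX_PERSISTENT_GRANTS 1056      /* illustrative limit, tunable */

struct pgrant {
        unsigned int gref;
        struct pgrant *prev, *next;     /* intrusive LRU list */
};

static struct pgrant *lru_head, *lru_tail;      /* head == most recent */

static void lru_unlink(struct pgrant *g)
{
        if (g->prev) g->prev->next = g->next; else lru_head = g->next;
        if (g->next) g->next->prev = g->prev; else lru_tail = g->prev;
}

/* Add a freshly mapped grant as the most recently used entry. */
static void lru_insert(struct pgrant *g)
{
        g->prev = NULL;
        g->next = lru_head;
        if (lru_head) lru_head->prev = g;
        lru_head = g;
        if (!lru_tail) lru_tail = g;
}

/* Mark an already listed grant as used again. */
static void lru_touch(struct pgrant *g)
{
        lru_unlink(g);
        lru_insert(g);
}

/* Reclaim the least recently used grant once the count goes over the
 * limit; the caller unmaps victim->gref. */
static struct pgrant *lru_evict(void)
{
        struct pgrant *victim = lru_tail;

        if (victim)
                lru_unlink(victim);
        return victim;
}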

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-22 19:25       ` Konrad Rzeszutek Wilk
@ 2013-01-23  9:24         ` Ian Campbell
  2013-01-23 15:03           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2013-01-23  9:24 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Tue, 2013-01-22 at 19:25 +0000, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 21, 2013 at 12:37:18PM +0000, Ian Campbell wrote:
> > On Fri, 2013-01-18 at 18:20 +0000, Konrad Rzeszutek Wilk wrote:
> > > 
> > > > > E). The network stack has showed that going in a polling mode does improve
> > > > > performance. The current mechanism of kicking the guest and or block
> > > > > backend is not always clear.  [TODO: Konrad to explain it in details]
> > > 
> > > Oh, I never did explain this - but I think the patches that Daniel came
> > > up with actually fix a part of it. They make the kick-the-other guest
> > > only happen when the backend has processed all of the requests and
> > > cannot find anything else to do. Previously it was more of 'done one
> > > request, lets kick the backend.'.
> > 
> > blkback uses RING_PUSH_RESPONSES_AND_CHECK_NOTIFY so doesn't it get some
> > amount of evthcn mitigation for free?
> 
> So there are two paths here - the kick from a) frontend and the kick b) backend
> gives the frontend.
> 
> The a) case is fairly straighforward. We process all of the rings we and everytime
> we have finished with a request we re-read the producer. So if the frontend keeps
> us bussy we will keep on processing.
> 
> The b) case is the one that is trigger happy. Every time a request is completed (so
> say 44kB of data has finally been read/written) we kick the frontend.
>  In the networking world there are mechanism to modify the hardware were it would
> kick the OS (so frontend in our case) when it has processed 8, 16, or 64 packets
> (or some other value). Depending on the latency this can be bad or good. If the
> backend is using a very slow disk we would probably want the frontend to be
> kicked every time a response has been completed.

Perhaps all that is needed is to have the f.e. set rsp_event to
min(rsp_cons + <BATCH_SIZE>, rsp_prod (+/- 1?) ) in blkfront's
RING_FINAL_CHECK_FOR_RESPONSES to implement batching, like the comment
in ring.h says:
 *  These macros will set the req_event/rsp_event field to trigger a
 *  notification on the very next message that is enqueued. If you want to
 *  create batches of work (i.e., only receive a notification after several
 *  messages have been enqueued) then you will need to create a customised
 *  version of the FINAL_CHECK macro in your own code, which sets the event
 *  field appropriately.

IOW I think we already have the mechanisms in the protocol to implement
this sort of thing.
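
To make that concrete, a hedged sketch of what such a customised check
could look like in blkfront. The macro name and BLK_RSP_BATCH are invented
here; it mirrors the real RING_FINAL_CHECK_FOR_RESPONSES but asks for a
notification only after a batch, and it ignores RING_IDX wrap-around for
clarity:

#define BLK_RSP_BATCH 8         /* illustrative batch size */

#define BLK_FINAL_CHECK_FOR_RESPONSES_BATCHED(_r, _work_to_do) do {     \
        (_work_to_do) = RING_HAS_UNCONSUMED_RESPONSES(_r);              \
        if (_work_to_do) break;                                         \
        /* ask to be notified only after a batch of responses, but      \
         * never beyond what the backend may actually produce */        \
        (_r)->sring->rsp_event = min((_r)->rsp_cons + BLK_RSP_BATCH,    \
                                     (_r)->sring->rsp_prod + 1);        \
        mb();                                                           \
        (_work_to_do) = RING_HAS_UNCONSUMED_RESPONSES(_r);              \
} while (0)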

> 
> But if we have a very fast SSD, we might want to batch those kicks up so
> that the frontend does not get kicked that often. I don't know the impact
> of these 'one request = one kick' is but we could make this a bit more
> adaptive - so that it starts scalling down the kicks as it has more responses.
> And if there are less responses it notches up the amount of kicks.
> I think this is called adaptive interrupt moderation (Or interrupt coalescing)
> 
> > 
> > > But going forward, this 'kick-the-other-guest' could be further modulated.
> > > If we have a full ring and the frontend keeps on adding entries we (backend)
> > > should never get an interrupt. As of matter of fact the frontend should be just
> > > polling all the time and process them as fast as possible.
> > 
> > I think the existing req_event/rsp_event ring fields should enable this
> > already, assuming the front- and back-ends are using them right.
> > 
> > Ian.
> > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-22 19:46       ` Konrad Rzeszutek Wilk
@ 2013-01-23  9:53         ` Ian Campbell
  2013-01-23 15:21           ` Konrad Rzeszutek Wilk
  2013-02-20 21:31         ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2013-01-23  9:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Tue, 2013-01-22 at 19:46 +0000, Konrad Rzeszutek Wilk wrote:
> Correct. In your case we have one cacheline shared by two requests.
> 
> This means we will have to be extra careful in the backend (And frontend)
> to call 'wmb()' after we have filled two entries. Otherwise we up doing
> something like this:
> 
>         fill out the req[0]
>         a) wmb()   => writes the full 64-bytes cacheline out.
>         fill out the req[1]
>         b) wmb()        <== has to throw out the cacheline and re-write the new
>                     one.
> 
> With a 64-bytes one we do not have to worry about that.

... unless your cachelines are 128- or 256-bytes...

Short answer: the existing ring.h macros take care of this for you.

Long answer: The RING_PUSH_{REQUESTS,RESPONSES} handle this by only
issuing the wmb() over the entire current batch of things, not each time
you queue something. i.e. you can queue requests, modifying req_prod_pvt
as you go and then at the end of a suitable batch you call
RING_PUSH_REQUESTS which updates req_prod and does the barriers so you
end up with
	fill out the req[0]
	req_prod_pvt = 1
	fill out the req[1]
	req_prod_pvt = 2
	fill out the req[2]
	req_prod_pvt = 3
	Batch Done => RING_PUSH_REQUESTS
	wmb()
	req_prod = req_prod_pvt

The last req might cross a cache line but if you are at the end of a
batch how much does that matter? I suppose you could push a nop request
or something to align if it was an issue.

Maybe you don't get this behaviour if you are batching effectively
though? Solve the batching problem and you solve the cacheline thing
basically for free though.
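
For reference, the request-side pattern described above looks roughly like
this in a frontend. Only the RING_* macros and notify_remote_via_irq() are
existing interfaces; 'struct blkfront_info', 'struct blk_batch_entry' and
'fill_request' are stand-ins for this sketch:

static void push_batch(struct blkfront_info *info,
                       struct blk_batch_entry *batch, unsigned int n)
{
        unsigned int i;
        int notify;

        for (i = 0; i < n; i++) {
                struct blkif_request *ring_req =
                        RING_GET_REQUEST(&info->ring,
                                         info->ring.req_prod_pvt);

                fill_request(ring_req, &batch[i]);      /* stand-in helper */
                info->ring.req_prod_pvt++;
        }

        /* One barrier and at most one event channel kick per batch. */
        RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
        if (notify)
                notify_remote_via_irq(info->irq);
}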

> > > Naturally this means we need to negotiate a 'feature-request-size'
> > > where v1 says that the request is of 64-bytes length.

Have you considered variable-length requests? This would let the request
size scale with the number of segments required for that request, and
allow you to cache align the ends of the requests without wasting the
extra space that including the worst case number of segments would
imply. e.g. a small write would take 32-bytes (padded to 64 if you must)
and a larger one would take 196 (padded to 256). You should end up with
more efficient use of the space in the ring this way.

This also allows for other things like inlining requests (maybe more
interesting for net than blk) or including DIX requests without
incurring the overhead of however many bytes that is on every request.
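
To illustrate, a variable-length request could look something like this.
Hedged: this layout is invented here purely to show the idea, not a
proposed ABI; blkif_vdev_t and blkif_sector_t are the usual blkif.h types:

/* Invented for illustration - not an existing or proposed ABI. */
struct blkif_vreq_header {
        uint8_t        operation;       /* BLKIF_OP_*                       */
        uint8_t        nr_units;        /* 32-byte units used, header incl. */
        uint16_t       nr_segments;
        blkif_vdev_t   handle;
        uint16_t       _pad;            /* align 'id' to 8 bytes            */
        uint64_t       id;              /* private guest value, echoed      */
        blkif_sector_t sector_number;
        /* Followed by nr_segments * struct blkif_request_segment and
         * padding up to nr_units * 32 bytes, so each request ends on a
         * 32-byte (or cacheline) boundary. */
};

A small write would then occupy a single unit, while a request with many
segments would simply consume several consecutive units - which is the
space saving described above.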

I'd really like it if the result of this conversation could be a new
generic ring structure that was applicable to at least net and blk
rather than a new blk specific protocol.

Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-23  9:24         ` Ian Campbell
@ 2013-01-23 15:03           ` Konrad Rzeszutek Wilk
  2013-01-23 15:39             ` Ian Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-23 15:03 UTC (permalink / raw)
  To: Ian Campbell
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Wed, Jan 23, 2013 at 09:24:37AM +0000, Ian Campbell wrote:
> On Tue, 2013-01-22 at 19:25 +0000, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jan 21, 2013 at 12:37:18PM +0000, Ian Campbell wrote:
> > > On Fri, 2013-01-18 at 18:20 +0000, Konrad Rzeszutek Wilk wrote:
> > > > 
> > > > > > E). The network stack has showed that going in a polling mode does improve
> > > > > > performance. The current mechanism of kicking the guest and or block
> > > > > > backend is not always clear.  [TODO: Konrad to explain it in details]
> > > > 
> > > > Oh, I never did explain this - but I think the patches that Daniel came
> > > > up with actually fix a part of it. They make the kick-the-other guest
> > > > only happen when the backend has processed all of the requests and
> > > > cannot find anything else to do. Previously it was more of 'done one
> > > > request, lets kick the backend.'.
> > > 
> > > blkback uses RING_PUSH_RESPONSES_AND_CHECK_NOTIFY so doesn't it get some
> > > amount of evthcn mitigation for free?
> > 
> > So there are two paths here - the kick from a) frontend and the kick b) backend
> > gives the frontend.
> > 
> > The a) case is fairly straighforward. We process all of the rings we and everytime
> > we have finished with a request we re-read the producer. So if the frontend keeps
> > us bussy we will keep on processing.
> > 
> > The b) case is the one that is trigger happy. Every time a request is completed (so
> > say 44kB of data has finally been read/written) we kick the frontend.
> >  In the networking world there are mechanism to modify the hardware were it would
> > kick the OS (so frontend in our case) when it has processed 8, 16, or 64 packets
> > (or some other value). Depending on the latency this can be bad or good. If the
> > backend is using a very slow disk we would probably want the frontend to be
> > kicked every time a response has been completed.
> 
> Perhaps all that is needed is to have the f.e. set rsp_event to
> min(rsp_cons + <BATCH_SIZE>, rsp_prod (+/- 1?) ) in blkfront's
> RING_FINAL_CHECK_FOR_RESPONSES to implement batching, like the comment
> in ring.h says:
>  *  These macros will set the req_event/rsp_event field to trigger a
>  *  notification on the very next message that is enqueued. If you want to
>  *  create batches of work (i.e., only receive a notification after several
>  *  messages have been enqueued) then you will need to create a customised
>  *  version of the FINAL_CHECK macro in your own code, which sets the event
>  *  field appropriately.
> 
> IOW I think we already have the mechanisms in the protocol to implement
> this sort of thing.

Yes. It is a question of how the frontend and backend negotiate this.
As in, should there be a negotiation of this value, say 'feature-intr-batch'?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-23  9:53         ` Ian Campbell
@ 2013-01-23 15:21           ` Konrad Rzeszutek Wilk
  2013-01-23 15:41             ` Ian Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-23 15:21 UTC (permalink / raw)
  To: Ian Campbell
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Wed, Jan 23, 2013 at 09:53:14AM +0000, Ian Campbell wrote:
> On Tue, 2013-01-22 at 19:46 +0000, Konrad Rzeszutek Wilk wrote:
> > Correct. In your case we have one cacheline shared by two requests.
> > 
> > This means we will have to be extra careful in the backend (And frontend)
> > to call 'wmb()' after we have filled two entries. Otherwise we up doing
> > something like this:
> > 
> >         fill out the req[0]
> >         a) wmb()   => writes the full 64-bytes cacheline out.
> >         fill out the req[1]
> >         b) wmb()        <== has to throw out the cacheline and re-write the new
> >                     one.
> > 
> > With a 64-bytes one we do not have to worry about that.
> 
> ... unless your cachelines are 128- or 256-bytes...
> 
> Short answer: the existing ring.h macros take care of this for you.

Nice.
> 
> Long answer: The RING_PUSH_{REQUESTS,RESPONSES} handle this by only
> issuing the wmb() over the entire current batch of things, not each time
> you queue something. i.e. you can queue requests, modifying req_prod_pvt
> as you go and then at the end of a suitable batch you call
> RING_PUSH_REQUESTS which updates req_prod and does the barriers so you
> end up with
> 	fill out the req[0]
> 	req_prod_pvt = 1
> 	fill out the req[1]
> 	req_prod_pvt = 2
> 	fill out the req[2]
> 	req_prod_pvt = 3
> 	Batch Done => RING_PUSH_REQUESTS
> 	wmb()
> 	req_prod = req_prod_pvt
> 
> The last req might cross a cache line but if you are at the end of a
> batch how much does that matter? I suppose you could push a nop request
> or something to align if it was an issue.

Oh, NOP. That is an idea.
> 
> Maybe you don't get this behaviour if you are batching effectively
> though? Solve the batching problem and you solve the cacheline thing
> basically for free though.

Right. And looking at the code it actually does this.
> 
> > > > Naturally this means we need to negotiate a 'feature-request-size'
> > > > where v1 says that the request is of 64-bytes length.
> 
> Have you consider variable length requests? This would let the request
> size scale with the number of segments required for that request, and
> allow you to cache align the ends of the requests without wasting the
> extra space that including the worst case number of segments would
> imply. e.g. a small write would take 32-bytes (padded to 64 if you must)
> and a larger one would take 196 (padded to 256). You should end up with
> more efficient use of the space in the ring this way.

Yes. If I recall right, D) had it as "we could negotiate a proper alignment for
each request/response structure". The idea is pretty much the same as yours
- pad to the cacheline if the cacheline is bigger than the request or response.

> 
> This also allows for other things like inlining requests (maybe more
> interesting for net than blk) or including DIX requests without
> incurring the overhead of however many bytes that is on every request.
> 
> I'd really like it if the result of this conversation could be a new
> generic ring structure that was applicable to at least net and blk
> rather than a new blk specific protocol.
> 
> Ian.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-23 15:03           ` Konrad Rzeszutek Wilk
@ 2013-01-23 15:39             ` Ian Campbell
  2013-01-23 16:57               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2013-01-23 15:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Wed, 2013-01-23 at 15:03 +0000, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 23, 2013 at 09:24:37AM +0000, Ian Campbell wrote:
> > On Tue, 2013-01-22 at 19:25 +0000, Konrad Rzeszutek Wilk wrote:
> > > On Mon, Jan 21, 2013 at 12:37:18PM +0000, Ian Campbell wrote:
> > > > On Fri, 2013-01-18 at 18:20 +0000, Konrad Rzeszutek Wilk wrote:
> > > > > 
> > > > > > > E). The network stack has showed that going in a polling mode does improve
> > > > > > > performance. The current mechanism of kicking the guest and or block
> > > > > > > backend is not always clear.  [TODO: Konrad to explain it in details]
> > > > > 
> > > > > Oh, I never did explain this - but I think the patches that Daniel came
> > > > > up with actually fix a part of it. They make the kick-the-other guest
> > > > > only happen when the backend has processed all of the requests and
> > > > > cannot find anything else to do. Previously it was more of 'done one
> > > > > request, lets kick the backend.'.
> > > > 
> > > > blkback uses RING_PUSH_RESPONSES_AND_CHECK_NOTIFY so doesn't it get some
> > > > amount of evthcn mitigation for free?
> > > 
> > > So there are two paths here - the kick from a) frontend and the kick b) backend
> > > gives the frontend.
> > > 
> > > The a) case is fairly straighforward. We process all of the rings we and everytime
> > > we have finished with a request we re-read the producer. So if the frontend keeps
> > > us bussy we will keep on processing.
> > > 
> > > The b) case is the one that is trigger happy. Every time a request is completed (so
> > > say 44kB of data has finally been read/written) we kick the frontend.
> > >  In the networking world there are mechanism to modify the hardware were it would
> > > kick the OS (so frontend in our case) when it has processed 8, 16, or 64 packets
> > > (or some other value). Depending on the latency this can be bad or good. If the
> > > backend is using a very slow disk we would probably want the frontend to be
> > > kicked every time a response has been completed.
> > 
> > Perhaps all that is needed is to have the f.e. set rsp_event to
> > min(rsp_cons + <BATCH_SIZE>, rsp_prod (+/- 1?) ) in blkfront's
> > RING_FINAL_CHECK_FOR_RESPONSES to implement batching, like the comment
> > in ring.h says:
> >  *  These macros will set the req_event/rsp_event field to trigger a
> >  *  notification on the very next message that is enqueued. If you want to
> >  *  create batches of work (i.e., only receive a notification after several
> >  *  messages have been enqueued) then you will need to create a customised
> >  *  version of the FINAL_CHECK macro in your own code, which sets the event
> >  *  field appropriately.
> > 
> > IOW I think we already have the mechanisms in the protocol to implement
> > this sort of thing.
> 
> Yes. It is question of how does the frontend and backend negotiate this.
> As in should there be an negotiation of this value, say 'feature-intr-batch'.

Either end can independently implement this using the existing
mechanisms, for their "direction", can't they? No need to negotiate.

Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-23 15:21           ` Konrad Rzeszutek Wilk
@ 2013-01-23 15:41             ` Ian Campbell
  2013-01-23 16:59               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2013-01-23 15:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Wed, 2013-01-23 at 15:21 +0000, Konrad Rzeszutek Wilk wrote:
> > Have you consider variable length requests? This would let the request
> > size scale with the number of segments required for that request, and
> > allow you to cache align the ends of the requests without wasting the
> > extra space that including the worst case number of segments would
> > imply. e.g. a small write would take 32-bytes (padded to 64 if you must)
> > and a larger one would take 196 (padded to 256). You should end up with
> > more efficient use of the space in the ring this way.
> 
> Yes. If I recall right, the D) had it as "we could negotiate a proper alignment for
> each request/response structure.". The idea is pretty much the same as yours
> - pad to the cacheline if the cacheline is bigger than the request or response.

It would be interesting to know whether it is cache line bouncing or
smaller ring capacity (which padding reduces) which becomes the
bottleneck.

My money is, unhelpfully, on "both, depending on workload".

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-23 15:39             ` Ian Campbell
@ 2013-01-23 16:57               ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-23 16:57 UTC (permalink / raw)
  To: Ian Campbell
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Wed, Jan 23, 2013 at 03:39:59PM +0000, Ian Campbell wrote:
> On Wed, 2013-01-23 at 15:03 +0000, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 23, 2013 at 09:24:37AM +0000, Ian Campbell wrote:
> > > On Tue, 2013-01-22 at 19:25 +0000, Konrad Rzeszutek Wilk wrote:
> > > > On Mon, Jan 21, 2013 at 12:37:18PM +0000, Ian Campbell wrote:
> > > > > On Fri, 2013-01-18 at 18:20 +0000, Konrad Rzeszutek Wilk wrote:
> > > > > > 
> > > > > > > > E). The network stack has showed that going in a polling mode does improve
> > > > > > > > performance. The current mechanism of kicking the guest and or block
> > > > > > > > backend is not always clear.  [TODO: Konrad to explain it in details]
> > > > > > 
> > > > > > Oh, I never did explain this - but I think the patches that Daniel came
> > > > > > up with actually fix a part of it. They make the kick-the-other guest
> > > > > > only happen when the backend has processed all of the requests and
> > > > > > cannot find anything else to do. Previously it was more of 'done one
> > > > > > request, lets kick the backend.'.
> > > > > 
> > > > > blkback uses RING_PUSH_RESPONSES_AND_CHECK_NOTIFY so doesn't it get some
> > > > > amount of evthcn mitigation for free?
> > > > 
> > > > So there are two paths here - the kick from a) frontend and the kick b) backend
> > > > gives the frontend.
> > > > 
> > > > The a) case is fairly straighforward. We process all of the rings we and everytime
> > > > we have finished with a request we re-read the producer. So if the frontend keeps
> > > > us bussy we will keep on processing.
> > > > 
> > > > The b) case is the one that is trigger happy. Every time a request is completed (so
> > > > say 44kB of data has finally been read/written) we kick the frontend.
> > > >  In the networking world there are mechanism to modify the hardware were it would
> > > > kick the OS (so frontend in our case) when it has processed 8, 16, or 64 packets
> > > > (or some other value). Depending on the latency this can be bad or good. If the
> > > > backend is using a very slow disk we would probably want the frontend to be
> > > > kicked every time a response has been completed.
> > > 
> > > Perhaps all that is needed is to have the f.e. set rsp_event to
> > > min(rsp_cons + <BATCH_SIZE>, rsp_prod (+/- 1?) ) in blkfront's
> > > RING_FINAL_CHECK_FOR_RESPONSES to implement batching, like the comment
> > > in ring.h says:
> > >  *  These macros will set the req_event/rsp_event field to trigger a
> > >  *  notification on the very next message that is enqueued. If you want to
> > >  *  create batches of work (i.e., only receive a notification after several
> > >  *  messages have been enqueued) then you will need to create a customised
> > >  *  version of the FINAL_CHECK macro in your own code, which sets the event
> > >  *  field appropriately.
> > > 
> > > IOW I think we already have the mechanisms in the protocol to implement
> > > this sort of thing.
> > 
> > Yes. It is question of how does the frontend and backend negotiate this.
> > As in should there be an negotiation of this value, say 'feature-intr-batch'.
> 
> Either end can independently implement this using the existing
> mechanisms, for their "direction", can't they? No need to negotiate.

I think that is feasible. I am just worried that some backends (or frontends)
will choke if the "normal" one-payload-equals-one-interrupt pattern is not
happening. Hence this new feature, so that we don't have to worry about the
old drivers. But maybe I am being too paranoid and it all would work just great.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-23 15:41             ` Ian Campbell
@ 2013-01-23 16:59               ` Konrad Rzeszutek Wilk
  2013-01-24 10:06                 ` Ian Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-23 16:59 UTC (permalink / raw)
  To: Ian Campbell
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Wed, Jan 23, 2013 at 03:41:17PM +0000, Ian Campbell wrote:
> On Wed, 2013-01-23 at 15:21 +0000, Konrad Rzeszutek Wilk wrote:
> > > Have you consider variable length requests? This would let the request
> > > size scale with the number of segments required for that request, and
> > > allow you to cache align the ends of the requests without wasting the
> > > extra space that including the worst case number of segments would
> > > imply. e.g. a small write would take 32-bytes (padded to 64 if you must)
> > > and a larger one would take 196 (padded to 256). You should end up with
> > > more efficient use of the space in the ring this way.
> > 
> > Yes. If I recall right, the D) had it as "we could negotiate a proper alignment for
> > each request/response structure.". The idea is pretty much the same as yours
> > - pad to the cacheline if the cacheline is bigger than the request or response.
> 
> It would be interesting to know whether it is cache line bouncing or
> smaller ring capacity (which padding reduces) which becomes the
> bottleneck.
> 
> My money is, unhelpfully, on "both, depending on workload".

and potentially on the guest pCPU distribution. The one workload I've been
running is to stick the guests on different sockets and also turn
hardware prefetching off (and Turbo Mode as well, for good measure).

This way it is the worst of all worlds - but it can provide me with the
data on whether this cache issue is a problem or not.

> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-23 16:59               ` Konrad Rzeszutek Wilk
@ 2013-01-24 10:06                 ` Ian Campbell
  2013-01-24 15:11                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2013-01-24 10:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Wed, 2013-01-23 at 16:59 +0000, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 23, 2013 at 03:41:17PM +0000, Ian Campbell wrote:
> > On Wed, 2013-01-23 at 15:21 +0000, Konrad Rzeszutek Wilk wrote:
> > > > Have you consider variable length requests? This would let the request
> > > > size scale with the number of segments required for that request, and
> > > > allow you to cache align the ends of the requests without wasting the
> > > > extra space that including the worst case number of segments would
> > > > imply. e.g. a small write would take 32-bytes (padded to 64 if you must)
> > > > and a larger one would take 196 (padded to 256). You should end up with
> > > > more efficient use of the space in the ring this way.
> > > 
> > > Yes. If I recall right, the D) had it as "we could negotiate a proper alignment for
> > > each request/response structure.". The idea is pretty much the same as yours
> > > - pad to the cacheline if the cacheline is bigger than the request or response.
> > 
> > It would be interesting to know whether it is cache line bouncing or
> > smaller ring capacity (which padding reduces) which becomes the
> > bottleneck.
> > 
> > My money is, unhelpfully, on "both, depending on workload".
> 
> and on the guest pCPU distribution potentially. The one workload I've been
> doing is to stick guests on different sockets and also turning
> hardware prefetching off (and also Turbo Mode for good measure).

These last two are constructing a rather artificial scenario though,
aren't they?

> This way it is the worst of the worlds - but it can provide me with the
> data whether this cache issue is a problem or not.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-24 10:06                 ` Ian Campbell
@ 2013-01-24 15:11                   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-24 15:11 UTC (permalink / raw)
  To: Ian Campbell
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew,
	Roger Pau Monne

On Thu, Jan 24, 2013 at 10:06:56AM +0000, Ian Campbell wrote:
> On Wed, 2013-01-23 at 16:59 +0000, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 23, 2013 at 03:41:17PM +0000, Ian Campbell wrote:
> > > On Wed, 2013-01-23 at 15:21 +0000, Konrad Rzeszutek Wilk wrote:
> > > > > Have you consider variable length requests? This would let the request
> > > > > size scale with the number of segments required for that request, and
> > > > > allow you to cache align the ends of the requests without wasting the
> > > > > extra space that including the worst case number of segments would
> > > > > imply. e.g. a small write would take 32-bytes (padded to 64 if you must)
> > > > > and a larger one would take 196 (padded to 256). You should end up with
> > > > > more efficient use of the space in the ring this way.
> > > > 
> > > > Yes. If I recall right, the D) had it as "we could negotiate a proper alignment for
> > > > each request/response structure.". The idea is pretty much the same as yours
> > > > - pad to the cacheline if the cacheline is bigger than the request or response.
> > > 
> > > It would be interesting to know whether it is cache line bouncing or
> > > smaller ring capacity (which padding reduces) which becomes the
> > > bottleneck.
> > > 
> > > My money is, unhelpfully, on "both, depending on workload".
> > 
> > and on the guest pCPU distribution potentially. The one workload I've been
> > doing is to stick guests on different sockets and also turning
> > hardware prefetching off (and also Turbo Mode for good measure).
> 
> This last two are constructing a rather artificial scenario though
> aren't they?

Yes. But they give great insight into the cache issues.
> 
> > This way it is the worst of the worlds - but it can provide me with the
> > data whether this cache issue is a problem or not.
> > 
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC v1: Xen block protocol overhaul - problem statement (with pictures!)
  2013-01-22 19:46       ` Konrad Rzeszutek Wilk
  2013-01-23  9:53         ` Ian Campbell
@ 2013-02-20 21:31         ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-02-20 21:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: axboe, xen-devel, Felipe Franciosi, martin.petersen, matthew

> > > Which has a nice power of two ring to it. (Ah the puns!)
> > > 
> > > I like the idea of putting the request on a diet - but too much
> > > could cause us to miss the opportunity to insert other flags on it.
> > > If I recall correctly, the DIF/DIX only need 8 bytes of data.
> > > If we make the assumption that:
> > >         I/O request = one ring entry
> > 
> > So we only need to reserve 8bytes for each DIF/IDX, even if the request
> > contains a variable number of data? (I mean, block requests can at a
> > minimum contain 4096bytes, or much more)
> 
> I need to double check with Martin (CC-ed here). But my recollection
> is that it is just attached the the 'bio'. So if the BIO is 4K or 1MB -
> it would only have one DIF/DIX data type.

And that is semi-correct. If the user did a horrible job (say using
dd) the pages are chained together - and we end up with a linked list
of bio's. The last bio would point to a page in which each 'sector's worth
of data has a checksum. Each checksum occupies 8 bytes. So if the
total 'bio' length is, say, 1MB, this last page is filled with 256
checksums - so 2048 bytes of data.

> 
> Hmm, but then we operate on the 'struct request' so that might not
> be the case..
> > 
> > > and the  "one ring entry" can use the the '4' grants if we just have a
> > > 16KB I/O request, but if it is more than that - we use the indirect page
> > 
> > Well, on my purpose I've limited the number of segments of a "rw"
> > requests to 2, so it's only 8K, anything bigger has to use indirect
> > descriptors, which can fit 4M of data (because I'm passing 4 grant
> > frames full of "blkif_request_indirect_entry" entries).
> 
> <nods>
> > 
> > > and can stuff 1MB of data in there.
> > > The extra 32-bytes of space for such things as 'DIF/DIX'. This also
> > > means we could unify the 'struct request' with the 'discard' operation
> > > and it could utilize the 32-bytes of extra unused payload data.
> > > 
> > >>>
> > >>>
> > >>> The ‘operation’ would be BLKIF_OP_INDIRECT. The read/write/discard,
> > >>> etc operation would now be in indirect.op. The indirect.gref points to
> > >>> a page that is filled with:
> > >>>
> > >>>
> > >>> struct blkif_request_indirect_entry {
> > >>>         blkif_sector_t sector_number;
> > >>>         struct blkif_request_segment seg;
> > >>> } __attribute__((__packed__));
> > >>> //16 bytes, so we can fit in a page 256 of these structures.
> > >>>
> > >>>
> > >>> This means that with the existing 36 slots in the ring (single page)
> > >>> we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
> > >>> ~= 32M. If we don’t want to use indirect descriptor we can still use
> > >>> up to 4 pages of the request (as it has enough space to contain four
> > >>> segments and the structure will still be cache-aligned).
> > >>>


Martin asked me why we even do this via these entries. Meaning, why
have this tuple of information for each page: <lba, first_sect, last_sect, gref>?
The lba on the next indirect entry is simply going to be incremented by
one, and the first_sect and last_sect as well... So why not just do:

struct blkif_request_indirect {
        uint8_t        operation;
        blkif_vdev_t   handle;       /* only for read/write requests         */
#ifdef CONFIG_X86_64
        uint32_t       _pad1;        /* offsetof(blkif_request,u.rw.id) == 8 */
#endif
        uint64_t       id;           /* private guest value, echoed in resp  */
        blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */

	grant_ref_t	indirect_desc;
	uint16_t	nr_elems;
}

And the 'indirect_desc' would point to a page that looks quite close to
what the scatterlist looks like:

	struct indirect_chain {
		uint16_t	op_flag;	// can be D_NEXT, D_START, D_END?
		uint16_t	next;
		uint16_t	offset;
		uint16_t	length;
		uint32_t	gref;
		uint32_t	_pad;		// Need this in case we ever want to
						// make gref + _pad be a physical addr.
	}

And the page itself would be:
	struct indirect_chain[256];

the 'next' would just contain the index inside the indirect_chain page - so from
0->256. The offset and length would say where in the page the data is
contained.

This way the 'lba' information is part of the 'blkif_request_indirect' and the
payload info is all in the indirect descriptors.
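
A hedged sketch of how a backend could walk such a page, assuming the two
structures above; 'submit_io' is a stand-in, and the chain semantics
(implicit lba advance, 'next' as an in-page index) are taken from the
description above:

/* 'chain' is the mapped indirect_desc page, 'nr_elems' comes from the
 * request; submit_io() is a stand-in for queueing the actual I/O. */
static void walk_indirect_chain(struct indirect_chain *chain,
                                uint16_t nr_elems, uint64_t lba)
{
        uint16_t idx = 0;

        while (nr_elems--) {
                struct indirect_chain *e = &chain[idx];

                submit_io(e->gref, e->offset, e->length, lba);
                lba += e->length >> 9;  /* advance in 512-byte sectors */
                idx = e->next;          /* next entry within the page */
        }
}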

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-02-20 21:31 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-18 14:31 RFC v1: Xen block protocol overhaul - problem statement (with pictures!) Konrad Rzeszutek Wilk
2012-12-18 14:49 ` Jan Beulich
2013-01-18 16:00 ` Roger Pau Monné
2013-01-18 18:20   ` Konrad Rzeszutek Wilk
2013-01-19 12:44     ` Roger Pau Monné
2013-01-22 19:46       ` Konrad Rzeszutek Wilk
2013-01-23  9:53         ` Ian Campbell
2013-01-23 15:21           ` Konrad Rzeszutek Wilk
2013-01-23 15:41             ` Ian Campbell
2013-01-23 16:59               ` Konrad Rzeszutek Wilk
2013-01-24 10:06                 ` Ian Campbell
2013-01-24 15:11                   ` Konrad Rzeszutek Wilk
2013-02-20 21:31         ` Konrad Rzeszutek Wilk
2013-01-21 12:37     ` Ian Campbell
2013-01-22 19:25       ` Konrad Rzeszutek Wilk
2013-01-23  9:24         ` Ian Campbell
2013-01-23 15:03           ` Konrad Rzeszutek Wilk
2013-01-23 15:39             ` Ian Campbell
2013-01-23 16:57               ` Konrad Rzeszutek Wilk
