Date: Mon, 1 Sep 2014 15:38:12 +0200
From: Benoît Canet
Message-ID: <20140901133812.GA12307@irqsave.net>
References: <20140828143809.GB28789@irqsave.net> <8761h7fz27.fsf@blackfin.pond.sub.org> <20140901104437.GA15673@irqsave.net> <87bnqzd0vm.fsf@blackfin.pond.sub.org>
In-Reply-To: <87bnqzd0vm.fsf@blackfin.pond.sub.org>
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
To: Markus Armbruster
Cc: Benoît Canet, kwolf@redhat.com, libvir-list@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, anshul.makkar@profitbricks.com

On Monday 01 Sep 2014 at 13:41:01 (+0200), Markus Armbruster wrote:
> Benoît Canet writes:
>
> > On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> >> Cc'ing libvirt following Stefan's lead.
> >>
> >> Benoît Canet writes:
> >>
> >> > Hi,
> >> >
> >> > I collected some items of a cloud provider's wishlist regarding I/O
> >> > accounting.
> >>
> >> Feedback from real power-users, lovely!
> >>
> >> > In a cloud, I/O accounting can serve three purposes: billing, helping
> >> > the customers, and doing metrology to help the cloud provider seek
> >> > out hidden costs.
> >> >
> >> > I'll cover the former two topics in this mail because they are the
> >> > most important business-wise.
> >> >
> >> > 1) preferred place to collect billing I/O accounting data:
> >> > ----------------------------------------------------------
> >> > For billing purposes the collected data must be as close as possible
> >> > to what the customer would see by running iostat in his VM.
> >>
> >> Good point.
> >>
> >> > The first conclusion we can draw is that the choice of collecting the
> >> > I/O accounting data used for billing in the block device models is
> >> > right.
> >>
> >> Slightly rephrasing: doing I/O accounting in the block device models is
> >> right for billing.
> >>
> >> There may be other uses for I/O accounting, with different preferences.
> >> For instance, data on how exactly guest I/O gets translated to host I/O
> >> as it flows through the nodes in the block graph could be useful.
> >
> > I think this is the third point, the one I named metrology.
> > Basically it boils down to "where are the hidden I/O costs of the QEMU
> > block layer?".
>
> Understood.
>
> >> Doesn't diminish the need for accurate billing information, of course.
> >>
> >> > 2) what to do with occurrences of rare events:
> >> > ----------------------------------------------
> >> >
> >> > Another point is that QEMU developers agree that they don't know
> >> > which policy to apply to some I/O accounting events.
> >> > Must QEMU discard an invalid write I/O or account it as done?
> >> > Must QEMU count a failed read I/O as done?
> >> >
> >> > When discussing this with a cloud provider, the following became
> >> > apparent: these decisions are really specific to each cloud provider,
> >> > and QEMU should not implement them.
> >>
> >> Good point, consistent with the old advice to avoid baking policy into
> >> inappropriately low levels of the stack.
> >>
> >> > The right thing to do is to add accounting counters to collect these
> >> > events.
> >> >
> >> > Moreover, these rare events are precious troubleshooting data, so
> >> > that's an additional reason not to toss them.
> >>
> >> Another good point.
> >>
> >> > 3) list of block I/O accounting metrics wanted for billing and
> >> > helping the customers
> >> > ----------------------------------------------------------------
> >> >
> >> > Basic I/O accounting data will end up making the customers' bills.
> >> > Extra I/O accounting information would be a precious help for the
> >> > cloud provider to implement a monitoring panel like Amazon
> >> > CloudWatch.
> >>
> >> These are the first two from your list of three purposes, i.e. the ones
> >> you promised to cover here.
> >>
> >> > Here is the list of counters and statistics I would like to help
> >> > implement in QEMU.
> >> >
> >> > This is the most important part of the mail and the one I would like
> >> > the community to review the most.
> >> >
> >> > Once this list is settled I would proceed to implement the required
> >> > infrastructure in QEMU before using it in the device models.
> >>
> >> For context, let me recap how I/O accounting works now.
> >>
> >> The BlockDriverState abstract data type (short: BDS) can hold the
> >> following accounting data:
> >>
> >>     uint64_t nr_bytes[BDRV_MAX_IOTYPE];
> >>     uint64_t nr_ops[BDRV_MAX_IOTYPE];
> >>     uint64_t total_time_ns[BDRV_MAX_IOTYPE];
> >>     uint64_t wr_highest_sector;
> >>
> >> where BDRV_MAX_IOTYPE enumerates read, write, flush.
> >>
> >> wr_highest_sector is a high watermark updated by the block layer as it
> >> writes sectors.
> >>
> >> The other three are *not* touched by the block layer. Instead, the
> >> block layer provides a pair of functions for device models to update
> >> them:
> >>
> >>     void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
> >>                          int64_t bytes, enum BlockAcctType type);
> >>     void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
> >>
> >> bdrv_acct_start() initializes the cookie for a read, write, or flush
> >> operation of a certain size. The size of a flush is always zero.
> >>
> >> bdrv_acct_done() adds the operation to the BDS's accounting data.
> >> total_time_ns is incremented by the time between _start() and _done().
> >>
> >> You may call _start() without calling _done(). That's a feature.
> >> Device models use it to avoid accounting some requests.
> >>
> >> Device models are not supposed to mess with the cookie directly, only
> >> through these two functions.
> >>
> >> Some device models implement accounting, some don't. The ones that do
> >> don't agree on how to count invalid guest requests (the ones not passed
> >> to the block layer) and failed requests (passed to the block layer and
> >> failed there). It's a mess, in part caused by us never writing down
> >> what exactly device models are expected to do.
> >>
> >> Accounting data is used by "query-blockstats", and nothing else.
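For readers joining the thread, here is roughly what the device-model
side of this API looks like (my own simplified sketch, loosely modeled
on the IDE emulation; DeviceRequest and the helper functions are made
up for illustration):

    /* Sketch: how a device model accounts a guest read with today's API.
     * DeviceRequest, request_is_valid(), signal_guest_error() and
     * complete_request() are illustrative placeholders, not real code. */
    static void device_model_read(DeviceRequest *req,
                                  int64_t sector, int nb_sectors)
    {
        bdrv_acct_start(req->bs, &req->acct, nb_sectors * BDRV_SECTOR_SIZE,
                        BDRV_ACCT_READ);
        if (!request_is_valid(sector, nb_sectors)) {
            /* Invalid request: _done() is never called, so today the
             * request is simply not accounted at all. */
            signal_guest_error(req);
            return;
        }
        bdrv_aio_readv(req->bs, sector, &req->qiov, nb_sectors,
                       device_model_read_cb, req);
    }

    static void device_model_read_cb(void *opaque, int ret)
    {
        DeviceRequest *req = opaque;
        if (ret == 0) {
            /* Success: account the operation and its latency. */
            bdrv_acct_done(req->bs, &req->acct);
        }
        /* Failure: some models call _done() anyway, some don't --
         * exactly the inconsistency described above. */
        complete_request(req, ret);
    }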
> >>
> >> Corollary: even though every BDS holds accounting data, only the ones
> >> in "top" BDSes ever get used. This is a common block layer blemish,
> >> and we're working on cleaning it up.
> >>
> >> If a device model doesn't implement accounting, query-blockstats lies.
> >> Fortunately, its lies are pretty transparent (everything's zero) as
> >> long as you don't do things like connecting a backend to a device
> >> model that doesn't implement accounting after disconnecting it from a
> >> device model that does. Still, I'd welcome a more honest QMP
> >> interface.
> >>
> >> For me, this accounting data belongs to the device model, not the BDS.
> >> Naturally, the block device models should use common infrastructure.
> >> I guess they use the block layer only because it's obvious
> >> infrastructure they share. Clumsy design.
> >>
> >> > /* volume of data transferred by the I/Os */
> >> > read_bytes
> >> > write_bytes
> >>
> >> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
> >>
> >> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
> >>
> >> Should this count only actual I/O, i.e. the accumulated size of
> >> successful operations?
> >>
> >> > /* operation count */
> >> > read_ios
> >> > write_ios
> >> > flush_ios
> >> >
> >> > /* how many invalid I/Os the guest submitted */
> >> > invalid_read_ios
> >> > invalid_write_ios
> >> > invalid_flush_ios
> >> >
> >> > /* how many I/O errors happened */
> >> > read_ios_error
> >> > write_ios_error
> >> > flush_ios_error
> >>
> >> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> >> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
> >>
> >> > /* account the time spent doing I/Os */
> >> > total_read_time
> >> > total_write_time
> >> > total_flush_time
> >>
> >> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> >> total_time_ns[BDRV_ACCT_FLUSH].
> >>
> >> I guess this should count both successful and failed I/O. Could throw
> >> in invalid, too, but it's probably too quick to matter.
> >>
> >> Please specify the unit clearly. Both total_FOO_time_ns and
> >> total_FOO_ns would work for me.
> >
> > Yes, _ns is fine for me too.
> >
> >>
> >> > /* since when the volume is iddle */
> >> > qvolume_iddleness_time
> >>
> >> "idle"
> >>
> >> The obvious way to maintain this information with the current code
> >> would be saving the value of get_clock() in bdrv_acct_done().
> >>
> >> > /* the following would compute latencies for slices of 1 second, then
> >> >  * toss the result and start a new slice. A weighted summation of the
> >> >  * instant latencies could help to implement this.
> >> >  */
> >> > 1s_read_average_latency
> >> > 1s_write_average_latency
> >> > 1s_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a
> >> >  * 1 minute slice value */
> >> > 1m_read_average_latency
> >> > 1m_write_average_latency
> >> > 1m_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a
> >> >  * 1 hour slice value */
> >> > 1h_read_average_latency
> >> > 1h_write_average_latency
> >> > 1h_flush_average_latency
> >>
> >> This is something like "what we added to total_FOO_time in the last
> >> completed 1s / 1m / 1h time slice, divided by the number of
> >> additions". Just another way to accumulate the same raw data, thus no
> >> worries.
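To make the full list concrete, the per-device state I have in mind
would look roughly like this (a sketch only, nothing of it exists yet;
all names are tentative and open for discussion):

    /* Sketch of the proposed accounting state, one per device model. */
    enum { BLOCK_ACCT_READ, BLOCK_ACCT_WRITE, BLOCK_ACCT_FLUSH,
           BLOCK_MAX_IOTYPE };

    typedef struct BlockAcctStats {
        uint64_t nr_bytes[BLOCK_MAX_IOTYPE];      /* successful I/O only */
        uint64_t nr_ops[BLOCK_MAX_IOTYPE];        /* successful operations */
        uint64_t invalid_ops[BLOCK_MAX_IOTYPE];   /* rejected before submission */
        uint64_t failed_ops[BLOCK_MAX_IOTYPE];    /* submitted but failed */
        uint64_t total_time_ns[BLOCK_MAX_IOTYPE]; /* successful + failed */
        int64_t last_access_ns;                   /* to derive idle time */
        /* the 1s/1m/1h latency and queue-depth slices would hang off
         * here, see the sketch at the end of this mail */
    } BlockAcctStats;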
> >> > /* 1 second average number of requests in flight */
> >> > 1s_read_queue_depth
> >> > 1s_write_queue_depth
> >> >
> >> > /* 1 minute average number of requests in flight */
> >> > 1m_read_queue_depth
> >> > 1m_write_queue_depth
> >> >
> >> > /* 1 hour average number of requests in flight */
> >> > 1h_read_queue_depth
> >> > 1h_write_queue_depth
> >>
> >> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> >> The "you need not call bdrv_acct_done()" feature may get in the way.
> >> Solvable.
> >>
> >> Permit me a short detour into the other use for I/O accounting I
> >> mentioned: data on how exactly guest I/O gets translated to host I/O
> >> as it flows through the nodes in the block graph. Do you think this
> >> would be pretty much the same data, just collected at different
> >> points?
> >
> > That's something I would like to take care of in a later subproject.
>
> I'm asking now because it impacts where I'd prefer the shared
> infrastructure to live.
>
> > Optionally collecting the same data for each BDS of the graph.
>
> If that's the case, keeping the shared infrastructure in the block layer
> makes sense.
>
> BDS member acct then holds I/O stats for the BDS. We currently use it
> for something else: I/O stats of the device model backed by this BDS.
> That needs to move elsewhere. Two places come to mind:
>
> 1. BlockBackend, when it's available (I resumed working on it last week
>    for a bit). Superficially attractive, because it's close to what we
>    have now, but then we have to deal with what to do when the backend
>    gets disconnected from its device model, then connected to another
>    one.
>
> 2. The device models that actually implement I/O accounting. Since
>    query-blockstats names a backend rather than a device model, we need
>    a BlockDevOps callback to fetch the stats. Fetch fails when the
>    callback is null. Lets us distinguish "no stats yet" and "device
>    model can't do stats", thus permits a QMP interface that doesn't lie.
>
> Right now, I like (2) better.

ok

>
> >> > 4) Making this happen
> >> > ---------------------
> >> >
> >> > Outscale wants to make these I/O stats happen and gave me the
> >> > go-ahead to do whatever grunt work is required.
> >> > That said, we could collaborate on some parts of the work.
> >>
> >> Cool!
> >>
> >> A quick stab at tasks:
> >>
> >> * QMP interface, either a compatible extension of query-blockstats or
> >>   a new one.
> >
> > I would like to extend query-blockstats at first, but I would also
> > like to postpone the QMP interface changes and just write the shared
> > infrastructure and deploy it in the device models.
>
> Implementing improved data collection need not wait for the QMP design.
>
> >> * Rough idea on how to do the shared infrastructure.
> >
> > API-wise I think about adding bdrv_acct_invalid() and
> > bdrv_acct_failed(), and systematically issuing bdrv_acct_start() asap.
>
> Complication: partial success. Example:
>
> 1. Guest requests a read of N sectors.
>
> 2. Device model calls
>    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
>
> 3. Device model examines the request, and deems it valid.
>
> 4. Device model passes it to the block layer.
>
> 5. Block layer does its thing, but for some reason only M < N sectors
>    can be read. Block layer returns M.
>
> 6. What's the device model to do now? Both bdrv_acct_failed() and
>    bdrv_acct_done() would be wrong.
>
> Should the device model account for a read of size M? This ignores
> the partial failure.
>
> Should it split the read into a successful and a failed part for
> accounting purposes? This miscounts the number of reads.

Maybe a bdrv_acct_partial(), accounting the size of the data actually
read in the bandwidth counter and taking care of counting this event.
Maybe we will discover some other rare events to account for.

>
> > To calculate the averages I think about a global timer firing every
> > second and iterating over the BDS list to do the computations even
> > when there is no I/O activity. Is it acceptable to have a QemuMutex
> > per statistics structure?
>
> Let me try to sketch something simpler. For brevity, I cover only a
> single slice, but adding more is just more of the same.
>
> Have an end-of-slice timestamp.
>
> When accounting an operation, compare the current time to end-of-slice
> (we already get the current time to measure latency, so this is cheap).
> If end-of-slice is in the past, zero the slice's stats, and update
> end-of-slice.
>
> When something asks for stats, do the same, to avoid returning stale
> slice stats.
>
> >> * Implement (can be split up into several tasks if desired)
> >
> > First I would like to write a series implementing a
> > backward-compatible API and get it merged.
> >
> > Then the deployment of the new API specifics in the device models can
> > be split up and parallelized.
>
> Works for me.
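Your slice idea translated to code would look something like this (my
rough interpretation, single slice as in your sketch, untested; the
"last completed slice" semantics would need a second copy of the
counters):

    /* Sketch of the timer-less slice rollover described above.
     * All names are placeholders, not existing QEMU code. */
    typedef struct AcctSlice {
        int64_t end_ns;         /* end-of-slice timestamp */
        int64_t period_ns;      /* slice length: 1s, 1m or 1h in ns */
        uint64_t total_time_ns; /* latency accumulated in this slice */
        uint64_t nr_ops;        /* operations accounted in this slice */
    } AcctSlice;

    /* Called both when accounting an operation and when stats are
     * queried, so stale slices get reset without any global timer
     * or per-structure mutex. */
    static void acct_slice_roll(AcctSlice *s, int64_t now_ns)
    {
        if (now_ns >= s->end_ns) {
            s->total_time_ns = 0;
            s->nr_ops = 0;
            s->end_ns = now_ns + s->period_ns; /* restart from now */
        }
    }

    static void acct_slice_account(AcctSlice *s, int64_t now_ns,
                                   int64_t latency_ns)
    {
        acct_slice_roll(s, now_ns);
        s->total_time_ns += latency_ns;
        s->nr_ops++;
    }

    /* Average latency of the current slice, 0 when idle. */
    static int64_t acct_slice_avg_ns(AcctSlice *s, int64_t now_ns)
    {
        acct_slice_roll(s, now_ns);
        return s->nr_ops ? (int64_t)(s->total_time_ns / s->nr_ops) : 0;
    }

If that matches what you had in mind I will use it as the basis of the
series.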