From: Benoît Canet
Date: Fri, 5 Sep 2014 16:56:34 +0200
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
Message-ID: <20140905145634.GA3883@irqsave.net>
In-Reply-To: <20140905143031.GG4656@noname.str.redhat.com>
References: <20140828143809.GB28789@irqsave.net> <8761h7fz27.fsf@blackfin.pond.sub.org> <20140901104437.GA15673@irqsave.net> <87bnqzd0vm.fsf@blackfin.pond.sub.org> <20140905143031.GG4656@noname.str.redhat.com>
To: Kevin Wolf
Cc: Benoît Canet, libvir-list@redhat.com, Markus Armbruster, qemu-devel@nongnu.org, stefanha@redhat.com, anshul.makkar@profitbricks.com

On Friday 05 Sep 2014 at 16:30:31 (+0200), Kevin Wolf wrote:
> On 01.09.2014 at 13:41, Markus Armbruster wrote:
> > Benoît Canet writes:
> >
> > > On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> > >> Cc'ing libvirt following Stefan's lead.
> > >>
> > >> Benoît Canet writes:
> > >> > /* the following would compute latencies for slices of 1 second, then
> > >> >  * toss the result and start a new slice. A weighted summation of the
> > >> >  * instant latencies could help to implement this.
> > >> >  */
> > >> > 1s_read_average_latency
> > >> > 1s_write_average_latency
> > >> > 1s_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> >  * minute slice value */
> > >> > 1m_read_average_latency
> > >> > 1m_write_average_latency
> > >> > 1m_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> >  * hour slice value */
> > >> > 1h_read_average_latency
> > >> > 1h_write_average_latency
> > >> > 1h_flush_average_latency
> > >>
> > >> This is something like "what we added to total_FOO_time in the last
> > >> completed 1s / 1m / 1h time slice divided by the number of additions".
> > >> Just another way to accumulate the same raw data, thus no worries.
> > >>
> > >> > /* 1 second average number of requests in flight */
> > >> > 1s_read_queue_depth
> > >> > 1s_write_queue_depth
> > >> >
> > >> > /* 1 minute average number of requests in flight */
> > >> > 1m_read_queue_depth
> > >> > 1m_write_queue_depth
> > >> >
> > >> > /* 1 hour average number of requests in flight */
> > >> > 1h_read_queue_depth
> > >> > 1h_write_queue_depth
>
> I don't think I agree with putting fixed time periods like 1 s/min/h
> into qemu. What you need there is policy, and we should probably make
> it configurable.
>
> Do we need accounting for multiple time periods at the same time, or
> would it be enough to have one and make its duration an option?
>
> > > Optionally collecting the same data for each BDS of the graph.
> >
> > If that's the case, keeping the shared infrastructure in the block layer
> > makes sense.
> >
> > BDS member acct then holds I/O stats for the BDS. We currently use it
> > for something else: I/O stats of the device model backed by this BDS.
> > That needs to move elsewhere. Two places come to mind:
> >
> > 1. BlockBackend, when it's available (I resumed working on it last week
> >    for a bit). Superficially attractive, because it's close to what we
> >    have now, but then we have to deal with what to do when the backend
> >    gets disconnected from its device model, then connected to another
> >    one.
> >
> > 2. The device models that actually implement I/O accounting. Since
> >    query-blockstats names a backend rather than a device model, we need
> >    a BlockDevOps callback to fetch the stats. Fetch fails when the
> >    callback is null. Lets us distinguish "no stats yet" and "device
> >    model can't do stats", thus permits a QMP interface that doesn't lie.
> >
> > Right now, I like (2) better.
>
> So let's say I have some block device, which is attached to a guest
> device for a while, but then I detach it and continue using it in a
> different place (maybe another guest device or a block job). Should we
> really reset all counters in query-blockstats to 0?
>
> I think as a user I would be surprised by this, because I still refer
> to it by the same name (the device_name, which will be in the BB), so
> it's the same thing for me, and the total requests include everything
> that was ever issued against it.

It all depends on where you are in the user food chain.

If you are a cloud end user, you are interested in the stats associated with
the hardware of your /dev/vdx, because that's what iostat allows you to see.

>
> > > API-wise, I think about adding
> > > bdrv_acct_invalid() and
> > > bdrv_acct_failed(), and systematically issuing a bdrv_acct_start() asap.
> >
> > Complication: partial success. Example:
> >
> > 1. Guest requests a read of N sectors.
> >
> > 2. Device model calls
> >    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
> >
> > 3. Device model examines the request, and deems it valid.
> >
> > 4. Device model passes it to the block layer.
> >
> > 5. Block layer does its thing, but for some reason only M < N sectors
> >    can be read. Block layer returns M.
>
> No, it returns -errno.
>
> > 6. What's the device model to do now? Both bdrv_acct_failed() and
> >    bdrv_acct_done() would be wrong.
> >
> >    Should the device model account for a read of size M? This ignores
> >    the partial failure.
> >
> >    Should it split the read into a successful and a failed part for
> >    accounting purposes? This miscounts the number of reads.
>
> I think we should simply account it as a failed request, because this is
> what it will look like for the guest. If you want the partial data that
> was internally issued, you need to look at different statistics
> (probably those of bs->file).
>
> Kevin