From: Benoît Canet
Date: Fri, 5 Sep 2014 16:56:34 +0200
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
Message-ID: <20140905145634.GA3883@irqsave.net>
In-Reply-To: <20140905143031.GG4656@noname.str.redhat.com>
References: <20140828143809.GB28789@irqsave.net> <8761h7fz27.fsf@blackfin.pond.sub.org> <20140901104437.GA15673@irqsave.net> <87bnqzd0vm.fsf@blackfin.pond.sub.org> <20140905143031.GG4656@noname.str.redhat.com>
To: Kevin Wolf
Cc: Benoît Canet, libvir-list@redhat.com, Markus Armbruster, qemu-devel@nongnu.org, stefanha@redhat.com, anshul.makkar@profitbricks.com

On Friday 05 Sep 2014 at 16:30:31 (+0200), Kevin Wolf wrote:
> On 01.09.2014 at 13:41, Markus Armbruster wrote:
> > Benoît Canet writes:
> >
> > > On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> > >> Cc'ing libvirt following Stefan's lead.
> > >>
> > >> Benoît Canet writes:
> > >> > /* the following would compute latencies for slices of 1 second, then
> > >> >  * toss the result and start a new slice. A weighted summation of the
> > >> >  * instant latencies could help to implement this.
> > >> >  */
> > >> > 1s_read_average_latency
> > >> > 1s_write_average_latency
> > >> > 1s_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> >  * minute slice value */
> > >> > 1m_read_average_latency
> > >> > 1m_write_average_latency
> > >> > 1m_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> >  * hour slice value */
> > >> > 1h_read_average_latency
> > >> > 1h_write_average_latency
> > >> > 1h_flush_average_latency
> > >>
> > >> This is something like "what we added to total_FOO_time in the last
> > >> completed 1s / 1m / 1h time slice divided by the number of additions".
> > >> Just another way to accumulate the same raw data, thus no worries.
> > >>
> > >> > /* 1 second average number of requests in flight */
> > >> > 1s_read_queue_depth
> > >> > 1s_write_queue_depth
> > >> >
> > >> > /* 1 minute average number of requests in flight */
> > >> > 1m_read_queue_depth
> > >> > 1m_write_queue_depth
> > >> >
> > >> > /* 1 hour average number of requests in flight */
> > >> > 1h_read_queue_depth
> > >> > 1h_write_queue_depth
>
> I don't think I agree with putting fixed time periods like 1 s/min/h
> into qemu. What you need there is policy, and we should probably make
> it configurable.
>
> Do we need accounting for multiple time periods at the same time, or
> would it be enough to have one and make its duration an option?
>
> > > Optionally collecting the same data for each BDS of the graph.
> >
> > If that's the case, keeping the shared infrastructure in the block layer
> > makes sense.
> >
> > BDS member acct then holds I/O stats for the BDS. We currently use it
> > for something else: I/O stats of the device model backed by this BDS.
> > That needs to move elsewhere. Two places come to mind:
> >
> > 1. BlockBackend, when it's available (I resumed working on it last week
> >    for a bit). Superficially attractive, because it's close to what we
> >    have now, but then we have to deal with what to do when the backend
> >    gets disconnected from its device model, then connected to another
> >    one.
> >
> > 2. The device models that actually implement I/O accounting. Since
> >    query-blockstats names a backend rather than a device model, we need
> >    a BlockDevOps callback to fetch the stats. Fetch fails when the
> >    callback is null. Lets us distinguish "no stats yet" and "device
> >    model can't do stats", thus permits a QMP interface that doesn't lie.
> >
> > Right now, I like (2) better.
>
> So let's say I have some block device, which is attached to a guest
> device for a while, but then I detach it and continue using it in a
> different place (maybe another guest device or a block job). Should we
> really reset all counters in query-blockstats to 0?
>
> I think as a user I would be surprised by this, because I still refer
> to it by the same name (the device_name, which will be in the BB), so
> it's the same thing for me, and the total requests include everything
> that was ever issued against it.

It all depends on where you are in the user food chain.

If you are a cloud end user, you are interested in the stats associated with
the hardware of your /dev/vdx, because that's what iostat allows you to see.

>
> > > API-wise, I think about adding
> > > bdrv_acct_invalid() and
> > > bdrv_acct_failed(), and systematically issuing a bdrv_acct_start() asap.
> >
> > Complication: partial success. Example:
> >
> > 1. Guest requests a read of N sectors.
> >
> > 2. Device model calls
> >    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
> >
> > 3. Device model examines the request, and deems it valid.
> >
> > 4. Device model passes it to the block layer.
> >
> > 5. Block layer does its thing, but for some reason only M < N sectors
> >    can be read. Block layer returns M.
>
> No, it returns -errno.
>
> > 6. What's the device model to do now? Both bdrv_acct_failed() and
> >    bdrv_acct_done() would be wrong.
> >
> >    Should the device model account for a read of size M? This ignores
> >    the partial failure.
> >
> >    Should it split the read into a successful and a failed part for
> >    accounting purposes? This miscounts the number of reads.
>
> I think we should simply account it as a failed request, because this is
> what it will look like for the guest. If you want the partial data that
> was internally issued, you need to look at different statistics
> (probably those of bs->file).
>
> Kevin