Date: Mon, 1 Sep 2014 15:38:12 +0200
From: Benoît Canet
Message-ID: <20140901133812.GA12307@irqsave.net>
References: <20140828143809.GB28789@irqsave.net> <8761h7fz27.fsf@blackfin.pond.sub.org> <20140901104437.GA15673@irqsave.net> <87bnqzd0vm.fsf@blackfin.pond.sub.org>
In-Reply-To: <87bnqzd0vm.fsf@blackfin.pond.sub.org>
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
To: Markus Armbruster
Cc: Benoît Canet, kwolf@redhat.com, libvir-list@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, anshul.makkar@profitbricks.com

On Monday 01 Sep 2014 at 13:41:01 (+0200), Markus Armbruster wrote:
> Benoît Canet writes:
>
> > On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> >> Cc'ing libvirt following Stefan's lead.
> >>
> >> Benoît Canet writes:
> >>
> >> > Hi,
> >> >
> >> > I collected some items of a cloud provider's wishlist regarding I/O
> >> > accounting.
> >>
> >> Feedback from real power-users, lovely!
> >>
> >> > In a cloud, I/O accounting can serve three purposes: billing, helping
> >> > the customers, and doing metrology to help the cloud provider seek
> >> > out hidden costs.
> >> >
> >> > I'll cover the former two topics in this mail because they are the
> >> > most important business-wise.
> >> >
> >> > 1) preferred place to collect billing I/O accounting data:
> >> > ----------------------------------------------------------
> >> > For billing purposes the collected data must be as close as possible
> >> > to what the customer would see by running iostat in his VM.
> >>
> >> Good point.
> >>
> >> > The first conclusion we can draw is that the choice of collecting the
> >> > I/O accounting data used for billing in the block device models is
> >> > right.
> >>
> >> Slightly rephrasing: doing I/O accounting in the block device models is
> >> right for billing.
> >>
> >> There may be other uses for I/O accounting, with different preferences.
> >> For instance, data on how exactly guest I/O gets translated to host I/O
> >> as it flows through the nodes in the block graph could be useful.
> >
> > I think this is the third point, the one I named metrology.
> > Basically it boils down to "where are the hidden I/O costs of the QEMU
> > block layer?".
>
> Understood.
>
> >> Doesn't diminish the need for accurate billing information, of course.
> >>
> >> > 2) what to do with occurrences of rare events:
> >> > ----------------------------------------------
> >> >
> >> > Another point is that QEMU developers agree that they don't know
> >> > which policy to apply to some I/O accounting events.
> >> > Must QEMU discard an invalid write I/O or account it as done?
> >> > Must QEMU count a failed read I/O as done?
> >> >
> >> > When discussing this with a cloud provider, the following became
> >> > apparent: these decisions are really specific to each cloud provider,
> >> > and QEMU should not implement them.
> >>
> >> Good point, consistent with the old advice to avoid baking policy into
> >> inappropriately low levels of the stack.
> >>
> >> > The right thing to do is to add accounting counters to collect these
> >> > events.
> >> >
> >> > Moreover, these rare events are precious troubleshooting data, so
> >> > that's an additional reason not to toss them.
> >>
> >> Another good point.
> >>
> >> > 3) list of block I/O accounting metrics wanted for billing and
> >> > helping the customers
> >> > ----------------------------------------------------------------
> >> >
> >> > Basic I/O accounting data will end up making the customers' bills.
> >> > Extra I/O accounting information would be a precious help for the
> >> > cloud provider to implement a monitoring panel like Amazon
> >> > CloudWatch.
> >>
> >> These are the first two from your list of three purposes, i.e. the ones
> >> you promised to cover here.
> >>
> >> > Here is the list of counters and statistics I would like to help
> >> > implement in QEMU.
> >> >
> >> > This is the most important part of the mail and the one I would like
> >> > the community to review the most.
> >> >
> >> > Once this list is settled I would proceed to implement the required
> >> > infrastructure in QEMU before using it in the device models.
> >>
> >> For context, let me recap how I/O accounting works now.
> >>
> >> The BlockDriverState abstract data type (short: BDS) can hold the
> >> following accounting data:
> >>
> >>     uint64_t nr_bytes[BDRV_MAX_IOTYPE];
> >>     uint64_t nr_ops[BDRV_MAX_IOTYPE];
> >>     uint64_t total_time_ns[BDRV_MAX_IOTYPE];
> >>     uint64_t wr_highest_sector;
> >>
> >> where BDRV_MAX_IOTYPE enumerates read, write, flush.
> >>
> >> wr_highest_sector is a high watermark updated by the block layer as it
> >> writes sectors.
> >>
> >> The other three are *not* touched by the block layer. Instead, the
> >> block layer provides a pair of functions for device models to update
> >> them:
> >>
> >>     void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
> >>                          int64_t bytes, enum BlockAcctType type);
> >>     void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
> >>
> >> bdrv_acct_start() initializes the cookie for a read, write, or flush
> >> operation of a certain size. The size of a flush is always zero.
> >>
> >> bdrv_acct_done() adds the operation to the BDS's accounting data.
> >> total_time_ns is incremented by the time between _start() and _done().
> >>
> >> You may call _start() without calling _done(). That's a feature.
> >> Device models use it to avoid accounting some requests.
> >>
> >> Device models are not supposed to mess with the cookie directly, only
> >> through these two functions.
> >>
> >> Some device models implement accounting, some don't. The ones that do
> >> don't agree on how to count invalid guest requests (the ones not passed
> >> to the block layer) and failed requests (passed to the block layer and
> >> failed there). It's a mess, in part caused by us never writing down
> >> what exactly device models are expected to do.
> >>
> >> Accounting data is used by "query-blockstats", and nothing else.
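For readers joining the thread, here is roughly what the device-model
side of this API looks like (my own simplified sketch, loosely modeled
on the IDE emulation; DeviceRequest and the helper functions are made
up for illustration):

    /* Sketch: how a device model accounts a guest read with today's API.
     * DeviceRequest, request_is_valid(), signal_guest_error() and
     * complete_request() are illustrative placeholders, not real code. */
    static void device_model_read(DeviceRequest *req,
                                  int64_t sector, int nb_sectors)
    {
        bdrv_acct_start(req->bs, &req->acct, nb_sectors * BDRV_SECTOR_SIZE,
                        BDRV_ACCT_READ);
        if (!request_is_valid(sector, nb_sectors)) {
            /* Invalid request: _done() is never called, so today the
             * request is simply not accounted at all. */
            signal_guest_error(req);
            return;
        }
        bdrv_aio_readv(req->bs, sector, &req->qiov, nb_sectors,
                       device_model_read_cb, req);
    }

    static void device_model_read_cb(void *opaque, int ret)
    {
        DeviceRequest *req = opaque;
        if (ret == 0) {
            /* Success: account the operation and its latency. */
            bdrv_acct_done(req->bs, &req->acct);
        }
        /* Failure: some models call _done() anyway, some don't --
         * exactly the inconsistency described above. */
        complete_request(req, ret);
    }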
> >>
> >> Corollary: even though every BDS holds accounting data, only the ones
> >> in "top" BDSes ever get used. This is a common block layer blemish,
> >> and we're working on cleaning it up.
> >>
> >> If a device model doesn't implement accounting, query-blockstats lies.
> >> Fortunately, its lies are pretty transparent (everything's zero) as
> >> long as you don't do things like connecting a backend to a device
> >> model that doesn't implement accounting after disconnecting it from a
> >> device model that does. Still, I'd welcome a more honest QMP
> >> interface.
> >>
> >> For me, this accounting data belongs to the device model, not the BDS.
> >> Naturally, the block device models should use common infrastructure.
> >> I guess they use the block layer only because it's obvious
> >> infrastructure they share. Clumsy design.
> >>
> >> > /* volume of data transferred by the I/Os */
> >> > read_bytes
> >> > write_bytes
> >>
> >> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
> >>
> >> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
> >>
> >> Should this count only actual I/O, i.e. the accumulated size of
> >> successful operations?
> >>
> >> > /* operation count */
> >> > read_ios
> >> > write_ios
> >> > flush_ios
> >> >
> >> > /* how many invalid I/Os the guest submitted */
> >> > invalid_read_ios
> >> > invalid_write_ios
> >> > invalid_flush_ios
> >> >
> >> > /* how many I/O errors happened */
> >> > read_ios_error
> >> > write_ios_error
> >> > flush_ios_error
> >>
> >> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> >> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
> >>
> >> > /* account the time spent doing I/Os */
> >> > total_read_time
> >> > total_write_time
> >> > total_flush_time
> >>
> >> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> >> total_time_ns[BDRV_ACCT_FLUSH].
> >>
> >> I guess this should count both successful and failed I/O. Could throw
> >> in invalid, too, but it's probably too quick to matter.
> >>
> >> Please specify the unit clearly. Both total_FOO_time_ns and
> >> total_FOO_ns would work for me.
> >
> > Yes, _ns is fine for me too.
> >
> >>
> >> > /* since when the volume is iddle */
> >> > qvolume_iddleness_time
> >>
> >> "idle"
> >>
> >> The obvious way to maintain this information with the current code
> >> would be saving the value of get_clock() in bdrv_acct_done().
> >>
> >> > /* the following would compute latencies for slices of 1 second, then
> >> >  * toss the result and start a new slice. A weighted summation of the
> >> >  * instant latencies could help to implement this.
> >> >  */
> >> > 1s_read_average_latency
> >> > 1s_write_average_latency
> >> > 1s_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a
> >> >  * 1 minute slice value */
> >> > 1m_read_average_latency
> >> > 1m_write_average_latency
> >> > 1m_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a
> >> >  * 1 hour slice value */
> >> > 1h_read_average_latency
> >> > 1h_write_average_latency
> >> > 1h_flush_average_latency
> >>
> >> This is something like "what we added to total_FOO_time in the last
> >> completed 1s / 1m / 1h time slice, divided by the number of
> >> additions". Just another way to accumulate the same raw data, thus no
> >> worries.
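To make the full list concrete, the per-device state I have in mind
would look roughly like this (a sketch only, nothing of it exists yet;
all names are tentative and open for discussion):

    /* Sketch of the proposed accounting state, one per device model. */
    enum { BLOCK_ACCT_READ, BLOCK_ACCT_WRITE, BLOCK_ACCT_FLUSH,
           BLOCK_MAX_IOTYPE };

    typedef struct BlockAcctStats {
        uint64_t nr_bytes[BLOCK_MAX_IOTYPE];      /* successful I/O only */
        uint64_t nr_ops[BLOCK_MAX_IOTYPE];        /* successful operations */
        uint64_t invalid_ops[BLOCK_MAX_IOTYPE];   /* rejected before submission */
        uint64_t failed_ops[BLOCK_MAX_IOTYPE];    /* submitted but failed */
        uint64_t total_time_ns[BLOCK_MAX_IOTYPE]; /* successful + failed */
        int64_t last_access_ns;                   /* to derive idle time */
        /* the 1s/1m/1h latency and queue-depth slices would hang off
         * here, see the sketch at the end of this mail */
    } BlockAcctStats;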
> >> > /* 1 second average number of requests in flight */
> >> > 1s_read_queue_depth
> >> > 1s_write_queue_depth
> >> >
> >> > /* 1 minute average number of requests in flight */
> >> > 1m_read_queue_depth
> >> > 1m_write_queue_depth
> >> >
> >> > /* 1 hour average number of requests in flight */
> >> > 1h_read_queue_depth
> >> > 1h_write_queue_depth
> >>
> >> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> >> The "you need not call bdrv_acct_done()" feature may get in the way.
> >> Solvable.
> >>
> >> Permit me a short detour into the other use for I/O accounting I
> >> mentioned: data on how exactly guest I/O gets translated to host I/O
> >> as it flows through the nodes in the block graph. Do you think this
> >> would be pretty much the same data, just collected at different
> >> points?
> >
> > That's something I would like to take care of in a later subproject.
>
> I'm asking now because it impacts where I'd prefer the shared
> infrastructure to live.
>
> > Optionally collecting the same data for each BDS of the graph.
>
> If that's the case, keeping the shared infrastructure in the block layer
> makes sense.
>
> BDS member acct then holds I/O stats for the BDS. We currently use it
> for something else: I/O stats of the device model backed by this BDS.
> That needs to move elsewhere. Two places come to mind:
>
> 1. BlockBackend, when it's available (I resumed working on it last week
>    for a bit). Superficially attractive, because it's close to what we
>    have now, but then we have to deal with what to do when the backend
>    gets disconnected from its device model, then connected to another
>    one.
>
> 2. The device models that actually implement I/O accounting. Since
>    query-blockstats names a backend rather than a device model, we need
>    a BlockDevOps callback to fetch the stats. Fetch fails when the
>    callback is null. Lets us distinguish "no stats yet" and "device
>    model can't do stats", thus permits a QMP interface that doesn't lie.
>
> Right now, I like (2) better.

ok

>
> >> > 4) Making this happen
> >> > ---------------------
> >> >
> >> > Outscale wants to make these I/O stats happen and gave me the
> >> > go-ahead to do whatever grunt work is required.
> >> > That said, we could collaborate on some parts of the work.
> >>
> >> Cool!
> >>
> >> A quick stab at tasks:
> >>
> >> * QMP interface, either a compatible extension of query-blockstats or
> >>   a new one.
> >
> > I would like to extend query-blockstats at first, but I would also
> > like to postpone the QMP interface changes and just write the shared
> > infrastructure and deploy it in the device models.
>
> Implementing improved data collection need not wait for the QMP design.
>
> >> * Rough idea on how to do the shared infrastructure.
> >
> > API-wise I think about adding bdrv_acct_invalid() and
> > bdrv_acct_failed(), and systematically issuing bdrv_acct_start() asap.
>
> Complication: partial success. Example:
>
> 1. Guest requests a read of N sectors.
>
> 2. Device model calls
>    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
>
> 3. Device model examines the request, and deems it valid.
>
> 4. Device model passes it to the block layer.
>
> 5. Block layer does its thing, but for some reason only M < N sectors
>    can be read. Block layer returns M.
>
> 6. What's the device model to do now? Both bdrv_acct_failed() and
>    bdrv_acct_done() would be wrong.
>
> Should the device model account for a read of size M? This ignores
> the partial failure.
>
> Should it split the read into a successful and a failed part for
> accounting purposes? This miscounts the number of reads.

Maybe a bdrv_acct_partial(), accounting the size of the data actually
read in the bandwidth counter and taking care of counting this event.
Maybe we will discover some other rare events to account for.

>
> > To calculate the averages I think about a global timer firing every
> > second and iterating over the BDS list to do the computations even
> > when there is no I/O activity. Is it acceptable to have a QemuMutex
> > per statistics structure?
>
> Let me try to sketch something simpler. For brevity, I cover only a
> single slice, but adding more is just more of the same.
>
> Have an end-of-slice timestamp.
>
> When accounting an operation, compare the current time to end-of-slice
> (we already get the current time to measure latency, so this is cheap).
> If end-of-slice is in the past, zero the slice's stats, and update
> end-of-slice.
>
> When something asks for stats, do the same, to avoid returning stale
> slice stats.
>
> >> * Implement (can be split up into several tasks if desired)
> >
> > First I would like to write a series implementing a
> > backward-compatible API and get it merged.
> >
> > Then the deployment of the new API specifics in the device models can
> > be split up and parallelized.
>
> Works for me.
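Your slice idea translated to code would look something like this (my
rough interpretation, single slice as in your sketch, untested; the
"last completed slice" semantics would need a second copy of the
counters):

    /* Sketch of the timer-less slice rollover described above.
     * All names are placeholders, not existing QEMU code. */
    typedef struct AcctSlice {
        int64_t end_ns;         /* end-of-slice timestamp */
        int64_t period_ns;      /* slice length: 1s, 1m or 1h in ns */
        uint64_t total_time_ns; /* latency accumulated in this slice */
        uint64_t nr_ops;        /* operations accounted in this slice */
    } AcctSlice;

    /* Called both when accounting an operation and when stats are
     * queried, so stale slices get reset without any global timer
     * or per-structure mutex. */
    static void acct_slice_roll(AcctSlice *s, int64_t now_ns)
    {
        if (now_ns >= s->end_ns) {
            s->total_time_ns = 0;
            s->nr_ops = 0;
            s->end_ns = now_ns + s->period_ns; /* restart from now */
        }
    }

    static void acct_slice_account(AcctSlice *s, int64_t now_ns,
                                   int64_t latency_ns)
    {
        acct_slice_roll(s, now_ns);
        s->total_time_ns += latency_ns;
        s->nr_ops++;
    }

    /* Average latency of the current slice, 0 when idle. */
    static int64_t acct_slice_avg_ns(AcctSlice *s, int64_t now_ns)
    {
        acct_slice_roll(s, now_ns);
        return s->nr_ops ? (int64_t)(s->total_time_ns / s->nr_ops) : 0;
    }

If that matches what you had in mind I will use it as the basis of the
series.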