Re: [Qemu-devel] [libvirt] IO accounting overhaul

From: "Benoît Canet" <benoit.canet@irqsave.net>
To: Kevin Wolf <kwolf@redhat.com>
Cc: "Benoît Canet" <benoit.canet@irqsave.net>,
	libvir-list@redhat.com, "Markus Armbruster" <armbru@redhat.com>,
	qemu-devel@nongnu.org, stefanha@redhat.com,
	anshul.makkar@profitbricks.com
Subject: Re: [Qemu-devel] [libvirt]    IO accounting overhaul
Date: Fri, 5 Sep 2014 16:57:28 +0200	[thread overview]
Message-ID: <20140905145728.GB3883@irqsave.net> (raw)
In-Reply-To: <20140905143031.GG4656@noname.str.redhat.com>

The Friday 05 Sep 2014 à 16:30:31 (+0200), Kevin Wolf wrote :
> Am 01.09.2014 um 13:41 hat Markus Armbruster geschrieben:
> > Benoît Canet <benoit.canet@irqsave.net> writes:
> > 
> > > The Monday 01 Sep 2014 à 11:52:00 (+0200), Markus Armbruster wrote :
> > >> Cc'ing libvirt following Stefan's lead.
> > >> 
> > >> Benoît Canet <benoit.canet@irqsave.net> writes:
> > >> > /* the following would compute latecies for slices of 1 seconds then toss the
> > >> >  * result and start a new slice. A weighted sumation of the instant latencies
> > >> >  * could help to implement this.
> > >> >  */
> > >> > 1s_read_average_latency
> > >> > 1s_write_average_latency
> > >> > 1s_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> > minute slice value */
> > >> > 1m_read_average_latency
> > >> > 1m_write_average_latency
> > >> > 1m_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1 hours
> > >> > slice value */
> > >> > 1h_read_average_latency
> > >> > 1h_write_average_latency
> > >> > 1h_flush_average_latency
> > >> 
> > >> This is something like "what we added to total_FOO_time in the last
> > >> completed 1s / 1m / 1h time slice divided by the number of additions".
> > >> Just another way to accumulate the same raw data, thus no worries.
> > >> 
> > >> > /* 1 second average number of requests in flight */
> > >> > 1s_read_queue_depth
> > >> > 1s_write_queue_depth
> > >> >
> > >> > /* 1 minute average number of requests in flight */
> > >> > 1m_read_queue_depth
> > >> > 1m_write_queue_depth
> > >> >
> > >> > /* 1 hours average number of requests in flight */
> > >> > 1h_read_queue_depth
> > >> > 1h_write_queue_depth
> 
> I don't think I agree with putting fixed time periods like 1 s/min/h
> into qemu. What you need there is policy and we should probably make
> it configurable.

I agree.

> 
> Do we need accounting for multiple time periods at the same time or
> would it be enough to have one and make its duration an option?

I don't know yet.

> 
> > > Optionally collecting the same data for each BDS of the graph.
> > 
> > If that's the case, keeping the shared infrastructure in the block layer
> > makes sense.
> > 
> > BDS member acct then holds I/O stats for the BDS.  We currently use it
> > for something else: I/O stats of the device model backed by this BDS.
> > That needs to move elsewhere.  Two places come to mind:
> > 
> > 1. BlockBackend, when it's available (I resumed working on it last week
> >    for a bit).  Superficially attractive, because it's close to what we
> >    have now, but then we have to deal with what to do when the backend
> >    gets disconnected from its device model, then connected to another
> >    one.
> > 
> > 2. The device models that actually implement I/O accounting.  Since
> >    query-blockstats names a backend rather than a device model, we need
> >    a BlockDevOps callback to fetch the stats.  Fetch fails when the
> >    callback is null.  Lets us distinguish "no stats yet" and "device
> >    model can't do stats", thus permits a QMP interface that doesn't lie.
> > 
> > Right now, I like (2) better.
> 
> So let's say I have some block device, which is attached to a guest
> device for a while, but then I detach it and continue using it in a
> different place (maybe another guest device or a block job). Should we
> really reset all counters in query-blockstats to 0?
> 
> I think as I user I would be surprised about this, because I still refer
> to it by the same name (the device_name, which will be in the BB), so
> it's the same thing for me and the total requests include everything
> that was ever issued against it.
> 
> > > -API wize I think about adding
> > > bdrv_acct_invalid() and
> > > bdrv_acct_failed() and systematically issuing a bdrv_acct_start() asap.
> > 
> > Complication: partial success.  Example:
> > 
> > 1. Guest requests a read of N sectors.
> > 
> > 2. Device model calls
> >    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
> > 
> > 3. Device model examines the request, and deems it valid.
> > 
> > 4. Device model passes it to the block layer.
> > 
> > 5. Block layer does its thing, but for some reason only M < N sectors
> >    can be read.  Block layer returns M.
> 
> No, it returns -errno.
> 
> > 6. What's the device model to do now?  Both bdrv_acct_failed() and
> >    bdrv_acct_done() would be wrong.
> > 
> >    Should the device model account for a read of size M?  This ignores
> >    the partial failure.
> > 
> >    Should it split the read into a successful and a failed part for
> >    accounting purposes?  This miscounts the number of reads.
> 
> I think we should simply account it as a failed request because this is
> what it will look like for the guest. If you want the partial data that
> was internally issued, you need to look at different statistics
> (probably those of bs->file).
> 
> Kevin
> 
> --
> libvir-list mailing list
> libvir-list@redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list