From: Dave Chinner <david@fromorbit.com>
To: Avi Kivity <avi@scylladb.com>
Cc: Glauber Costa <glauber@scylladb.com>, xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Fri, 4 Dec 2015 14:16:48 +1100	[thread overview]
Message-ID: <20151204031648.GC26718@dastard> (raw)
In-Reply-To: <56603AF8.1080209@scylladb.com>

On Thu, Dec 03, 2015 at 02:52:08PM +0200, Avi Kivity wrote:
> 
> 
> On 12/03/2015 01:19 AM, Dave Chinner wrote:
> >On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote:
> >>On 12/02/2015 01:06 AM, Dave Chinner wrote:
> >>>On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 11:19 PM, Dave Chinner wrote:
> >>>>>>>XFS spreads files across the allocation groups, based on the directory these
> >>>>>>>files are created in,
> >>>>>>Idea: create the files in some subdirectory, and immediately move
> >>>>>>them to their required location.
> >....
> >>>>My hack involves creating the file in a random directory, and while
> >>>>it is still zero sized, move it to its final directory.  This is
> >>>>simply to defeat the ag selection heuristic.
> >>>Which you really don't want to do.
> >>Why not?  For my directory structure, files in the same directory do
> >>not share temporal locality.  What does the ag selection heuristic
> >>give me?
> >Wrong question. The right question is this: what problems does
> >subverting the AG selection heuristic cause me?
> >
> >If you can't answer that question, then you can't quantify the risks
> >involved with making such a behavioural change.
> 
> Okay.  Any hint about the answer to that question?

If your file set is randomly distributed across the filesystem, then
it's quite likely that the filesystem will use all of the LBA space
rather than reusing the same AGs and hence LBA regions. That's going
to slowly fragment free space as metadata (which has different
lifetimes to data) and long term data gets more widely distributed.
That, in turn, will slowly result in the working dataset being made
up of more and smaller extents, which will also slowly get more
distributed over time, which then means allocation and freeing of
extents takes longer, trim becomes less effective because it's
working with smaller spaces, the SSD's "LBA in use" mapping becomes
more fragmented so garbage collection becomes harder, etc...

But, really, the only way to tell is to test, measure, observe and
analyse....
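
For reference, the hack under discussion boils down to roughly the
sketch below (untested, and the scratch-directory layout is invented
purely for illustration), so the fragmentation risk above applies to
every file created through it. Watching the free space histogram
over time (e.g. with xfs_db -r -c "freesp -s" on the block device)
is one way to do the measuring and observing.

/*
 * Rough sketch of the "defeat AG selection" hack being debated above:
 * create the file in a randomly chosen scratch directory so the inode
 * lands in an effectively random AG, then rename it into its real
 * directory while it is still zero length.  The /data/scratch.N
 * layout and count are made up for illustration.
 */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NSCRATCH 8	/* number of pre-created scratch directories */

int create_spread(const char *final_path)
{
	char tmp[PATH_MAX];

	snprintf(tmp, sizeof(tmp), "/data/scratch.%d/tmp.%d",
		 rand() % NSCRATCH, (int)getpid());

	int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
	if (fd < 0)
		return -1;

	/* Rename while still empty: the inode (and hence the AG it was
	 * allocated in) does not move, only the directory entry does. */
	if (rename(tmp, final_path) < 0) {
		unlink(tmp);
		close(fd);
		return -1;
	}
	return fd;	/* caller appends through this fd */
}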

> >>>>>>This is pointless for an SSD. Perhaps XFS should randomize the ag on
> >>>>>>nonrotational media instead.
> >>>>>Actually, no, it is not pointless. SSDs do not require optimisation
> >>>>>for minimal seek time, but data locality is still just as important
> >>>>>as spinning disks, if not moreso. Why? Because the garbage
> >>>>>collection routines in the SSDs are all about locality and we can't
> >>>>>drive garbage collection effectively via discard operations if the
> >>>>>filesystem is not keeping temporally related files close together in
> >>>>>its block address space.
> >>>>In my case, files in the same directory are not temporally related.
> >>>>But I understand where the heuristic comes from.
> >>>>
> >>>>Maybe an ioctl to set a directory attribute "the files in this
> >>>>directory are not temporally related"?
> >>>And exactly what does that gain us?
> >>I have a directory with commitlog files that are constantly and
> >>rapidly being created, appended to, and removed, from all logical
> >>cores in the system.  Does this not put pressure on that allocation
> >>group's locks?
> >Not usually, because if an AG is contended, the allocation algorithm
> >skips the contended AG and selects the next uncontended AG to
> >allocate in. And given that the append algorithm used by the
> >allocator attempts to use the last block of the last extent as the
> >target for the new extent (i.e. contiguous allocation) once a file
> >has skipped to a different AG all allocations will continue in that
> >new AG until it is either full or it becomes contended....
> >
> >IOWs, when AG contention occurs, the filesystem automatically
> >spreads out the load over multiple AGs. Put simply, we optimise for
> >locality first, but we're willing to compromise on locality to
> >minimise contention when it occurs. But, also, keep in mind that
> >in minimising contention we are still selecting the most local of
> >possible alternatives, and that's something you can't do in
> >userspace....
> 
> Cool.  I don't think "nearly-local" matters much for an SSD (it's
> either contiguous or it is not), but it's good to know that it's
> self-tuning wrt. contention.

"Nearly local" matters a lot for filesystem free space management
and hence minimising the amount o LBA space the filesystem actually
uses in the long term given a relatively predicatable workload....

> In some good news, Glauber hacked our I/O engine not to throw so
> many concurrent I/Os at the filesystem, and indeed the contention
> was reduced.  So it's likely we were pushing the fs so hard that all
> the AGs were contended, but this is no longer the case.

What is the xfs_info output of the filesystem you tested on?

> >With the way the XFS allocator works, it fills AGs from lowest to
> >highest blocks, and if you free lots of space down low in the AG
> >then that tends to get reused before the higher offset free space.
> >Hence the way XFS allocates space in the above workload would result in
> >roughly 1/3rd of the LBA space associated with the filesystem
> >remaining unused. This is another allocator behaviour designed for
> >spinning disks (to keep the data on the faster outer edges of
> >drives) that maps very well to internal SSD allocation/reclaim
> >algorithms....
> 
> Cool.  So we'll keep fstrim usage to daily, or something similarly low.

Well, it's something you'll need to monitor to determine what the
best frequency is, as even fstrim doesn't come for free (esp. if the
storage does not support queued TRIM commands).
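
FWIW, fstrim(8) is just a thin wrapper around the FITRIM ioctl, so if
you'd rather drive the trims from your own maintenance code on your
own schedule, something like this rough, untested sketch will do it
(the mount point is made up; it needs CAP_SYS_ADMIN, same as fstrim):

#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	struct fstrim_range range = {
		.start	= 0,
		.len	= UINT64_MAX,	/* trim the whole filesystem */
		.minlen	= 0,		/* let the fs pick a minimum extent */
	};

	/* Any readable fd on the mounted filesystem will do. */
	int fd = open("/mnt/data", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		close(fd);
		return 1;
	}

	/* On return, range.len holds the number of bytes actually trimmed. */
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}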

> >FWIW, did you know that TRIM generally doesn't return the disk to
> >the performance of a pristine, empty disk?  Generally only a secure
> >erase will guarantee that a SSD returns to "empty disk" performance,
> >but that also removes all data from the entire SSD.  Hence the
> >baseline "sustained performance" you should be using is not "empty
> >disk" performance, but the performance once the disk has been
> >overwritten completely at least once. Only then will you tend to see
> >what effect TRIM will actually have.
> 
> I did not know that.  Maybe that's another factor in why cloud SSDs
> are so slow.

Have a look at the random write performance consistency graphs for
the different enterprise SSDs here:

http://www.anandtech.com/show/9430/micron-m510dc-480gb-enterprise-sata-ssd-review/3

You'll see just how different sustained write load performance is to
the empty drive performance (which is only the first few hundred
seconds of each graph) across the different drives that have been
tested. The next page has similar results for mixed random
read/write workloads....

That will give you a good idea of how the current enterprise SSDs
behave under sustained write load. It's a *lot* better than the way
the 1st and 2nd generation drives performed....

> >>write 10%-20% of the disk's capacity.
> >Run the workload to steady state performance and measure the
> >degradation as it continues to run and overwrite the SSDs
> >repeatedly. To do this properly you are going to have to sacrifice
> >some SSDs, because you're going to need to overwrite them quite a
> >few times to get an idea of the degradation characteristics and
> >whether a periodic trim makes any difference or not.
> 
> Enterprise SSDs are guaranteed for something like N full writes /
> day for several years, are they not?

Yes, usually somewhere between 3-15 DWPD (Drive Writes Per Day) - it
typically works out at around 5000 full drive write cycles for
enterprise drives.  However, at both the low capacity end of the
scale and the high performance end (i.e. PCIe cards capable of
multiple GB/s of writes), it's not uncommon to be able to burn a DW
cycle in under 10 minutes, and so you can easily burn the life out of
a drive in a couple of weeks of intense testing....
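
(Rough illustrative numbers: 5 DWPD over a 3 year warranty is about
5 x 365 x 3 ~= 5500 cycles, and a 1.6TB card sustaining ~3GB/s
rewrites itself in roughly 1600/3 ~= 530 seconds, i.e. under 10
minutes per cycle.)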

> So such a test can take weeks
> or months, depending on the ratio between disk size and bandwidth.
> Still, I guess it has to be done.

*nod*

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
