Re: EXT4 vs LVM performance for VMs

From: Dave Chinner <david@fromorbit.com>
To: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: Premysl Kouril <premysl.kouril@gmail.com>,
	Theodore Ts'o <tytso@mit.edu>, Andi Kleen <andi@firstfloor.org>,
	linux-fsdevel@vger.kernel.org, changwoo.m@gmail.com,
	taesoo@gatech.edu, steffen.maass@gatech.edu, changwoo@gatech.edu,
	"Kashyap, Sanidhya" <sanidhya@gatech.edu>
Subject: Re: EXT4 vs LVM performance for VMs
Date: Sun, 14 Feb 2016 11:01:22 +1100	[thread overview]
Message-ID: <20160214000122.GR19486@dastard> (raw)
In-Reply-To: <CADa969gCM0_-DieCTrfxFJ0GK5w3fpBCCBn2v4M5oUYUJJ8CNg@mail.gmail.com>

On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did quite extensive performance evaluation on file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability using micro-benchmarks and application benchmarks.
> 
> Your workload, i.e., multiple tasks are concurrently overwriting a
> single file, whose file system blocks are previously written, is quite
> similar to one of our benchmark.
> 
> Based on our analysis, none of the file systems supports concurrent
> update of a file even when each task accesses different region of
> a file. That is because all file systems hold a lock for an entire
> file. Only one exception is the concurrent direct I/O of XFS.
> 
> I think that local file systems need to support the range-based
> locking, which is common in parallel file systems, to improve
> concurrency level of I/O operations, specifically write operations.

Yes, we've spent a fair bit of time talking about that (pretty sure
it was a topic of discussion at last year's LFSMM developer
conference), but it really isn't a simply thing to add to the VFS or
most filesystems.

> If you can split a single file image into multiple files, you can
> increase the concurrency level of write operations a little bit.

At the cost of increased storage stack complexity. most people don't
need extreme performance in their VMs, so a single file is generally
adequate on XFS.

> For more details, please take a look at our paper draft:
>   https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf
> 
> Though our paper is in review, I think it is okay to share since
> the review process is single-blinded. You can find our analysis on
> overwrite operations at Section 5.1.2. Scalability behavior of current
> file systems are summarized at Section 7.

It's a nice summary of the issues, but there are no surprises in the
paper. i.e. It's all things we already know about and, in
some cases, are already looking at solutions (e.g.  per-node/per-cpu
lists to address inode_sb_list_lock contention, potential for
converting i_mutex to an rwsem to allow shared read-only access to
directories, etc).

The only thing that surprised me is how badly rwsems degrade when
contended on large machines. I've done local benchmarks on 16p
machines with single file direct IO and pushed to being CPU bound
I've measured over 2 million single sector random read IOPS, 1.5
million random overwrite IOPS, and ~800k random write w/ allocate
IOPS.  IOWs, the IO scalability is there when the lock doesn't
degrade (which really is a core OS issue, not so much a fs issue).

A couple of things I noticed in the summary:

"High locality can cause performance collapse"

You imply filesystems try to maintain high locality to improve cache
hit rates.  Filesystems try to maintain locality in disk allocation
to minimise seek time for physical IO on related structures to
maintain good performance when /cache misses occur/. IOWs, the
scalability of the in-memory caches is completely unrelated to the
"high locality" optimisations that filesystem make...

"because XFS holds a per-device lock instead of a per-file lock in
an O_DIRECT mode"

That's a new one - I've never heard anyone say that about XFS (and
I've heard a lot of wacky things about XFS!). It's much simpler than
that - we don't use the i_mutex in O_DIRECT mode, and instead uses
shared read locking on the per-inode IO lock for all IO operations.

"Overwriting is as expensive as appending"

You shouldn't make generalisations that don't apply generally to the
the filesystems you tested. :P

FWIW, log->l_icloglock contention in XFS implies the application has
an excessive fsync problem - that's the only way that lock can see
any sort of significant concurrent access.  It's probably just the
case that the old-school algorithm the code uses to wait for journal
IO completion was never expected to scale to operations on storage
that can sustain millions of IOPS.

I'll add it to the list of known journalling scalabiity bottlenecks
in XFS - there's a lot more issues than your testing has told you
about.... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com