From: "Keld Jørn Simonsen" <keld@keldix.com>
To: Stan Hoeppner <stan@hardwarefreak.com>
Cc: "Keld Jørn Simonsen" <keld@keldix.com>,
	Mdadm <linux-raid@vger.kernel.org>,
	"Roberto Spadim" <roberto@spadim.com.br>,
	NeilBrown <neilb@suse.de>,
	"Christoph Hellwig" <hch@infradead.org>,
	Drew <drew.kay@gmail.com>
Subject: Re: high throughput storage server?
Date: Tue, 22 Mar 2011 12:01:29 +0100
Message-ID: <20110322110129.GB9329@www2.open-std.org>
In-Reply-To: <4D887348.3030902@hardwarefreak.com>

On Tue, Mar 22, 2011 at 05:00:40AM -0500, Stan Hoeppner wrote:
> Keld Jørn Simonsen put forth on 3/21/2011 5:13 PM:
> > On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> >> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
> >>
> >>> Anyway, with 384 spindles and only 50 users, each user will have in
> >>> average 7 spindles for himself. I think much of the time this would mean 
> >>> no random IO, as most users are doing large sequential reading. 
> >>> Thus on average you can expect quite close to striping speed if you
> >>> are running RAID capable of striping. 
> >>
> >> This is not how large scale shared RAID storage works under a
> >> multi-stream workload.  I thought I explained this in sufficient detail.
> >>  Maybe not.
> > 
> > Given that the whole array system is only lightly loaded, this is how I
> > expect it to function. Maybe you can explain why it would not be so, if
> > you think otherwise.
> 
> Using the term "lightly loaded" to describe any system sustaining
> concurrent 10GB/s block IO and NFS throughput doesn't seem to be an
> accurate statement.  I think you're confusing theoretical maximum
> hardware performance with real world IO performance.  The former is
> always significantly higher than the latter.  With this in mind, as with
> any well designed system, I specified this system to have some headroom,
> as I previously stated.  Everything we've discussed so far WRT this
> system has been strictly parallel reads.

The disks themselves should be capable of doing about 60 GB/s, so
10 GB/s is only about 17 % utilization of the disks. And most of the
IO is concurrent sequential reading of big files.
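
Just to put numbers on that (assuming roughly 150 MB/s of sustained
streaming per 15k spindle, which is my guess rather than a measured
figure):

  # back-of-envelope check
  echo "384 * 150 / 1000" | bc      # ~57 GB/s aggregate streaming capacity
  echo "100 * 10000 / 57600" | bc   # 10 GB/s is ~17 % of that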

> Now, if 10 cluster nodes are added with an application that performs
> streaming writes, occurring concurrently with the 50 streaming reads,
> we've just significantly increased the amount of head seeking on our
> disks.  The combined IO workload is now a mixed heavy random read/write
> workload.  This is the most difficult type of workload for any RAID
> subsystem.  It would bring most parity RAID arrays to their knees.  This
> is one of the reasons why RAID10 is the only suitable RAID level for
> this type of system.

Yes, I agree. And that is why I also suggest using a mirrored RAID in
the form of Linux MD RAID10,f2, which gives better striping and disk
access performance than traditional RAID1+0.

Anyway, the system was not specified to have an additional 10 heavy
writing processes.
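
For reference, creating such an array with mdadm would look roughly
like this (device names, spindle count and chunk size are only
placeholders, not a recommendation for the exact setup discussed here):

  # 24-disk md RAID10 using the "far 2" layout
  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=256 \
        --raid-devices=24 /dev/sd[b-y]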

> >> In summary, concatenating many relatively low stripe spindle count
> >> arrays, and using XFS allocation groups to achieve parallel scalability,
> >> gives us the performance we want without the problems associated with
> >> other configurations.
> 
> > it is probably not the concurrency of XFS that makes the parallelism of
> > the IO. 
> 
> It most certainly is the parallelism of XFS.  There are some caveats to
> the amount of XFS IO parallelism that are workload dependent.  But
> generally, with multiple processes/threads reading/writing multiple
> files in multiple directories, the device parallelism is very high.  For
> example:
> 
> If you have 50 NFS clients all reading the same large 20GB file
> concurrently, IO parallelism will be limited to the 12 stripe spindles
> on the single underlying RAID array upon which the AG holding this file
> resides.  If no other files in the AG are being accessed at the time,
> you'll get something like 1.8GB/s throughput for this 20GB file.  Since
> the bulk, if not all, of this file will get cached during the read, all
> 50 NFS clients will likely be served from cache at their line rate of
> 200MB/s, or 10GB/s aggregate.  There's that magic 10GB/s number again.
> ;)  As you can see I put some serious thought into this system
> specification.
> 
> If you have all 50 NFS clients accessing 50 different files in 50
> different directories you have no cache benefit.  But we will have files
> residing in all allocation groups on all 16 arrays.  Since XFS evenly
> distributes new directories across AGs when the directories are created,
> we can probably assume we'll have parallel IO across all 16 arrays with
> this workload.  Since each array can stream reads at 1.8GB/s, that's
> potential parallel throughput of 28GB/s, saturating our PCIe bus
> bandwidth of 16GB/s.

Hmm, yes, RAID1+0 can probably only do streaming reads at about
1.8 GB/s. Linux MD RAID10,f2 can do streaming reads at around 3.6 GB/s
on an array of 24 15,000 rpm spindles, given that each spindle is
capable of stream reading at about 150 MB/s.
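
The factor of two comes from the far layout: reads can be striped over
all spindles instead of one disk per mirror pair, so roughly (again
assuming ~150 MB/s per spindle):

  RAID1+0   : 12 mirror pairs * 150 MB/s = ~1.8 GB/s
  RAID10,f2 : 24 spindles     * 150 MB/s = ~3.6 GB/s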

> Now change this to 50 clients each doing 10,000 4KB file reads in a
> directory along with 10,000 4KB file writes.  The throughput of each 12
> disk array may now drop by over a factor of approximately 128, as each
> disk can only sustain about 300 head seeks/second, dropping its
> throughput to 300 * 4096 bytes = 1.17MB/s.  Kernel readahead may help
> some, but it'll still suck.
> 
> It is the occasional workload such as that above that dictates
> overbuilding the disk subsystem.  Imagine adding a high IOPS NFS client
> workload to this server after it went into production to "only" serve
> large streaming reads.  The random workload above would drop the
> performance of this 384 disk array with 15k spindles from a peak
> streaming rate of 28.4GB/s to 18MB/s--yes, that's megabytes.

Yes, random reading can diminish performance a lot.
If the mix of random/sequential reading still has a good sequential
part, then I think the system should still perform well. I think we
lack measurements for things like that, for example the incremental
sequential reading speed on a non-saturated file system. I am not sure
how to define such measures, though.
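
One way to approximate such a measurement could be fio with rate-capped
sequential readers, something like the sketch below (the directory,
sizes and rates are just guesses that would need tuning to the real
array):

  # 50 sequential readers, each capped at ~200 MB/s to mimic the NFS
  # clients; if the per-job bandwidth stays at the cap, the file system
  # is not yet saturated for this mix
  fio --directory=/mnt/bigarray --rw=read --bs=1M --size=8g \
      --numjobs=50 --rate=200m --runtime=60 --time_based \
      --group_reporting --name=seqread-nonsaturated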

> With one workload the disks can saturate the PCIe bus by almost a factor
> of two.  With an opposite workload the disks can only transfer one
> 14,000th of the PCIe bandwidth.  This is why Fortune 500 companies and
> others with extremely high random IO workloads such as databases, and
> plenty of cash, have farms with multiple thousands of disks attached to
> database and other servers.

Or use SSD.

> > It is more likely the IO system, and that would also work for
> > other file system types, like ext4. 
> 
> No.  Upper kernel layers don't provide this parallelism.  This is
> strictly an XFS feature, although JFS had something similar (and JFS is
> now all but dead), though not as performant.  BTRFS might have something
> similar but I've read nothing about BTRFS internals.  Because XFS has
> simply been the king of scalable filesystems for 15 years, and added
> great new capability along the way, all of the other filesystem
> developers have started to steal ideas from XFS.  IIRC Ted Ts'o stole
> some things from XFS for use in EXT4, but allocation groups wasn't one
> of them.
> 
> > I do not see anything in the XFS allocation
> > blocks with any knowledge of the underlying disk structure. 
> 
> The primary structure that allows for XFS parallelism is
> xfs_agnumber_t    sb_agcount
> 
> Making the filesystem with
> mkfs.xfs -d agcount=16
> 
> creates 16 allocation groups of 1.752TB each in our case, 1 per 12
> spindle array.  XFS will read/write to all 16 AGs in parallel, under the
> right circumstances, with 1 or multiple  IO streams to/from each 12
> spindle array.  XFS is the only Linux filesystem with this type of
> scalability, again, unless BTRFS has something similar.
> 
> > What the file system does is only to administer the scheduling of the
> > IO, in combination with the rest of the kernel.
> 
> Given that XFS has 64xxx lines of code, BTRFS has 46xxx, and EXT4 has
> 29xxx, I think there's a bit more to it than that Keld. ;)  Note that
> XFS has over twice the code size of EXT4.  That's not bloat but
> features, one of them being allocation groups.  If your simplistic view of
> this was correct we'd have only one Linux filesystem.  Filesystem code
> does much much more than you realize.

Oh, well, of course the file system does a lot of things. And I have
done a number of designs and patches for a number of file systems over
the years. But I was talking about the overall picture: the CPU should
not be the bottleneck, the IO is, so we use the kernel code to
administer the IO in the best possible way. I am also using XFS for
many file systems, but I am also using EXT3, and I get about the same
results for the systems I run, which also do mostly concurrent
sequential reading of many big files (an FTP server).

> > Anyway, thanks for the energy and expertise that you are supplying to
> > this thread.
> 
> High performance systems are one of my passions.  I'm glad to
> participate and share.  Speaking of sharing, after further reading on
> how the parallelism of AGs is done and some other related things, I'm
> changing my recommendation to using only 16 allocation groups of 1.752TB
> with this system, one AG per array, instead of 64 AGs of 438GB.  Using
> 64 AGs could potentially hinder parallelism in some cases.

Thank you again for your insights.
keld