* Seeking XFS tuning advice for PostgreSQL on SATA SSDs/Linux-md
@ 2014-04-15 12:23 Johannes Truschnigg
  2014-04-15 21:34 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Johannes Truschnigg @ 2014-04-15 12:23 UTC (permalink / raw)
  To: xfs

Hi list,

we're building a postgres streaming replication slave that's supposed to 
pick up work if our primary pg cluster (with an all-flash FC SAN 
appliance as its backend store) goes down. We'll be using consumer-grade 
hardware for this setup, which I detail below:

o 2x Intel Xeon E5-2630L (24 threads total)
o 512GB DDR3 ECC RDIMM
o Intel C606-based Dual 4-Port SATA/SAS HBA (PCIID 8086:1d68)
o 6x Samsung 830 SSD with 512GB each, 25% reserved for HPA
o Debian GNU/Linux 7.x "Wheezy" + backports kernel (3.13+)
o PostgreSQL 9.0

If there's anything else that is of critical interest that I forgot to 
mention, hardware- or software-wise, please let me know.

When benchmarking the individual SSDs with fio (using the libaio 
backend), the IOPS we've seen were in the 30k-35k range overall for 4K 
block sizes. The host will be on the receiving end of a pg9.0 streaming 
replication cluster setup where the master handles ~50k IOPS peak, and 
I'm thinking what'd be a good approach to design the local storage stack 
(with availability in mind) in a way that has a chance to keep up with 
our flash-based FC SAN.
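
For reference, the fio invocations looked roughly like this (device path,
queue depth and runtime here are illustrative placeholders, not the exact
values from our runs):

  fio --name=ssd-4k-randwrite --filename=/dev/sdX --direct=1 \
      --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
      --numjobs=4 --runtime=600 --time_based --group_reporting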

After digging through linux-raid archives, I think the most sensible 
approach is two-disk pairs in RAID1 that are concatenated via either 
LVM2 or md (leaning towards the latter, since I'd expect that to have a 
tad less overhead), and xfs on top of the resulting block device. That 
should yield roughly 1.2TB of usable space (we need a minimum of 900GB 
for the DB). With this setup, it should be possible to have up to 3 CPUs 
busy with handling I/O on the block side of things, which raises the 
question of what'd be a sensible value to choose for xfs' Allocation 
Group Count/agcount.
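
To make the intended layout concrete, here's a rough sketch of what I have 
in mind (device names are placeholders, not our final configuration):

  # three RAID1 pairs
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
  # concatenate the pairs and put xfs on top
  mdadm --create /dev/md0 --level=linear --raid-devices=3 \
      /dev/md1 /dev/md2 /dev/md3
  mkfs.xfs /dev/md0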

I've been trying to find information on that myself, but what I managed 
to dig up is, at times, so old that it seems rather outlandish today - 
some sources on the web (from 2003), for example, say that one AG per 
4GB of underlying diskspace makes sense, which seems excessive for a 
1200GB volume.

I've experimented with mkfs.xfs (on top of LVM only; I don't know if it 
takes lower block layers into account) and seen that it supposedly 
chooses to default to an agcount of 4, which seems insufficient given 
the max. bandwidth our setup should be able to provide.
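
What I did to check, roughly (the LV path is a placeholder):

  # dry run: only print the geometry mkfs.xfs would use
  mkfs.xfs -N /dev/vg0/pgdata
  # vs. forcing a higher AG count
  mkfs.xfs -N -d agcount=24 /dev/vg0/pgdata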

Apart from that, is there any kind of advice you can share for tuning 
xfs to run postgres (9.0 initially, but we're planning to upgrade to 9.3 
or later eventually) on in 2014, especially performance-wise?

Thanks, regards:
- Johannes

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Seeking XFS tuning advice for PostgreSQL on SATA SSDs/Linux-md
  2014-04-15 12:23 Seeking XFS tuning advice for PostgreSQL on SATA SSDs/Linux-md Johannes Truschnigg
@ 2014-04-15 21:34 ` Dave Chinner
  2014-04-16  8:21   ` Johannes Truschnigg
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2014-04-15 21:34 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: xfs

On Tue, Apr 15, 2014 at 02:23:07PM +0200, Johannes Truschnigg wrote:
> Hi list,
> 
> we're building a postgres streaming replication slave that's
> supposed to pick up work if our primary pg cluster (with an
> all-flash FC SAN appliance as its backend store) goes down. We'll be
> using consumer-grade hardware for this setup, which I detail below:
> 
> o 2x Intel Xeon E5-2630L (24 threads total)
> o 512GB DDR3 ECC RDIMM
> o Intel C606-based Dual 4-Port SATA/SAS HBA (PCIID 8086:1d68)

How much write cache does this have?

> o 6x Samsung 830 SSD with 512GB each, 25% reserved for HPA

830? That's the previous generation of drives - do you mean 840?

> o Debian GNU/Linux 7.x "Wheezy" + backports kernel (3.13+)
> o PostgreSQL 9.0
> 
> If there's anything else that is of critical interest that I forgot
> to mention, hardware- or software-wise, please let me know.
> 
> When benchmarking the individual SSDs with fio (using the libaio
> backend), the IOPS we've seen were in the 30k-35k range overall for
> 4K block sizes.

They don't sustain that performance over 20+ minutes of constant IO,
though. Even if you have 840s (I have 840 EVOs in my test rig), the
sustained performance of 4k random write IOPS is somewhere around
4-6k each. See, for example, the performance consistency graphs here:

http://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/6

Especially the last one that shows a zoomed view of the steady state
behaviour between 1400s and 2000s of constant load.

The 830 series are old enough that they were reviewed before this
was considered an important metric for SSD comparison, and so there
is no equivalent information available for them. However, they are
likely to be significantly slower and less deterministic in their
behaviour than the 840s under the same load...

> The host will be on the receiving end of a pg9.0
> streaming replication cluster setup where the master handles ~50k
> IOPS peak, and I'm thinking what'd be a good approach to design the
> local storage stack (with availability in mind) in a way that has a
> chance to keep up with our flash-based FC SAN.

I'd be surprised if it can keep up after a couple of months of
production level IO going to the SSDs...

> After digging through linux-raid archives, I think the most sensible
> approach is two-disk pairs in RAID1 that are concatenated via
> either LVM2 or md (leaning towards the latter, since I'd expect that
> to have a tad less overhead),

I'd stripe them (i.e. RAID10), not concatenate them, so as to load
both RAID1 legs evenly.
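
Something along these lines (device names and chunk size are only an
example):

  mdadm --create /dev/md0 --level=10 --raid-devices=6 --chunk=64 \
      /dev/sd[a-f]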

> and xfs on top of that resulting block
> device. That should yield roughly 1.2TB of usable space (we need a
> minimum of 900GB for the DB). With this setup, it should be possible
> to have up to 3 CPUs busy with handling I/O on the block side of
> things, which raises the question what'd be a sensible value to
> choose for xfs' Allocation Group Count/agcount.
> 
> I've been trying to find information on that myself, but what I
> managed to dig up is, at times, so old that it seems rather
> outlandish today - some sources on the web (from 2003), for example,
> say that one AG per 4GB of underlying diskspace makes sense, which
> seems excessive for a 1200GB volume.
> 
> I've experimented with mkfs.xfs (on top of LVM only; I don't know if
> it takes lower block layers into account) and seen that it supposedly
> chooses to default to an agcount of 4, which seems insufficient
> given the max. bandwidth our setup should be able to provide.

The number of AGs has no bearing on achievable bandwidth. The number
of AGs affects allocation concurrency. Hence if you have 24 CPU
cores, I'd expect that you want 32 AGs. Normally with a RAID array
this will be the default, but it seems that RAID1 is not triggering
the "optimise for allocation concurrency" heuristics in mkfs....

> Apart from that, is there any kind of advice you can share for
> tuning xfs to run postgres (9.0 initially, but we're planning to
> upgrade to 9.3 or later eventually) on in 2014, especially
> performance-wise?

Apart from the AG count and perhaps tuning the sunit/swidth to match
the RAID0 part of the equation, I wouldn't touch a thing unless you
know that there's a problem that needs fixing and you know exactly
what knob will fix the problem you have...
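
That is, at most something like this at mkfs time (the values assume a
6-drive RAID10 with a 64k chunk; substitute whatever geometry you
actually build):

  mkfs.xfs -d agcount=32,su=64k,sw=3 /dev/md0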

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Seeking XFS tuning advice for PostgreSQL on SATA SSDs/Linux-md
  2014-04-15 21:34 ` Dave Chinner
@ 2014-04-16  8:21   ` Johannes Truschnigg
  2014-04-16  9:31     ` Dave Chinner
  2014-04-16 23:31     ` Stan Hoeppner
  0 siblings, 2 replies; 5+ messages in thread
From: Johannes Truschnigg @ 2014-04-16  8:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

On 04/15/2014 11:34 PM, Dave Chinner wrote:
> On Tue, Apr 15, 2014 at 02:23:07PM +0200, Johannes Truschnigg wrote:
>> Hi list,
>> [...]
>> o Intel C606-based Dual 4-Port SATA/SAS HBA (PCIID 8086:1d68)
>
> How much write cache does this have?

It's a plain HBA; it doesn't have write cache (or a BBU) of its own.


>> o 6x Samsung 830 SSD with 512GB each, 25% reserved for HPA
>
> 830? That's the previous generation of drives - do you mean 840?

No, I really mean 830 - we've tested the 840 EVO as well, and it performed 
quite well, too; however, from what I've seen on the web, the longevity of 
Samsung's TLC flash choice in the 840 disks isn't as promising as that of 
the 830's MLC variant. We might switch over to 840 EVOs or one of 
their successors once the 830s wear out or we need to expand capacity, 
but we do have a number of 830s in stock that we'll use first.


>> When benchmarking the individual SSDs with fio (using the libaio
>> backend), the IOPS we've seen were in the 30k-35k range overall for
>> 4K block sizes.
>
> They don't sustain that performance over 20+ minutes of constant IO,
> though. Even if you have 840s (I have 840 EVOs in my test rig), the
> sustained performance of 4k random write IOPS is somewhere around
> 4-6k each. See, for example, the performance consistency graphs here:
>
> http://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/6
>
> Especially the last one that shows a zoomed view of the steady state
> behaviour between 1400s and 2000s of constant load.

I used tkperf[0] to benchmark the devices, both on Intel's SAS HBA and 
on an LSI 2108 SAS RAID controller. I did runs for the 512GB 830 with 25% 
over-provisioning, and runs for the 1TB 840 EVO with 0% op and 25% op (two 
different disks with the same firmware). tkperf tries hard to reach 
steady state by torturing the devices for a few hours, and only starts 
the actual benchmarking once that steady state has been reached.

From what I've seen, the over-provisioning is absolutely crucial to get 
anywhere near acceptable performance; since Anandtech doesn't seem to 
use it, I'll trust my tests more.

For reference: the 750GB usable-space EVO clocked in at ~35k 4k IOPS on 
the LSI 2108, whilst the 1000GB usable-space sister disk still hasn't 
finished the benchmark run, because it's _so much slower_. The benchmark 
was started about ten days ago for both disks; the 750GB disk finished 
after some 2 or 3 days, and I'm _still_ waiting for the 1000GB disk to 
finish benchmarking. Only then will I be able to look at the pretty graphs 
and tables tkperf generates, but from tailing the log and watching 
iostat, I can already draw some early conclusions about how these two 
configurations perform, and they're not in the same ballpark at all.


> The 830 series are old enough that they were reviewed before this
> was considered an important metric for SSD comparison, and so there
> is no equivalent information available for them. However, they are
> likely to be significantly slower and less deterministic in their
> behaviour than the 840s under the same load...

Afaik, the 840 EVO's relatively high peak performance stems from the DRAM 
buffer these disks supposedly have built in, while the 830 lacks that 
kind of trick. Given that the EVO's performance drops after that buffer 
has worked its magic, I'd actually expect the 830 to perform _more 
consistently_ (though not necessarily better, even on average) than the 
840 EVO. We'll see whether that holds true if/when we put 840 EVOs into 
service, I guess.


>> The host will be on the receiving end of a pg9.0
>> streaming replication cluster setup where the master handles ~50k
>> IOPS peak, and I'm thinking what'd be a good approach to design the
>> local storage stack (with availability in mind) in a way that has a
>> chance to keep up with our flash-based FC SAN.
>
> I'd be surprised if it can keep up after a couple of months of
> production level IO going to the SSDs...

Yeah, that remains to be seen, and it'll be very interesting - if 
anyone's interested, I'll be happy to share our learnings from this 
project once we have enough data worth talking about. Remember, though, 
that the numbers I posted are _peak_ load at the master; most of the time 
we don't exceed 10k IOPS, and some of the time the system is 
practically idle. That might give the SSD controllers enough time to 
work their garbage collection secret sauce magic, and sustain high(er) 
performance over most of their lifetimes.


>> After digging through linux-raid archives, I think the most sensible
>> approach is two-disk pairs in RAID1 that are concatenated via
>> either LVM2 or md (leaning towards the latter, since I'd expect that
>> to have a tad less overhead),
>
>> I'd stripe them (i.e. RAID10), not concatenate them, so as to load
> both RAID1 legs evenly.

Afaik, the problem with md is that each array (I'm pretty convinced that 
also holds true for RAID10, but I'm not 100% sure) only has one 
associated kernel thread for writes, which should make that kind of 
setup worse, at least in theory and in terms of achievable parallelism, 
than the setup I described. I'd be very happy to see a comparison 
between the two setups for high-IOPS devices, but I haven't yet found 
one anywhere.
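
The per-array kernel threads are easy to spot, by the way (thread naming
as seen on our 3.13 kernel):

  # RAID1/10/5/6 arrays each get one such thread, e.g. md0_raid1
  ps -eo pid,comm | grep -E 'md[0-9]+_'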


 > [...]
>> I've experimented with mkfs.xfs (on top of LVM only; I don't know if
>> it takes lower block layers into account) and seen that it supposedly
>> chooses to default to an agcount of 4, which seems insufficient
>> given the max. bandwidth our setup should be able to provide.
>
> The number of AGs has no bearing on achievable bandwidth. The number
> of AGs affects allocation concurrency. Hence if you have 24 CPU
> cores, I'd expect that you want 32 AGs. Normally with a RAID array
> this will be the default, but it seems that RAID1 is not triggering
> the "optimise for allocation concurrency" heuristics in mkfs....

Thanks, that is a very useful heads-up! What's the formula used to get 
to 32 AGs for 24 CPUs - just (num_cpus * 4/3), and is there a simple 
explanation for why this is an ideal starting point? And is that an 
advisable rule of thumb for xfs in general?


>> Apart from that, is there any kind of advice you can share for
>> tuning xfs to run postgres (9.0 initially, but we're planning to
>> upgrade to 9.3 or later eventually) on in 2014, especially
>> performance-wise?
>
> Apart from the AG count and perhaps tuning the sunit/swidth to match
> the RAID0 part of the equation, I wouldn't touch a thing unless you
> know that there's a problem that needs fixing and you know exactly
> what knob will fix the problem you have...

OK, I'll read up on stripe width impact and will (hopefully) have enough 
time to test a number of configs that should make sense.

Many thanks for your contribution and advice! :)

[0]: http://www.thomas-krenn.com/en/oss/tkperf.html

-- 
Mit freundlichen Grüßen
Johannes Truschnigg
Senior System Administrator
--
mailto:johannes.truschnigg@geizhals.at (in dringenden Fällen bitte an 
info@geizhals.at)

Geizhals(R) - Preisvergleich Internet Services AG
Obere Donaustrasse 63/2
A-1020 Wien
Tel: +43 1 5811609/87
Fax: +43 1 5811609/55
http://geizhals.at => Preisvergleich für Österreich
http://geizhals.de => Preisvergleich für Deutschland
http://geizhals.eu => Preisvergleich EU-weit
Handelsgericht Wien | FN 197241K | Firmensitz Wien


* Re: Seeking XFS tuning advice for PostgreSQL on SATA SSDs/Linux-md
  2014-04-16  8:21   ` Johannes Truschnigg
@ 2014-04-16  9:31     ` Dave Chinner
  2014-04-16 23:31     ` Stan Hoeppner
  1 sibling, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2014-04-16  9:31 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: xfs

On Wed, Apr 16, 2014 at 10:21:44AM +0200, Johannes Truschnigg wrote:
> Hi Dave,
> 
> On 04/15/2014 11:34 PM, Dave Chinner wrote:
> >On Tue, Apr 15, 2014 at 02:23:07PM +0200, Johannes Truschnigg wrote:
> >>Hi list,
> >>[...]
> >>o Intel C606-based Dual 4-Port SATA/SAS HBA (PCIID 8086:1d68)
> >
> >How much write cache does this have?
> 
> It's a plain HBA; it doesn't have write cache (or a BBU) of its own.

Ok, so nothing to isolate nasty bad IO patterns from the drives,
or to soak up write peaks. IOWs, what the drives give you is all
you're going to get. You might want to think about dropping $1000 on
a good quality LSI SAS RAID HBA and putting the disks behind that...

> >>o 6x Samsung 830 SSD with 512GB each, 25% reserved for HPA
> >
> >830? That's the previous generation of drives - do you mean 840?
> 
> No, I really mean 830 - we've tested 840 EVO as well, and they
> performed quite well, too, however from what I've seen on the web
> the longevity of Samsung's TLC flash choice in 840 disks isn't as
> promising as those of the 830s MLC variant. We might be switching
> over to 840 EVO or one of their successors once the 830s wear out,
> or we need to expand capacity, but we do have a number of 830s in
> stock that we'll use first.

What I've read is "there's really no difference". Yes, there are
fewer write/erase cycles for the 21nm TLC compared to the 27nm MLC in
the 830s, but the controller in the 840 is far better at handling
wear levelling.

> >>When benchmarking the individual SSDs with fio (using the libaio
> >>backend), the IOPS we've seen were in the 30k-35k range overall for
> >>4K block sizes.
> >
> >They don't sustain that performance over 20+ minutes of constant IO,
> >though. Even if you have 840s (I have 840 EVOs in my test rig), the
> >sustained performance of 4k random write IOPS is somewhere around
> >4-6k each. See, for example, the performance consistency graphs here:
> >
> >http://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/6
> >
> >Especially the last one that shows a zoomed view of the steady state
> >behaviour between 1400s and 2000s of constant load.
> 
> I used tkperf[0] to benchmark the devices, both on Intel's SAS HBA
> and on a LSI 2108 SAS RAID-Controller. I did runs for the 512GB 830
> with 25% over-provisioning, and runs for 1TB 840 EVO with 0% op and
> 25% op (two different disks with the same firmware). tkperf tries
> hard to achieve steady state by torturing the devices for a few
> hours before the actual benchmarking takes place, and will only do
> so after that steady state has been reached.
> 
> From what I've seen, the over-provisioning is absolutely crucial to
> get anywhere near acceptable performance; since Anandtech doesn't
> seem to use it, I'll trust my tests more.

Oh, they do, just not in every SSD review they do:

http://anandtech.com/show/7864/crucial-m550-review-128gb-256gb-512gb-and-1tb-models-tested/3

Unfortunately, there aren't 25% spare area numbers for the 840EVO...

> For reference: the 750GB usable-space EVO clocked in at ~35k 4k IOPS
> on the LSI 2108, whilst the 1000GB usable-space sister disk still
> hasn't finished the benchmark run, because it's _so much slower_.

Yes, apart from validation effort, that's the main difference
between consumer and enterprise SSDs; enterprise SSDs usually run with
20-25% over-provisioned space but are otherwise mostly identical
hardware and firmware to the consumer drives.  That's why you get
200, 400 and 800GB enterprise drives rather than 250, 500, and 1TB
capacities...

> >>After digging through linux-raid archives, I think the most sensible
> >>approach is two-disk pairs in RAID1 that are concatenated via
> >>either LVM2 or md (leaning towards the latter, since I'd expect that
> >>to have a tad less overhead),
> >
> >I'd stripe them (i.e. RAID10), not concatenate them, so as to load
> >both RAID1 legs evenly.
> 
> Afaik, the problem with md is that each array (I'm pretty convinced
> that also holds true for RAID10, but I'm not 100% sure) only has one
> associated kernel thread for writes,

I think it used to have a single thread for parity calculations,
which is not used for raid 0/1/10, so I don't think that's true
anymore. There were patches to multithread the parity calculations;
no idea what the status of that work ended up being...

> which should make that kind of
> setup worse, at least in theory and in terms of achievable
> parallelism, than the setup I described. I'd be very happy to see a
> comparison between the two setups for high-IOPS devices, but I
> haven't yet found one anywhere.

I don't think it makes any difference at all. I have both LVM and MD
RAID 0 SSD stripes, and neither MD nor DM are the performance
limiting factor, nor do they show up anywhere in profiles.
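
That is, nothing md/dm-related shows up near the top when profiling a
test run with something like this (jobfile.fio being whatever fio job
you're running):

  perf top -g
  # or capture a specific run and look at where the CPU time went
  perf record -g -- fio jobfile.fio
  perf report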

> > [...]
> >>I've experimented with mkfs.xfs (on top of LVM only; I don't know if
> >>it takes lower block layers into account) and seen that it supposedly
> >>chooses to default to an agcount of 4, which seems insufficient
> >>given the max. bandwidth our setup should be able to provide.
> >
> >The number of AGs has no bearing on achievable bandwidth. The number
> >of AGs affects allocation concurrency. Hence if you have 24 CPU
> >cores, I'd expect that you want 32 AGs. Normally with a RAID array
> >this will be the default, but it seems that RAID1 is not triggering
> >the "optimise for allocation concurrency" heuristics in mkfs....
> 
> Thanks, that is a very useful heads-up! What's the formula used to
> get to 32 AGs for 24 CPUs - just (num_cpus * 4/3), and is there a
> simple explanation for why this is an ideal starting point? And is
> that an advisable rule of thumb for xfs in general?

Simple explanation: 32 is the default for RAID5/6 based devices
between 1-32TB in size.

General rule of thumb:

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

> >>Apart from that, is there any kind of advice you can share for
> >>tuning xfs to run postgres (9.0 initially, but we're planning to
> >>upgrade to 9.3 or later eventually) on in 2014, especially
> >>performance-wise?
> >
> >Apart from the AG count and perhaps tuning the sunit/swidth to match
> >the RAID0 part of the equation, I wouldn't touch a thing unless you
> >know that there's a problem that needs fixing and you know exactly
> >what knob will fix the problem you have...
> 
> OK, I'll read up on stripe width impact and will (hopefully) have
> enough time to test a number of configs that should make sense.

http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance
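
The short version, for a 6-drive RAID10 (i.e. 3 data-bearing legs) with a
64k chunk - the numbers here are only an example:

  # su = chunk size of the striped layer
  # sw = number of data-bearing legs in the stripe
  mkfs.xfs -d su=64k,sw=3 /dev/md0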

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Seeking XFS tuning advice for PostgreSQL on SATA SSDs/Linux-md
  2014-04-16  8:21   ` Johannes Truschnigg
  2014-04-16  9:31     ` Dave Chinner
@ 2014-04-16 23:31     ` Stan Hoeppner
  1 sibling, 0 replies; 5+ messages in thread
From: Stan Hoeppner @ 2014-04-16 23:31 UTC (permalink / raw)
  To: Johannes Truschnigg, Dave Chinner; +Cc: NeilBrown, xfs

On 4/16/2014 3:21 AM, Johannes Truschnigg wrote:
> On 04/15/2014 11:34 PM, Dave Chinner wrote:
...
>>> After digging through linux-raid archives, I think the most sensible
>>> approach is two-disk pairs in RAID1 that are concatenated via
>>> either LVM2 or md (leaning towards the latter, since I'd expect that
>>> to have a tad less overhead),
>>
>> I'd stripe them (i.e. RAID10), not concatenate them, so as to load
>> both RAID1 legs evenly.
> 
> Afaik, the problem with md is that each array (I'm pretty convinced that
> also holds true for RAID10, but I'm not 100% sure) only has one
> associated kernel thread for writes, which should make that kind of

Neil will surely correct me if I missed any relatively recent patches
that may have changed this.  Single write thread personalities (ones people
actually use):

RAID 1, 10, 5, 6

Unbound personalities:

RAID 0, linear

> setup worse, at least in theory and in terms of achievable parallelism,
> than the setup I described. I'd be very happy to see a comparison
> between the two setups for high-IOPS devices, but I haven't yet found
> one anywhere.

I can't provide such a head-to-head comparison but I can provide some
insight.  With a plain HBA, 6 SSDs, and md you should test RAID50 for
this workload: an md RAID0 over two 3-drive RAID5 arrays.

Your dual socket 6 core Sandy Bridge 15MB L3 parts are 2GHz, boost clock
2.5GHz.  I've been doing tuning for a colleague with a single socket 4
core Ivy Bridge 8MB L3 part at 3.3GHz, boost clock 3.7GHz, Intel board
w/C202 ASIC, 8GB two channel DDR3, 9211-8i PCIe 2.0 x8 HBA (LSISAS 2008
ASIC), and currently 7, previously 5, Intel 520 series 480GB consumer
SSDs, no over provisioning.  These use the SandForce 2281 controller
which relies on compression for peak performance.

The array is md RAID5, metadata 1.2, 64KB chunk, stripe_cache_size 4096,
reshaped from 5 to 7 drives recently.  The system is an iSCSI target
server, poor man's SAN, and has no filesystems for the most part.  The
md device is carved into multiple LVs which are exported as LUNs, w/one
50GB LV reserved for testing/benchmarking.  With the single RAID5 write
thread we're achieving 318k parallel FIO 4KB random read IOPS, 45k per
drive, since all 7 drives are in play for reads (there is no parity block
skipping as with rust).  We see a shade over 59k 4KB random write IOPS,
~10k IOPS per drive, using parallel submission, zero_buffers for
compressibility, libaio, etc.  The apparently low 59k figure appears
entirely due to GC, as you can see the latency start small and ramp up
quickly two paragraphs below.
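
The write test was submitted along these lines - a sketch from memory
rather than the exact invocation, with the LV path as a placeholder:

  fio --name=randwrite --filename=/dev/vg0/fio-test --ioengine=libaio \
      --direct=1 --rw=randwrite --bs=4k --iodepth=16 --numjobs=16 \
      --runtime=60 --time_based --zero_buffers --group_reporting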

10k per drive is in line with Intel's lowest number for the 520s 480GB
model of 9.5k IOPS, but theirs is for incompressible data.  Given Dave's
4-6k for the 840 EVO, I'd say this is probably representative of hi-po
consumer SSDs with no over-provisioning that are saturated and not being
TRIM'd.

CPU core burn during the write test averaged ~50% with a peak of ~58%, 15
%us and 35 %sy, with the 15% being IO submission, 35% the RAID5 thread,
w/average 40-50 %wa.

> Starting 32 threads
> 
> read: (groupid=0, jobs=16): err= 0: pid=36459
>   read : io=74697MB, bw=1244.1MB/s, iops=318691 , runt= 60003msec
>     slat (usec): min=0 , max=999873 , avg= 5.90, stdev=529.35
>     clat (usec): min=0 , max=1002.4K, avg=795.43, stdev=5201.15
>      lat (usec): min=0 , max=1002.4K, avg=801.56, stdev=5233.38
>     clat percentiles (usec):
>      |  1.00th=[    0],  5.00th=[  213], 10.00th=[  286], 20.00th=[ 366],
>      | 30.00th=[  438], 40.00th=[  516], 50.00th=[  604], 60.00th=[ 708],
>      | 70.00th=[  860], 80.00th=[ 1096], 90.00th=[ 1544], 95.00th=[ 1928],
>      | 99.00th=[ 2608], 99.50th=[ 2800], 99.90th=[ 3536], 99.95th=[ 4128],
>      | 99.99th=[15424]
>     bw (KB/s)  : min=22158, max=245376, per=6.39%, avg=81462.59, stdev=22339.85
>     lat (usec) : 2=3.34%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>     lat (usec) : 100=0.01%, 250=3.67%, 500=31.43%, 750=24.55%, 1000=13.33%
>     lat (msec) : 2=19.37%, 4=4.25%, 10=0.04%, 20=0.01%, 50=0.01%
>     lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
>   cpu          : usr=30.27%, sys=236.67%, ctx=239859018, majf=0, minf=64588
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=19122474/w=0/d=0, short=r=0/w=0/d=0
> write: (groupid=1, jobs=16): err= 0: pid=38376
>   write: io=13885MB, bw=236914KB/s, iops=59228 , runt= 60016msec
>     slat (usec): min=2 , max=25554K, avg=25.74, stdev=17219.99
>     clat (usec): min=122 , max=43459K, avg=4294.06, stdev=100111.47
>      lat (usec): min=129 , max=43459K, avg=4319.92, stdev=101581.66
>     clat percentiles (usec):
>      |  1.00th=[  482],  5.00th=[  628], 10.00th=[  748], 20.00th=[ 996],
>      | 30.00th=[ 1320], 40.00th=[ 1784], 50.00th=[ 2352], 60.00th=[ 3056],
>      | 70.00th=[ 4192], 80.00th=[ 5920], 90.00th=[ 8384], 95.00th=[10816],
>      | 99.00th=[17536], 99.50th=[20096], 99.90th=[57088], 99.95th=[67072],
>      | 99.99th=[123392]
>     bw (KB/s)  : min=   98, max=25256, per=6.74%, avg=15959.71, stdev=2969.06
>     lat (usec) : 250=0.01%, 500=1.25%, 750=8.72%, 1000=10.13%
>     lat (msec) : 2=23.87%, 4=24.78%, 10=24.87%, 20=5.85%, 50=0.39%
>     lat (msec) : 100=0.11%, 250=0.01%, 750=0.01%, 2000=0.01%, >=2000=0.01%
>   cpu          : usr=5.47%, sys=39.74%, ctx=54762279, majf=0, minf=62375
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=3554662/d=0, short=r=0/w=0/d=0 

If a 3.3GHz Ivy Bridge core w/8MB shared L3 can do ~60k random write
IOPS, 70k w/parity with one md RAID5 thread and 64KB chunk, at ~50% core
utilization, it seems reasonable it could do ~120/140k IOPS w/wo parity
at 100% core utilization.

A 2GHz Sandy Bridge core has 61% of the clock, and with almost double
the L3 should have ~66% of the performance.

((0.66 * 140k) / 3 = 30.8k IOPS per drive) * 2 drives = ~61k RAID5 4KB IOPS

Thus, two 3-drive md RAID5 arrays nested in an md RAID0 stripe and
optimally configured (see below) should yield ~122k or more random
4KB IOPS, SSD limited.  With 3x mirrors and your 830s you get ~35k per
spindle, ~105k IOPS aggregate with 3 write threads using maybe 5-10%
each of 3 cores.  You get redundancy against SSD controller, power
circuit, or board failure, but not against flash wear failure, as each
mirror sees 100% of the redundant byte writes.

Given you have 12 cores (disable HT as it will decrease md performance),
10 of them likely perennially idle, the better solution may be RAID50.

Doing this...

1.  Tweak IRQ affinity to keep interrupts off the md thread cores
2.  Pin RAID5 threads to cores on different NUMA nodes,
    different L3 domains, so each has 15MB of L3 available,
    core 5 on each socket is good as the process scheduler will
    hit them last
3.  Use 16KB RAID5 chunk, 32KB RAID0 chunk yielding 64KB outer stripe
4.  Set stripe_cache_size to 4096

Should gain you this...

1.  +~17k IOPS over 3x mirrors
2.  +1 drive capacity or +85GB/drive over provisioning
3.  ~33% lower flash wear and bandwidth

For the cost of two fully utilized cores at peak IOPS.
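
A sketch of the md side of that layout (device names are placeholders,
chunk sizes as per points 3 and 4 above):

  # two 3-drive RAID5 arrays, 16KB chunk
  mdadm --create /dev/md1 --level=5 --raid-devices=3 --chunk=16 /dev/sd[abc]
  mdadm --create /dev/md2 --level=5 --raid-devices=3 --chunk=16 /dev/sd[def]
  # RAID0 over the two, 32KB chunk -> 64KB outer stripe
  mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=32 \
      /dev/md1 /dev/md2
  # larger stripe cache for each RAID5 array
  echo 4096 > /sys/block/md1/md/stripe_cache_size
  echo 4096 > /sys/block/md2/md/stripe_cache_size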

You have primarily a DB replication workload and the master's workload
in a failover situation.  In both cases your write IO will be to one or
more journals and one or more DB files, and some indexes.  Very few
files will be created and the existing files will be modified in place
via mmap or simply appended in the case of the journals.  So this
workload has little if any allocation.  Is this correct?

If so you'd want a small stripe width and chunk size to get relatively
decent IO distribution across the nested RAID5 arrays.  Matching chunk
size to the erase block size as some recommend is irrelevant here
because all your random IOs are a tiny fraction of the erase block size.
The elevator (assuming noop) will merge some IOs, as will the SSD
itself, so you won't get erase block rewrites for each 4KB IO.  md will
be unable to write full stripes, so using a big chunk/stripe is pretty
useless here and just adds read overhead.

If you mkfs.xfs this md RAID0 device using the defaults it will align to
su=32KB sw=2 and create 16 AGs, unless the default has changed.
Regardless, XFS alignment to RAID geometry should be largely irrelevant
for a transactional DB workload that performs very few allocations but
mostly mmap'd modify-in-place and append operations to a small set of files.

>> [...]
>>> I've experimented with mkfs.xfs (on top of LVM only; I don't know if
>>> it takes lower block layers into account) and seen that it supposedly
>>> chooses to default to an agcount of 4, which seems insufficient
>>> given the max. bandwidth our setup should be able to provide.
>>
>> The number of AGs has no bearing on achievable bandwidth... 

with striped storage.  With concat setups it can make a big difference.
 Concat is out of scope for this discussion, but it will be covered in
detail in the documentation I'm currently working on with much expert
input from Dave.

>> The number
>> of AGs affects allocation concurrency. Hence if you have 24 CPU
>> cores, I'd expect that you want 32 AGs. Normally with a RAID array
>> this will be the default, 

You mean just striped md/dm arrays right?  AFAIK we can't yet poll
hardware RAIDs for geometry as no standard exists.  Also, was the
default agcount for striped md/dm arrays changed from the static 16 to
32, or was some intelligence added?  I admit I don't keep up with all
the patches, but if this was in the subject I'd think it would have
caught my eye.  This info would be meaningful and useful to me whereas
most patches are over my head. :(

>> but it seems that RAID1 is not triggering
>> the "optimise for allocation concurrency" heuristics in mkfs....

I thought XFS only does this for md/dm arrays with stripe geometry.  Using
a nested stripe it should kick in though.

> Thanks, that is a very useful heads-up! What's the formula used to get
> to 32 AGs for 24 CPUs - just (num_cpus * 4/3), 

Note Dave says "allocation concurrency", and what I stated up above
about typical database workloads not doing much allocation.  If yours is
typical then more AGs won't yield any additional performance.

> and is there a simple
> explanation for why this is an ideal starting point? And is that an
> advisable rule of thumb for xfs in general?

More AGs can be useful if you have parallel allocation to at least one
directory in each AG.  However with striping this doesn't provide a lot
of extra bang for the buck.  With concatenated storage and proper
file/dir/AG layout it can provide large parallel scalability of IOPS
and/or throughput depending on the hardware, for both files and metadata.
 Wait a few months for me to finish the docs.  Explaining AG
optimization requires too much text for an email exchange.  Dave and I
have done it before, somewhat piecemeal, and that's in the archives.
For your workload and SSDs AGs make zero difference.

>>> Apart from that, is there any kind of advice you can share for
>>> tuning xfs to run postgres (9.0 initially, but we're planning to
>>> upgrade to 9.3 or later eventually) on in 2014, especially
>>> performance-wise?
>>
>> Apart from the AG count and perhaps tuning the sunit/swidth to match
>> the RAID0 part of the equation, I wouldn't touch a thing unless you
>> know that there's a problem that needs fixing and you know exactly
>> what knob will fix the problem you have...

Nothing more than has already been stated.

> OK, I'll read up on stripe width impact and will (hopefully) have enough
> time to test a number of configs that should make sense.

Again, chunk/stripe won't matter much for a typical transactional DB if
using few files and no allocation.

Hope my added input is useful and valuable, and that Dave knows I was
appending some of his remarks for clarity, not attempting to correct
them. :)

Cheers,

Stan

