* gather write metrics on multiple files
@ 2014-10-09  4:40 Stan Hoeppner
  2014-10-09  4:49 ` Joe Landman
  2014-10-09 21:07 ` Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Stan Hoeppner @ 2014-10-09  4:40 UTC (permalink / raw)
  To: xfs

Does anyone know of a utility that can track writes to files in an XFS directory tree, or filesystem wide for that matter, and gather filesystem blocks written per second data, or simply KiB/s, etc?  I need to analyze an application's actual IO behavior to see if it matches what I'm being told the application is supposed to be doing.

Thanks,
Stan


* Re: gather write metrics on multiple files
  2014-10-09  4:40 gather write metrics on multiple files Stan Hoeppner
@ 2014-10-09  4:49 ` Joe Landman
  2014-10-09  5:24   ` Stan Hoeppner
  2014-10-09 21:07 ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Joe Landman @ 2014-10-09  4:49 UTC (permalink / raw)
  To: xfs

On 10/09/2014 12:40 AM, Stan Hoeppner wrote:
> Does anyone know of a utility that can track writes to files in an
> XFS directory tree, or filesystem wide for that matter, and gather
> filesystem blocks written per second data, or simply KiB/s, etc?  I
> need to analyze an application's actual IO behavior to see if it
> matches what I'm being told the application is supposed to be doing.
>

We've written a few for this purpose (local IO probing).

Start with collectl (which reads /proc/diskstats), among others.  Our tools 
also read /proc/diskstats and use it to compute BW and IOPS per device.

If you need to log it for a long time, set up a time series database (we 
use influxdb and the graphite plugin).  Then grab your favorite metrics 
tool that talks to graphite/influxdb (I like 
https://github.com/joelandman/sios-metrics for obvious reasons), and 
start collecting data.
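
If you just want something quick before standing all of that up, a minimal 
sketch along these lines (the 10 second interval is only an example) samples 
/proc/diskstats twice and prints per-device write bandwidth and IOPS:

#!/usr/bin/env python
# Sample /proc/diskstats twice; report write BW and write IOPS per device.
# Per Documentation/iostats.txt: field 8 = writes completed,
# field 10 = sectors written (512-byte sectors).
import time

def snapshot():
    stats = {}
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            stats[fields[2]] = (int(fields[7]), int(fields[9]))
    return stats

interval = 10.0
before = snapshot()
time.sleep(interval)
after = snapshot()

for dev in sorted(after):
    if dev not in before:
        continue
    d_ios = after[dev][0] - before[dev][0]
    d_kib = (after[dev][1] - before[dev][1]) * 512 / 1024.0
    if d_ios:
        print("%-10s %10.1f KiB/s %8.1f write IOPS"
              % (dev, d_kib / interval, d_ios / interval))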

> Thanks, Stan
>


* Re: gather write metrics on multiple files
  2014-10-09  4:49 ` Joe Landman
@ 2014-10-09  5:24   ` Stan Hoeppner
  2014-10-09 21:13     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-10-09  5:24 UTC (permalink / raw)
  To: xfs

On 10/08/2014 11:49 PM, Joe Landman wrote:
> On 10/09/2014 12:40 AM, Stan Hoeppner wrote:
>> Does anyone know of a utility that can track writes to files in an
>> XFS directory tree, or filesystem wide for that matter, and gather
>> filesystem blocks written per second data, or simply KiB/s, etc?  I
>> need to analyze an application's actual IO behavior to see if it
>> matches what I'm being told the application is supposed to be doing.
>>
> 
> We've written a few for this purpose (local IO probing).
> 
> Start with collectl (looks at /proc/diskstats), and others.  Our tools go to /proc/diskstats, and use this to compute BW and IOPs per device.
> 
> If you need to log it for a long time, set up a time series database (we use influxdb and the graphite plugin).  Then grab your favorite metrics tool that talks to graphite/influxdb (I like https://github.com/joelandman/sios-metrics for obvious reasons), and start collecting data.

I'm told we have 800 threads writing to nearly as many files concurrently on a single XFS on a 12+2 spindle RAID6 LUN.  Achieved data rate is currently ~300 MiB/s.  Some of these files are supposedly being written at a rate of only 32 KiB every 2-3 seconds, while some (two) are ~50 MiB/s.  I need to determine how many bytes we're writing to each of the low rate files, and how many files, to figure out RMW mitigation strategies.  Out of the apparent 800 streams, 700 are these low data rate suckers, one stream writing per file.

Nary a stock RAID controller is going to be able to assemble full stripes out of these small slow writes.  With a 768 KiB stripe that's 24 writes to fill it, so what, 48-72 seconds at one 32 KiB IO every 2-3 seconds?  I've been playing with bcache for a few days but it actually drops throughput by about 30% no matter how I turn its knobs.  Unless I can get Kent to respond to some of my questions bcache will be a dead end.  I had high hopes for it, thinking it would turn these small random IOs into larger sequential writes.  It may actually be doing so, but it's doing something else too, and badly.  IO times go through the roof once bcache starts gobbling IOs, and throughput to the LUNs drops significantly even though bcache is writing 50-100 MiB/s to the SSD.  Not sure what's causing that.
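
Back of the envelope, using the numbers above (my rough figures, nothing measured):

# Rough stripe-fill arithmetic for one slow stream (figures from above).
stripe_kib   = 768                # full stripe width
io_kib       = 32                 # slow stream write size
intervals    = (2, 3)             # seconds between writes
slow_streams = 700

ios_per_stripe = stripe_kib // io_kib                # 24 writes per stripe
fill_secs = [ios_per_stripe * s for s in intervals]  # 48 to 72 seconds
# If the controller tried to assemble full stripes for every slow stream
# it would have to hold roughly this much dirty data the whole time:
cache_mib = slow_streams * stripe_kib / 1024.0       # ~525 MiB
print(ios_per_stripe, fill_secs, cache_mib)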


>> Thanks, Stan
>>


* Re: gather write metrics on multiple files
  2014-10-09  4:40 gather write metrics on multiple files Stan Hoeppner
  2014-10-09  4:49 ` Joe Landman
@ 2014-10-09 21:07 ` Dave Chinner
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2014-10-09 21:07 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Wed, Oct 08, 2014 at 11:40:47PM -0500, Stan Hoeppner wrote:
> Does anyone know of a utility that can track writes to files in an
> XFS directory tree, or filesystem wide for that matter, and gather
> filesystem blocks written per second data, or simply KiB/s, etc?
> I need to analyze an application's actual IO behavior to see if it
> matches what I'm being told the application is supposed to be
> doing.

iotop uses the task accounting infrastructure to give you
per-process IO stats:

Total DISK READ :      27.18 K/s | Total DISK WRITE :     395.10 M/s
Actual DISK READ:      27.18 K/s | Actual DISK WRITE:     395.10 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
14822 be/4 dave       27.18 K/s  395.10 M/s  0.00 % 87.43 % dd if=/dev/zero of=foo bs=1024k count=16384 oflag=direct
 5430 be/4 dave        0.00 B/s    3.88 K/s  0.00 %  0.00 % plasma-desktop
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init [2]
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
....

For whole device stats, "iostat -d -x -m <device> 5" will get you 5
second averages of the IO to a device, request sizes, queue depths,
latencies, etc. blktrace/blkparse can give you way more information
than you care about in the IO subsystem - iowatcher (formerly
seekwatcher) is a good visualisation tool that sits on top of
blktrace...

PCP (i.e. using pmchart for live graphing of performance metrics) is
probably useful here, too.

As for limiting the stats gathering to a subset of the filesystem
there isn't really anything that does that. You probably would need
to do some perf or systemtap magic to filter read/write ops
in some way. A good place to start might be here:

http://www.brendangregg.com/perf.html

Specifically his iosnoop tool is a simple template that you can
customise to your own needs:

http://www.brendangregg.com/blog/2014-07-16/iosnoop-for-linux.html
http://www.brendangregg.com/blog/2014-07-23/linux-iosnoop-latency-heat-maps.html
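
Failing that, a dumb approximation that is sometimes enough: poll the
directory tree and report apparent per-file write rates from file size
growth.  A rough sketch (it only sees st_size growth via stat(), so it
misses in-place overwrites and tells you nothing about IO sizes):

#!/usr/bin/env python
# Poll a directory tree and report apparent per-file write rates based
# on st_size growth.  Misses in-place overwrites; rough approximation.
import os, sys, time

def sizes(root):
    out = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                out[path] = os.stat(path).st_size
            except OSError:
                pass
    return out

root = sys.argv[1]
interval = 5.0
prev = sizes(root)
while True:
    time.sleep(interval)
    cur = sizes(root)
    for path in sorted(cur):
        delta = cur[path] - prev.get(path, cur[path])
        if delta > 0:
            print("%10.1f KiB/s  %s" % (delta / 1024.0 / interval, path))
    print("----")
    prev = cur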

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: gather write metrics on multiple files
  2014-10-09  5:24   ` Stan Hoeppner
@ 2014-10-09 21:13     ` Dave Chinner
  2014-10-09 22:30       ` Stan Hoeppner
  2014-10-18  6:03       ` Stan Hoeppner
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2014-10-09 21:13 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Thu, Oct 09, 2014 at 12:24:20AM -0500, Stan Hoeppner wrote:
> On 10/08/2014 11:49 PM, Joe Landman wrote:
> > On 10/09/2014 12:40 AM, Stan Hoeppner wrote:
> >> Does anyone know of a utility that can track writes to files in
> >> an XFS directory tree, or filesystem wide for that matter, and
> >> gather filesystem blocks written per second data, or simply
> >> KiB/s, etc?  I need to analyze an application's actual IO
> >> behavior to see if it matches what I'm being told the
> >> application is supposed to be doing.
> >>
> > 
> > We've written a few for this purpose (local IO probing).
> > 
> > Start with collectl (looks at /proc/diskstats), and others.  Our
> > tools go to /proc/diskstats, and use this to compute BW and IOPs
> > per device.
> > 
> > If you need to log it for a long time, set up a time series
> > database (we use influxdb and the graphite plugin).  Then grab
> > your favorite metrics tool that talks to graphite/influxdb (I
> > like https://github.com/joelandman/sios-metrics for obvious
> > reasons), and start collecting data.
> 
> I'm told we have 800 threads writing to nearly as many files
> concurrently on a single XFS on a 12+2 spindle RAID6 LUN.
> Achieved data rate is currently ~300 MiB/s.  Some of these are
> files are supposedly being written at a rate of only 32KiB every
> 2-3 seconds, while some (two) are ~50 MiB/s.  I need to determine
> how many bytes we're writing to each of the low rate files, and
> how many files, to figure out RMW mitigation strategies.  Out of
> the apparent 800 streams 700 are these low data rate suckers, one
> stream writing per file.  
> 
> Nary a stock RAID controller is going to be able to assemble full
> stripes out of these small slow writes.  With a 768 KiB stripe
> that's what, 24 seconds to fill it at 2 seconds per 32 KiB IO?

Raid controllers don't typically have the resources to track
hundreds of separate write streams at a time. Most don't have the
memory available to track that many active write streams, and those
that do probably can't prioritise writeback sanely given how slowly
most cachelines would be touched. The fast writers would simply turn
over the slower writers' caches way too quickly.

Perhaps you need to change the application to make the slow writers
buffer stripe sized writes in memory and flush them 768k at a
time...
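
i.e. something like this per slow stream (a sketch of the idea only,
not a drop-in; the 768k matches your stated stripe width):

# Per-stream accumulation: collect the slow 32 KiB records in memory
# and write them out one full 768 KiB stripe at a time.
STRIPE = 768 * 1024

class StreamWriter(object):
    def __init__(self, path):
        self.f = open(path, 'ab')
        self.buf = bytearray()

    def append(self, record):
        self.buf.extend(record)
        while len(self.buf) >= STRIPE:
            self.f.write(self.buf[:STRIPE])    # one full-stripe write
            del self.buf[:STRIPE]

    def close(self):
        if self.buf:                           # final partial stripe
            self.f.write(self.buf)
        self.f.close()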

> I've been playing with bcache for a few days but it actually drops
> throughput by about 30% no matter how I turn its knobs.  Unless I
> can get Kent to respond to some of my questions bcache will be a
> dead end.  I had high hopes for it, thinking it would turn these
> small random IOs into larger sequential writes.  It may actually
> be doing so, but it's doing something else too, and badly.  IO
> times go through the roof once bcache starts gobbling IOs, and
> throughput to the LUNs drops significantly even though bcache is
> writing 50-100 MIB/s to the SSD.  Not sure what's causing that.

Have you tried dm-cache?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: gather write metrics on multiple files
  2014-10-09 21:13     ` Dave Chinner
@ 2014-10-09 22:30       ` Stan Hoeppner
  2014-10-18  6:03       ` Stan Hoeppner
  1 sibling, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2014-10-09 22:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 10/09/2014 04:13 PM, Dave Chinner wrote:
> On Thu, Oct 09, 2014 at 12:24:20AM -0500, Stan Hoeppner wrote:
>> On 10/08/2014 11:49 PM, Joe Landman wrote:
>>> On 10/09/2014 12:40 AM, Stan Hoeppner wrote:
>>>> Does anyone know of a utility that can track writes to files in
>>>> an XFS directory tree, or filesystem wide for that matter, and
>>>> gather filesystem blocks written per second data, or simply
>>>> KiB/s, etc?  I need to analyze an application's actual IO
>>>> behavior to see if it matches what I'm being told the
>>>> application is supposed to be doing.
>>>>
>>>
>>> We've written a few for this purpose (local IO probing).
>>>
>>> Start with collectl (looks at /proc/diskstats), and others.  Our
>>> tools go to /proc/diskstats, and use this to compute BW and IOPs
>>> per device.
>>>
>>> If you need to log it for a long time, set up a time series
>>> database (we use influxdb and the graphite plugin).  Then grab
>>> your favorite metrics tool that talks to graphite/influxdb (I
>>> like https://github.com/joelandman/sios-metrics for obvious
>>> reasons), and start collecting data.
>>
>> I'm told we have 800 threads writing to nearly as many files
>> concurrently on a single XFS on a 12+2 spindle RAID6 LUN.
>> Achieved data rate is currently ~300 MiB/s.  Some of these are
>> files are supposedly being written at a rate of only 32KiB every
>> 2-3 seconds, while some (two) are ~50 MiB/s.  I need to determine
>> how many bytes we're writing to each of the low rate files, and
>> how many files, to figure out RMW mitigation strategies.  Out of
>> the apparent 800 streams 700 are these low data rate suckers, one
>> stream writing per file.  
>>
>> Nary a stock RAID controller is going to be able to assemble full
>> stripes out of these small slow writes.  With a 768 KiB stripe
>> that's what, 24 seconds to fill it at 2 seconds per 32 KiB IO?
> 
> Raid controllers don't typically have the resources to track
> hundreds of separate write streams at a time. Most don't have the
> memory available to track that many active write streams, and those
> that do probably can't proritise writeback sanely given how slowly
> most cachelines would be touched. The fast writers would simply tune
> over the slower writer caches way too quickly.
> 
> Perhaps you need to change the application to make the slow writers
> buffer stripe sized writes in memory and flush them 768k at a
> time...

Just started on that earlier today.  Turns out the buffer sizes and buffer count are configurable, not hard coded, at least in the test harness.  AIUI the actual application creates variable sized buffers on the fly, in which case the test harness doesn't accurately simulate the real app.  So the numbers we might achieve optimizing the harness may not reflect reality, and the same goes for the storage subsystem tweaks.  Which brings up a whole other set of questions regarding what we're actually doing....
 
>> I've been playing with bcache for a few days but it actually drops
>> throughput by about 30% no matter how I turn its knobs.  Unless I
>> can get Kent to respond to some of my questions bcache will be a
>> dead end.  I had high hopes for it, thinking it would turn these
>> small random IOs into larger sequential writes.  It may actually
>> be doing so, but it's doing something else too, and badly.  IO
>> times go through the roof once bcache starts gobbling IOs, and
>> throughput to the LUNs drops significantly even though bcache is
>> writing 50-100 MIB/s to the SSD.  Not sure what's causing that.
> 
> Have you tried dm-cache?

Not yet.  I have a feeler out to our Dell rep WRT LSI's CacheCade, since Dell PERCs are OEM LSIs.  Initially it appears it is optimized for read caching, as bcache seems to be.  I've tested bcache on 3.12 and 3.17 and its write throughput on the latter is even worse, both being ~30-40% lower than native.  Latency goes through the roof, and iostat shows the distribution across the LUNs is horribly uneven.  Atop that, running iotop at 1 second intervals shows no IO on one LUN or the other for 1-2 seconds at a time.  And when both do show IO the rates are up and down all over the place.  Running native IO the rates are constant and the spread between LUNs is 2-3%.  Not sure what the problem is here with bcache, but it certainly doesn't behave as I expected it would.  It's really quite horrible for this workload.  And that's when attempting to push only ~360 MB/s per LUN.

Kent doesn't seem interested in assisting thus far.  Which is a shame.  Having bcache running on hundreds of systems of this caliber would be a feather in his cap, and a validation of his work.  Natively we're currently achieving about 2.3 GB/s write throughput across 14 RAID6 LUNs (12+2) on two controllers in an active-active multipath setup.  If bcache was performing how I'd think it should, with the right number of SSDs I'd think we could be knocking on the 3 GB/s door.

Thanks,
Stan



* Re: gather write metrics on multiple files
  2014-10-09 21:13     ` Dave Chinner
  2014-10-09 22:30       ` Stan Hoeppner
@ 2014-10-18  6:03       ` Stan Hoeppner
  2014-10-18 18:16         ` Stan Hoeppner
  1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-10-18  6:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 10/09/2014 04:13 PM, Dave Chinner wrote:
...
>> I'm told we have 800 threads writing to nearly as many files
>> concurrently on a single XFS on a 12+2 spindle RAID6 LUN.
>> Achieved data rate is currently ~300 MiB/s.  Some of these are
>> files are supposedly being written at a rate of only 32KiB every
>> 2-3 seconds, while some (two) are ~50 MiB/s.  I need to determine
>> how many bytes we're writing to each of the low rate files, and
>> how many files, to figure out RMW mitigation strategies.  Out of
>> the apparent 800 streams 700 are these low data rate suckers, one
>> stream writing per file.  
>>
>> Nary a stock RAID controller is going to be able to assemble full
>> stripes out of these small slow writes.  With a 768 KiB stripe
>> that's what, 24 seconds to fill it at 2 seconds per 32 KiB IO?
> 
> Raid controllers don't typically have the resources to track
> hundreds of separate write streams at a time. Most don't have the
> memory available to track that many active write streams, and those
> that do probably can't proritise writeback sanely given how slowly
> most cachelines would be touched. The fast writers would simply tune
> over the slower writer caches way too quickly.
> 
> Perhaps you need to change the application to make the slow writers
> buffer stripe sized writes in memory and flush them 768k at a
> time...

All buffers are now 768K multiples--6144, 768, and 768--and I'm told the app should be writing out full buffers.  However I'm not seeing the throughput increase I should given the amount by which the RMWs should have decreased, which, if my math is correct, should be about half (80) the raw actuator seek rate of these drives (7.2k SAS).  Something isn't right.  I'm guessing it's the controller firmware, maybe the test app, or both.  The test app backs off then ramps up as response times at the controller go up and come back down.  And it's not super accurate or timely about it.  The lowest interval setting possible is 10 seconds, which is way too high when a controller goes into congestion.

Does XFS give alignment hints with O_DIRECT writes into preallocated files?  The filesystems were aligned at make time w/768K stripe width, so each prealloc file should be aligned on a stripe boundary.  I've played with the various queue settings, even tried deadline instead of noop hoping more LBAs could be sorted before hitting the controller, but can't seem to get a repeatable increase.  I've got nr_requests at 524288, rq_affinity 2, read_ahead_kb 0 since reads are <20% of the IO, add_random 0, etc.  Nothing seems to help really.
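
For reference, here's roughly the write path I assume the harness uses per buffer (my own sketch, not the harness code; the path is made up, and mmap is just an easy way to get an aligned buffer for O_DIRECT):

# Rough sketch of one stripe-aligned O_DIRECT write (not the harness
# code; the path is made up).  O_DIRECT needs buffer, offset and length
# aligned; a 768 KiB buffer and offset satisfy both the 4 KiB logical
# block requirement and the stripe width.
import mmap, os

STRIPE = 768 * 1024

fd = os.open('/mnt/lun0/stream042/file000', os.O_WRONLY | os.O_DIRECT)
buf = mmap.mmap(-1, STRIPE)                # anonymous map is page-aligned
buf[:] = b'\0' * STRIPE                    # stand-in for the stream's data

offset = 0                                 # must stay a multiple of STRIPE
os.lseek(fd, offset, os.SEEK_SET)
os.write(fd, buf)                          # one full-stripe direct write
os.close(fd)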

Thanks,
Stan


* Re: gather write metrics on multiple files
  2014-10-18  6:03       ` Stan Hoeppner
@ 2014-10-18 18:16         ` Stan Hoeppner
  2014-10-19 22:24           ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-10-18 18:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 10/18/2014 01:03 AM, Stan Hoeppner wrote:
> On 10/09/2014 04:13 PM, Dave Chinner wrote:
> ...
>>> I'm told we have 800 threads writing to nearly as many files
>>> concurrently on a single XFS on a 12+2 spindle RAID6 LUN.
>>> Achieved data rate is currently ~300 MiB/s.  Some of these are
>>> files are supposedly being written at a rate of only 32KiB every
>>> 2-3 seconds, while some (two) are ~50 MiB/s.  I need to determine
>>> how many bytes we're writing to each of the low rate files, and
>>> how many files, to figure out RMW mitigation strategies.  Out of
>>> the apparent 800 streams 700 are these low data rate suckers, one
>>> stream writing per file.  
>>>
>>> Nary a stock RAID controller is going to be able to assemble full
>>> stripes out of these small slow writes.  With a 768 KiB stripe
>>> that's what, 24 seconds to fill it at 2 seconds per 32 KiB IO?
>>
>> Raid controllers don't typically have the resources to track
>> hundreds of separate write streams at a time. Most don't have the
>> memory available to track that many active write streams, and those
>> that do probably can't proritise writeback sanely given how slowly
>> most cachelines would be touched. The fast writers would simply tune
>> over the slower writer caches way too quickly.
>>
>> Perhaps you need to change the application to make the slow writers
>> buffer stripe sized writes in memory and flush them 768k at a
>> time...
> 
> All buffers are now 768K multiples--6144, 768, 768, and I'm told the app should be writing out full buffers.  However I'm not seeing the throughput increase I should given the amount that the RMWs should have decreased, which, if my math is correct, should be about half (80) the raw actuator seek rate of these drives (7.2k SAS).  Something isn't right.  I'm guessing it's the controller firmware, maybe the test app, or both.  The test app backs off then ramps up when response times at the controller go up and back down.  And it's not super accurate or timely about it.  The lowest interval setting possible is 10 seconds.  Which is way too high when a controller goes into congestion.
> 
> Does XFS give alignment hints with O_DIRECT writes into preallocated files?  The filesystems were aligned at make time w/768K stripe width, so each prealloc file should be aligned on a stripe boundary.  I've played with the various queue settings, even tried deadline instead of noop hoping more LBAs could be sorted before hitting the controller.  Can't seem to get a repeatable increase.  I've nr_requests at 524288, rq_affinity 2, read_ahead_kb 0 since reads are <20% of the IO, add_random 0, etc.  Nothing seems to help really.

Some additional background:

    Num. Streams     = 350
    WRITING:
        Num. Write Threads  = 100
        Avg. Write Rate     =       72 KiB/s
        Avg. Write Intvl    = 10666.666 ms
        Num. Write Buffers  = 426
        Write Buffer Size   = 768 KiB
        Write Buffer Mem.   = 327168 KiB
        Group Write Rate    =    25200 KiB/s
        Avg. Buffer Rate    = 32.812 bufs/s
        Avg. Buffer Intvl.  = 30.476 ms
        Avg. Thread Intvl.  = 3047.600 ms

The 350 streams are written to 350 preallocated files in parallel.  Yes, a seek monster.  Writing without AIO currently.  I'm bumping the rate to 2x during the run but that isn't reflected above; the above is the default setup, and the app can't dump the running setup.  The previous, non-aligned buffer config used 160 KB write buffers.
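
Those numbers hang together, for what it's worth (my arithmetic, not harness output):

# Cross-check of the harness parameters above.
streams         = 350
rate_kib_s      = 72                                # per stream
buf_kib         = 768
write_threads   = 100

group_rate      = streams * rate_kib_s              # 25200 KiB/s
buf_rate        = group_rate / float(buf_kib)       # 32.812 buffers/s
buf_intvl_ms    = 1000.0 / buf_rate                 # 30.476 ms
thread_intvl_ms = buf_intvl_ms * write_threads      # 3047.6 ms
write_intvl_ms  = 1000.0 * buf_kib / rate_kib_s     # 10666.666 ms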

Stan


* Re: gather write metrics on multiple files
  2014-10-18 18:16         ` Stan Hoeppner
@ 2014-10-19 22:24           ` Dave Chinner
  2014-10-21 23:56             ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2014-10-19 22:24 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

[ please word wrap your emails at 68-72 columns ]

On Sat, Oct 18, 2014 at 01:16:58PM -0500, Stan Hoeppner wrote:
> On 10/18/2014 01:03 AM, Stan Hoeppner wrote:
> > On 10/09/2014 04:13 PM, Dave Chinner wrote:
> > ...
> >>> I'm told we have 800 threads writing to nearly as many files
> >>> concurrently on a single XFS on a 12+2 spindle RAID6 LUN.
> >>> Achieved data rate is currently ~300 MiB/s.  Some of these are
> >>> files are supposedly being written at a rate of only 32KiB every
> >>> 2-3 seconds, while some (two) are ~50 MiB/s.  I need to determine
> >>> how many bytes we're writing to each of the low rate files, and
> >>> how many files, to figure out RMW mitigation strategies.  Out of
> >>> the apparent 800 streams 700 are these low data rate suckers, one
> >>> stream writing per file.  
> >>>
> >>> Nary a stock RAID controller is going to be able to assemble full
> >>> stripes out of these small slow writes.  With a 768 KiB stripe
> >>> that's what, 24 seconds to fill it at 2 seconds per 32 KiB IO?
> >>
> >> Raid controllers don't typically have the resources to track
> >> hundreds of separate write streams at a time. Most don't have the
> >> memory available to track that many active write streams, and those
> >> that do probably can't proritise writeback sanely given how slowly
> >> most cachelines would be touched. The fast writers would simply tune
> >> over the slower writer caches way too quickly.
> >>
> >> Perhaps you need to change the application to make the slow writers
> >> buffer stripe sized writes in memory and flush them 768k at a
> >> time...
> > 
> > All buffers are now 768K multiples--6144, 768, 768, and I'm told
> > the app should be writing out full buffers.  However I'm not
> > seeing the throughput increase I should given the amount that
> > the RMWs should have decreased, which, if my math is correct,

Maybe that's not your problem. What's the storage array tell you
about RMW cycles? What's it tell you about lun utilisation - is it
even or do you have hot luns?

> > should be about half (80) the raw actuator seek rate of these
> > drives (7.2k SAS).

Not all drives seek at the same rate. Typically for a RAID 6 array,
every disk you add to the width of the lun slows the seek rate for
full stripe writes by 2-3%. So a 12+2 lun is going to have an
average seek rate of 25-30% lower than a 2+1 lun on full stripe
writes....

> > Something isn't right.  I'm guessing it's
> > the controller firmware, maybe the test app, or both.  The test
> > app backs off then ramps up when response times at the
> > controller go up and back down.  And it's not super accurate or
> > timely about it.  The lowest interval setting possible is 10
> > seconds.  Which is way too high when a controller goes into
> > congestion.

The controller should not have any problems with this. If the
controller IO response times are varying significantly, then you're
doing something wrong - most probably caching in BBWC rather than
writing through to disk immediately...

> > Does XFS give alignment hints with O_DIRECT writes into
> > preallocated files?

What do you mean? If the file is preallocated and aligned, then
the IO alignment is wholly up to the application. i.e. if the
application is not doing aligned IO, then there's nothing the
filesystem can do to align it...

> > The filesystems were aligned at make time
> > w/768K stripe width, so each prealloc file should be aligned on
> > a stripe boundary.

"should be aligned"? You haven't verified they are aligned by using
'xfs_bmap -vp'?

> > I've played with the various queue settings,
> > even tried deadline instead of noop hoping more LBAs could be
> > sorted before hitting the controller.  Can't seem to get a
> > repeatable increase.  I've nr_requests at 524288, rq_affinity 2,
> > read_ahead_kb 0 since reads are <20% of the IO, add_random 0,
> > etc.  Nothing seems to help really.

nr_requests = 524288? Why do you want to queue half a million IOs
once the CTQ depth has overflowed? That's a major latency problem
right there.

You've got latency problems, so you should be removing any source
of potential or variable latency in the IO stack. e.g. turning off
all IO scheduler queuing, reducing CTQ depth and using write through
caching so you can observe the behaviour of the raw luns. Strip it
right back, then observe...
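
e.g. a rough sketch of that sort of stripped-back baseline (the device
names and the queue depth value are examples, not recommendations):

# Strip the block layer back to a minimal config on the test luns before
# measuring.  sdb/sdc and the queue depth of 32 are examples only.
settings = {
    'scheduler':     'noop',
    'nr_requests':   '128',
    'read_ahead_kb': '0',
    'add_random':    '0',
}
for dev in ('sdb', 'sdc'):
    for knob, value in settings.items():
        with open('/sys/block/%s/queue/%s' % (dev, knob), 'w') as f:
            f.write(value)
    with open('/sys/block/%s/device/queue_depth' % dev, 'w') as f:
        f.write('32')                    # reduce CTQ depth on the lun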

> Some additional background:
> 
>     Num. Streams     = 350
>     WRITING:
>         Num. Write Threads  = 100
>         Avg. Write Rate     =       72 KiB/s
>         Avg. Write Intvl    = 10666.666 ms
>         Num. Write Buffers  = 426
>         Write Buffer Size   = 768 KiB
>         Write Buffer Mem.   = 327168 KiB
>         Group Write Rate    =    25200 KiB/s
>         Avg. Buffer Rate    = 32.812 bufs/s
>         Avg. Buffer Intvl.  = 30.476 ms
>         Avg. Thread Intvl.  = 3047.600 ms
> 
> The 350 streams are written to 350 preallocated files in parallel.

And the layout of those files is? If you don't know the physical
layout of the files and what disks in the storage array they map to,
then you can't determine what the seek times should be. If you can't
work out what the seek times should be, then you don't know what the
stream capacity of the storage should be.

Keep in mind that single extent files are optimised for read
performance, not write performance. i.e. by default XFS trades off
some write performance to improve file read performance.  Optimising
for highest write speeds means linearising all writes (i.e. reducing
seeks), while XFS's default behaviour is to separate them into
different regions of the disk (increasing seeks).

IOWs, write rates are likely to go up if you allow files to be
fragmented and interleaved to make writes more sequential.
The down side is that reads will then seek, but if reads aren't the
primary workload, nor a performance sensitive operation, then
perhaps you're optimising for the wrong operation....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: gather write metrics on multiple files
  2014-10-19 22:24           ` Dave Chinner
@ 2014-10-21 23:56             ` Stan Hoeppner
  2014-10-25  2:28               ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-10-21 23:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


On 10/19/2014 05:24 PM, Dave Chinner wrote:
> [ please word wrap your emails at 68-72 columns ]
> 
> On Sat, Oct 18, 2014 at 01:16:58PM -0500, Stan Hoeppner wrote:
>> On 10/18/2014 01:03 AM, Stan Hoeppner wrote:
>>> On 10/09/2014 04:13 PM, Dave Chinner wrote:
>>> ...
>>>>> I'm told we have 800 threads writing to nearly as many files
>>>>> concurrently on a single XFS on a 12+2 spindle RAID6 LUN.
>>>>> Achieved data rate is currently ~300 MiB/s.  Some of these are
>>>>> files are supposedly being written at a rate of only 32KiB every
>>>>> 2-3 seconds, while some (two) are ~50 MiB/s.  I need to determine
>>>>> how many bytes we're writing to each of the low rate files, and
>>>>> how many files, to figure out RMW mitigation strategies.  Out of
>>>>> the apparent 800 streams 700 are these low data rate suckers, one
>>>>> stream writing per file.  
>>>>>
>>>>> Nary a stock RAID controller is going to be able to assemble full
>>>>> stripes out of these small slow writes.  With a 768 KiB stripe
>>>>> that's what, 24 seconds to fill it at 2 seconds per 32 KiB IO?
>>>>
>>>> Raid controllers don't typically have the resources to track
>>>> hundreds of separate write streams at a time. Most don't have the
>>>> memory available to track that many active write streams, and those
>>>> that do probably can't proritise writeback sanely given how slowly
>>>> most cachelines would be touched. The fast writers would simply tune
>>>> over the slower writer caches way too quickly.
>>>>
>>>> Perhaps you need to change the application to make the slow writers
>>>> buffer stripe sized writes in memory and flush them 768k at a
>>>> time...
>>>
>>> All buffers are now 768K multiples--6144, 768, 768, and I'm told
>>> the app should be writing out full buffers.  However I'm not
>>> seeing the throughput increase I should given the amount that
>>> the RMWs should have decreased, which, if my math is correct,
> 
> Maybe that's not your problem. What's the storage array tell you
> about RMW cycles? What's it tell you about lun utilisation - is it
> even or do you have hot luns?

Maybe not.  If what I'm told about the controller statistics screen is
correct, RMWs, or "small destages", are far less than 0.5% of total
destages.  However that rate didn't change noticeably when I switched to
stripe aligned buffer sizes of 768K vs 160K.  Watching the stats in real
time shows zero small destages for long periods of time, then a burst of
them, then nothing again.  I'm told the firmware ignores all low rate
IOs so cache lines can be dedicated to the fast writers, and it only
waits 3 seconds to assemble full stripes for writeback.  So what I'm
seeing maybe doesn't match what I'm being told.  I've been given no docs
for the controllers because they haven't apparently been written yet.  I
must trust what I'm told.  Again, these controllers are in a beta stage
of development.

Hot LUNs aren't an issue as we have one filesystem per LUN, and one LUN
per controller.  At least in this test rig.

>>> should be about half (80) the raw actuator seek rate of these
>>> drives (7.2k SAS).
> 
> Not all drives seek at the same rate. Typically for a RAID 6 array,
> every disk you add to the width of the lun slows the seek rate for
> full stripe writes by 2-3%. So a 12+2 lun is going to have an
> average seek rate of 25-30% lower than a 2+1 lun on full stripe
> writes....

Right.  And partial stripe writes will hit a subset of disks, thus the
associated RMW read will cause extra seeks on one or more of these, and
possibly two others to read parity (RAID6).  The rig I'm testing at the
moment has two 12+1 RAID5 arrays, so only one parity seek on RMW.

>>> Something isn't right.  I'm guessing it's
>>> the controller firmware, maybe the test app, or both.  The test
>>> app backs off then ramps up when response times at the
>>> controller go up and back down.  And it's not super accurate or
>>> timely about it.  The lowest interval setting possible is 10
>>> seconds.  Which is way too high when a controller goes into
>>> congestion.
> 
> The controller should not have any problems with this. If the
> controller IO response times are varying significantly, then you're
> doing something wrong - most probably caching in BBWC rather than
> writing through to disk immediately...

When a controller goes into congestion I see await and avgqu-sz in
iostat jump from 15-50ms steady state up into the hundreds, then into
the thousands of ms if we don't back down the IOs being submitted.  This
is with O_DIRECT, and with and without using AIO.  Once we do back it
down the controller eventually recovers after tens of seconds to a
minute or so, and wait and queue size drop back down to 'normal'.

Due to the number of streams write-through mode would simply make every
IO an RMW and throughput would be abysmal.  I've been testing a two LUN
config with 402 streams per LUN, per XFS filesystem, but the design is
up to 14 LUNs.  So we're talking in excess of 5600 IO streams with the
test harness, possibly over 10k in customer hands, or 2600 to 5000 IO
streams per controller.  So writeback and sorting high rate sectors into
stripes is mandatory.

With the two LUN setup I'm working with I see the controllers go into
congestion and iostats await jumps from 10-50ms steady state up into the
hundreds and low thousands of ms.  And avgqu-sz just soars.  Whether
this is due to poor writeback performance or seeking the drives to death
remains to be seen.  Could be a combination of both.

>>> Does XFS give alignment hints with O_DIRECT writes into
>>> preallocated files?
> 
> What do you mean? if the file is preallocated and aligned, then
> the IO alignment is wholly up to the application. i.e. if the
> application is not doing aligned IO, then there's nothing the
> filesystem can do to align it...

I mean during writeout to the block layer.  O_DIRECT writes from the app
must be multiples of 4K.  Does XFS do anything different on writeout if
the app writes 160k vs 768k, when the FS was created with alignment,
writing to files created with posix_fallocate()?  Does XFS group them
into clusters of 1536 sectors?  Or does it just sling pages (8 sectors)
to the block layer?

Forgive my ignorance.  Our mentoring sessions never got this deep into
the stack.  Though we did touch the surface on CDBs and DMA from memory
to the HBA in one discussion.

>>> The filesystems were aligned at make time
>>> w/768K stripe width, so each prealloc file should be aligned on
>>> a stripe boundary.
> 
> "should be aligned"? You haven't verified they are aligned by using
> with 'xfs_bmap -vp'?

If I divide the start of the block range by 192 (768k/4k) those files
checked so far return a fractional value.  So I assume this means these
files are not stripe aligned.  What might cause that given I formatted
with alignment?
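
A scripted version of that check would look something like this (my
sketch; it assumes xfs_bmap's block ranges are 512-byte basic blocks,
so one 768 KiB stripe is 1536 of them):

# Check whether each extent of each file starts on a stripe boundary.
# Assumes xfs_bmap block ranges are 512-byte basic blocks, so a 768 KiB
# stripe width is 1536 of them.
import subprocess, sys

STRIPE_BLOCKS = 768 * 1024 // 512      # 1536

for path in sys.argv[1:]:
    out = subprocess.check_output(['xfs_bmap', '-v', path]).decode()
    for line in out.splitlines():
        fields = line.split()
        # extent lines look like: "0: [0..N]: start..end ag (...) len"
        if len(fields) < 3 or not fields[0].rstrip(':').isdigit():
            continue
        if fields[2] == 'hole':
            continue
        start = int(fields[2].split('..')[0])
        print('%s ext %s start %d %s' % (
            path, fields[0].rstrip(':'), start,
            'aligned' if start % STRIPE_BLOCKS == 0 else 'NOT aligned'))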

>>> I've played with the various queue settings,
>>> even tried deadline instead of noop hoping more LBAs could be
>>> sorted before hitting the controller.  Can't seem to get a
>>> repeatable increase.  I've nr_requests at 524288, rq_affinity 2,
>>> read_ahead_kb 0 since reads are <20% of the IO, add_random 0,
>>> etc.  Nothing seems to help really.
> 
> nr_requests = 524288? Why do you want to queue half a million IOs
> once the CTQ depth has overflowed? That's a major latency problem
> right there.

As I said I was hoping this would give the elevator a larger window in
which to sort IOs into sequential writes.  The documentation of
nr_requests is pretty sparse.  Says the kernel will use only as many as
needed, IIRC.  The default is 128 and I saw additional throughput with
8192.  I bumped it up to 131072, then 524288 as a test.  Neither of the
last two seems to help or hurt, but 8192 helped, with noop.

> You've got latency problems, so your should be removing any source

Latency is only a problem once the controller becomes saturated and
congested.  This occurs somewhere between 250-400 MB/s, but is variable.
 It seems to depend on which sets of files are being written at a given
moment.  Due to the scattered file layout across all 44 AGs it seems
logical to me that seeking up/down the platters is the primary problem.
 We're writing 403 files in parallel, albeit at different rates.  If at
one moment we're mostly hitting AGs 0-10 we're not seeking all that
much.  The next moment we may be writing two high rates files, one in
AG0 and one in AG44, and 50 medium rate files in AGs 12-35.  The
application data rate hasn't changed, but our seek distance, pattern,
and times, are dramatically increased.

I've not yet performed a full file location analysis as we generate over
27k files, and I've not figured out a way to automate this.  But I have
already recommended we optimize the file layout, if possible, to avoid
this situation, as I know we already have this seek latency problem to
some degree.
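
If I do find time, something along these lines might do for the
location analysis (a rough sketch only; it buckets each file by the AG
of its first extent via xfs_bmap -v, so it will be slow over 27k files):

# Bucket every file under a tree by the AG of its first extent, using
# xfs_bmap -v.  Slow over 27k files (one xfs_bmap per file), but simple.
import os, subprocess, sys
from collections import defaultdict

counts = defaultdict(int)
for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        path = os.path.join(dirpath, name)
        out = subprocess.check_output(['xfs_bmap', '-v', path]).decode()
        for line in out.splitlines():
            fields = line.split()
            if len(fields) > 3 and fields[0].rstrip(':').isdigit() \
                    and fields[2] != 'hole':
                counts[int(fields[3])] += 1    # AG of the first extent
                break

for ag in sorted(counts):
    print('AG %2d: %5d files' % (ag, counts[ag]))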

> of potential or variable latency in the IO stack. e.g. turning off
> all IO scheduler queuing, reducing CTQ depth and using write through
> caching so you can observe the behaviour of the raw luns. Strip it
> right back, then observe...

As I said we can't do write-through.  And I'm pretty sure the latency is
seek latency, not IO path latency.  The disks are slow, 7.2k, in parity
RAID, and we're writing 400 files concurrently--2 fast, 50 medium, and
350 slow, along with 20% random reads thrown in, so reading 80 files
concurrently with the writes.  All against 12 effective 7.2k spindles in
RAID5, or RAID6.

Common sense, or should I say experience, tells me the performance cliff
is insufficient actuator bandwidth for the workload as we currently lay
out the files across the AGs.  So this is where I'm focusing my efforts
at the moment.

>> Some additional background:
>>
>>     Num. Streams     = 350
>>     WRITING:
>>         Num. Write Threads  = 100
>>         Avg. Write Rate     =       72 KiB/s
>>         Avg. Write Intvl    = 10666.666 ms
>>         Num. Write Buffers  = 426
>>         Write Buffer Size   = 768 KiB
>>         Write Buffer Mem.   = 327168 KiB
>>         Group Write Rate    =    25200 KiB/s
>>         Avg. Buffer Rate    = 32.812 bufs/s
>>         Avg. Buffer Intvl.  = 30.476 ms
>>         Avg. Thread Intvl.  = 3047.600 ms
>>
>> The 350 streams are written to 350 preallocated files in parallel.
> 
> And they layout of those files are? If you don't know the physical
> layout of the files and what disks in the storage array they map to,
> then you can't determine what the seek times should be. If you can't
> work out what the seek times should be, then you don't know what the
> stream capacity of the storage should be.

Precisely.  Currently working this issue as mentioned.  Interestingly, I
tried to explain this on day one during my site visit, but nobody wanted
to listen:  "We don't have to worry about file layout with EXT4.  We
shouldn't have to with XFS.  We should just be able to create our
directories and files how we want on a single mount point. etc, etc".  6
weeks later, they're finally ready to listen, somewhat, after all other
tweaking has led to very few gains.

Nobody wants to rewrite their app, whether the test harness group or
the production app group, to get performance.  This is their first time
through this.  AIUI, their previous product didn't use a filesystem, but
wrote raw to the storage, similar to how some DB vendors used to do it.
So simply getting them to listen to new ways of doing things is
difficult.  I guess on the plus side they may keep extending my contract
as they find more value in the advice and information I'm providing.
Moving so slowly and chewing through concrete walls is frustrating, however.

> Keep in mind that single extent files are optimised for read
> performance, not write performance. i.e. by default XFS trades off
> some write performance to improve file read performance.  Optimising
> for highest write speeds means linearising all writes (i.e. reducing
> seeks), while XFS's default behaviour is to separate them into
> different regions of the disk (increasing seeks).

Ok, so their idea in using preallocated files was to guarantee space and
prevent file and free space fragmentation.  They loop through the files
once they fill, overwriting them at some point for reuse, IIUC.

The large stream files are 2.5-4.8 GB, the mediums are 1.5-2.7 GB, and
the smalls are 197-314 MB.  We should be able to split them up across
the AGs in a manner in which the heads are sweeping only one or two
adjacent AGs at a time for any given 402 IOs, walking from the outer
platter edge to inner as we progress through the files.  I've checked a
few of the large ones and they are two extents each, one very large one
in AG13 and a very small one in AG15.  This is a result of spillage when
AG13 filled, I assume.  A binary creates the directories and files and
I've not seen the source yet.  I'm guessing it's done in parallel
instead of serially, so the directories are likely scattered across the
AGs in a random order.

Speaking of this, when I straighten this out, how does one create a
large number of directories serially so as to ensure placement in
sequential AGs?  Do waits need to be added between each mkdir, for
example?
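
(What I had in mind is something like this rough sketch: create them
serially from a single process, then drop a probe file in each and
check with xfs_bmap -v which AG it landed in.)

# Create the group directories one at a time from a single process, then
# drop a small probe file in each and see which AG it lands in.
# The "group%02d" naming and the count of 44 are just examples.
import os, subprocess

for i in range(44):
    d = 'group%02d' % i
    os.mkdir(d)
    probe = os.path.join(d, 'probe')
    with open(probe, 'wb') as f:
        f.write(b'\0' * 4096)
        f.flush()
        os.fsync(f.fileno())               # force allocation before bmap
    out = subprocess.check_output(['xfs_bmap', '-v', probe]).decode()
    print(d, 'AG', out.splitlines()[2].split()[3])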

> IOWs, write rates are likely to go up if you allow files to be
> fragmented and interleaved to make writes more sequential.

With this many write streams and slow disks I think the primary goal
should be minimizing large distance seeks during writes (i.e. AG0 to
AG43 and back, platter edge to platter edge).  Proper file placement
matching the application's write pattern should achieve this.  Does it
matter then if we use preallocated or allocated files?  Sticking with
prealloc files prevents fragmentation, and thus free space btree lookup
slowdowns.  Or am I missing you here?

> The down side is that reads will then seek, but if reads aren't the
> primary workload, nor a performance sensitive operation, then
> perhaps you're optimising for the wrong operation....

Perhaps.  I think it's more likely we just haven't been on exactly the
same page, probably because I'd not explained things thoroughly enough
to this point.

My next test is to be 44 O_DIRECT write threads in parallel, writing one
allocated file in each AG, then 22 files each in AG0 and AG1.  This is to
demonstrate the throughput differences due to full stroke platter
seeking vs localized short stroke seeking.  Sure, I'll lose some
allocation parallelism but it should still demonstrate the point.  I
need something to convince the guys that modifying their app has promise.

Thanks Dave,
Stan


* Re: gather write metrics on multiple files
  2014-10-21 23:56             ` Stan Hoeppner
@ 2014-10-25  2:28               ` Stan Hoeppner
  0 siblings, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2014-10-25  2:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


On 10/21/2014 06:56 PM, Stan Hoeppner wrote:
> 
> On 10/19/2014 05:24 PM, Dave Chinner wrote:
...
>>>> The filesystems were aligned at make time
>>>> w/768K stripe width, so each prealloc file should be aligned on
>>>> a stripe boundary.
>>
>> "should be aligned"? You haven't verified they are aligned by using
>> with 'xfs_bmap -vp'?
> 
> If I divide the start of the block range by 192 (768k/4k) those files
> checked so far return a fractional value.  So I assume this means these
> files are not stripe aligned.  What might cause that given I formatted
> with alignment?

It seems file alignment isn't a big problem after all as the controllers
are doing relatively few small destages, about 1/250th of full stripe
destages.  And some of these are journal and metadata writes.

...
>>> The 350 streams are written to 350 preallocated files in parallel.
>>
>> And they layout of those files are? If you don't know the physical
>> layout of the files and what disks in the storage array they map to,
>> then you can't determine what the seek times should be. If you can't
>> work out what the seek times should be, then you don't know what the
>> stream capacity of the storage should be.

Took some time but I worked out a rough map of the files.  SG0, SG1, and
SG2 are the large, medium, and small file counts respectively.

AG  SG0 SG1 SG2		AG  SG0 SG1 SG2		AG  SG0 SG1 SG2		AG  SG0 SG1 SG2

 0    0 129 520		11  162 132 514		22  160 131 519		33  164 133 522
 1  164 129 518		12  160 132 520		23  161 132 518		34  161 130 517
 2  164 133 521		13  164 129 522		24  163 131 521		35  162 132 518
 3  159 129 518		14  164 130 522		25  162 129 519		36  161 131 518
 4   92 257 518		15  163 130 522		26  163 128 520		37  158 131 515
 5   91 256 516		16  163 131 521		27  162 130 523		38    0 132 518
 6   91 263 519		17  161 130 518		28  161 130 524		39    0 128 523
 7   92 261 518		18  165 127 520		29  163 129 517		40    0 131 521
 8   91 253 515		19  161 130 517		30  166 129 520		41    0 130 522
 9   94 257 451		20  167 128 525		31  162 129 521		42    0 128 517
10  172 129 455		21  164 130 515		32  161 129 515		43    0 131 516

All 3 file sizes are fairly evenly spread across all AGs, and that is a
problem.  The directory structure is set up so that each group directory
has one subdir per stream and multiple files which are written in
succession as they fill, and we start with the first file in each
directory.  SG0 has two streams/subdirs, SG1 has 50, and SG2 has 350.
Write stream rates:

SG0	  2@ 94.0  MB/s,
SG1	 50@  2.4  MB/s
SG2	350@  0.14 MB/s

This is 357 MB/s aggregate targeted at a 12+1 RAID5 or 12+2 RAID6, the
former in this case.  In either case we can't maintain this rate.  A
~36-45 hour run writes all files once.  During this duration we see the
controller go into congestion hundreds of times.  Wait goes up,
bandwidth down, and we drop application buffers because they're on a
timer.  If we can't write a buffer in X seconds we drop it.

The directory/file layout indicates highly variable AG access patterns
throughout the run, thus lots of AG-to-AG seeking, thus seeking lots of
platter surface all the time.  It also indicates large sweeps of the
actuators when concurrent file accesses are in low and high numbered
AGs.  And this tends to explain the relatively stable throughput some of
the time with periods of high IO wait and low bandwidth at other times.
 Too much seek delay with these latter access patterns.

I haven't profiled the application to verify which files are written in
parallel at a given point in the run, but I think that would be a waste
of time given the file/AG distribution we see above.  And I don't have
enough time left on my contract to do it anyway.

I can attach tree or 'ls -lavR' output if that would help paint a
clearer picture of how the filesystem is organized.


Thanks,
Stan

