* realtime section bugs still around
@ 2012-07-27  8:14 Jason Newton
  2012-07-27  9:56 ` Stan Hoeppner
  2012-07-30  3:03 ` Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Jason Newton @ 2012-07-27  8:14 UTC (permalink / raw)
  To: xfs



Hi,

I think the following bug is still around:

http://oss.sgi.com/archives/xfs/2011-11/msg00179.html

I get the same stack trace.  There's another report out there somewhere
with another similar stack trace.  I know the realtime code is not
maintained so much but it seems to be a waste to let it fall out of
maintenance when it's the only thing on linux that seems to fill the
realtime io niche.

So this email is mainly about the null pointer deref on the spinlock in
_xfs_buf_find on realtime files, but I figure I might also ask a few more
questions.

What kind of differences should one expect between GRIO and realtime files?

What kind of write latencies should one expect for realtime files vs
normal?

My use case is diagnostic tracing on an embedded system as well as saving
raw video to disk (3 high res 10bit video streams, 5.7MB per frame, at 20hz
so effectively 60fps total).   I use 2 512GB OCZ vertex 4 SSDs which
support ~450MB/s each.  I've soft-raided them together (raid 0) with a 4k
chunksize and I get about 900MB/s avg in a benchmark program I wrote to
simulate my videostream logging needs.  I only save one file per
videostream (only 1 videostream modeled in simulation), which I append to
in a loop with a single write call, which records the frame, over and over
while keeping track of timing.  The frame is in memory and nonzero with
some interesting pattern to defeat compression if it's in the pipeline
anywhere.  I get 180-300MB/s with O_DIRECT, so better performance without
O_DIRECT (maybe because it's soft-raid?).  The problem is that I
occasionally get hiccups in latency... there's nothing else using the disk
(embedded system, no other pid's running + root is RO).  I use the deadline
io scheduler on both my SSDs.

I only have 50 milliseconds per frame and latencies exceeding this would
result in dropped frames (bad).

Benchmarks (all time values in milliseconds per frame for the write call to
complete), with 4k chunksizes for raid-0 (85-95% CPU):
[04:42:08.450483000] [6] min: 4 max: 375 avg: 6.6336148 std: 4.6589185
count = 163333, transferred 900.33G
[07:52:21.204783000] [6] min: 4 max: 438 avg: 6.4564963 std: 3.9554192
count = 34854, transferred 192.12G (total time=226.65sec, ~154fps)

O_DIRECT (60-80% CPU):
[07:46:08.912902000] [6] min: 13 max: 541 avg: 25.9286739 std: 10.3084094
count = 17527, transferred 96.61G


Some benchmarks of last night's 32k chunksizes for raid-0:
vectorized write (prior to d_mem aligned, tightly packed frames):
[05:46:02.481997000] [6] min: 4 max: 50 avg: 6.3724173 std: 3.1656021 count
= 3523, transferred 19.42G
[06:14:19.416474000] [6] min: 4 max: 906 avg: 6.6565749 std: 9.2845644
count = 22538, transferred 124.23G
[06:15:58.029818000] [6] min: 4 max: 485 avg: 6.4346011 std: 5.6314630
count = 12180, transferred 67.14G
[06:33:24.125104000] [6] min: 4 max: 1640 avg: 6.7820190 std: 9.9053959
count = 40862, transferred 225.24G
[06:47:00.812176000] [6] min: 4 max: 503 avg: 6.7217849 std: 5.8866980
count = 13099, transferred 72.20G
[07:03:55.334832000] [6] min: 4 max: 505 avg: 6.5297441 std: 8.0027016
count = 14636, transferred 80.68G

non vectorized (many write calls):
[05:46:55.839896000] [6] min: 5 max: 341 avg: 7.1133700 std: 7.3144947
count = 2878, transferred 15.86G
[06:03:00.353392000] [6] min: 5 max: 464 avg: 7.8846180 std: 5.5350027
count = 27966, transferred 154.16G

O_DIRECT:
[07:51:45.467037000] [6] min: 9 max: 486 avg: 11.6206933 std: 6.9021786
count = 9603, transferred 52.93G
[07:59:04.404820000] [6] min: 9 max: 490 avg: 11.8425485 std: 6.6553718
count = 32172, transferred 177.34G


xfs_info of my video raid:
meta-data=/dev/md2               isize=256    agcount=32, agsize=7380047 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=236161504, imaxpct=25
         =                       sunit=1      swidth=2 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=115313, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I'm using 3.2.22 with the rt34 patchset.

If it's desired I can post my benchmark code. I intend to rework it a
little so it only does 60fps capped since this is my real workload.

If anyone has any tips for reducing latencies of the write calls or cpu
usage, I'd be interested for sure.

Apologies for the long email!  I figured I had an interesting use case with
lots of numbers at my disposal.

-Jason


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: realtime section bugs still around
  2012-07-27  8:14 realtime section bugs still around Jason Newton
@ 2012-07-27  9:56 ` Stan Hoeppner
  2012-07-30  3:03 ` Dave Chinner
  1 sibling, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2012-07-27  9:56 UTC (permalink / raw)
  To: xfs

On 7/27/2012 3:14 AM, Jason Newton wrote:

> raw video to disk (3 high res 10bit video streams, 5.7MB per frame, at 20hz
> so effectively 60fps total).   I use 2 512GB OCZ vertex 4 SSDs which
> support ~450MB/s each.  I've soft-raided them together (raid 0) with a 4k
> chunksize and I get about 900MB/s avg in a benchmark program I wrote to
> simulate my videostream logging needs.
...
> I only have 50 milliseconds per frame and latencies exceeding this would
> result in dropped frames (bad).
...
max: 375
transferred 900.33G
...
max: 438
transferred 192.12G
...
max: 541
transferred 96.61G
...
max: 50
transferred 19.42G
...
max: 906
transferred 124.23G

etc.

> xfs_info of my video raid:
> meta-data=/dev/md2               isize=256    agcount=32, agsize=7380047
> blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=236161504, imaxpct=25
>          =                       sunit=1      swidth=2 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=115313, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> I'm using 3.2.22 with the rt34 patchset.
> 
> If it's desired I can post my benchmark code. I intend to rework it a
> little so it only does 60fps capped since this is my real workload.
> 
> If anyone has any tips for reducing latencies of the write calls or cpu
> usage, I'd be interested for sure.

I don't think your write latency problem is software related.

What do you think the odds are that the wear leveling routine is kicking
in and causing your half second max latencies?  In one test you wrote
over 90% of the user cells of the devices, and most of your test writes
were over 100GB--10% of the user cells.  That's an extremely large wear
load for an SSD over a short period.

What happens when you format each SSD directly and write to the two XFS
filesystems, without md/RAID0, two streams to one SSD and one to the
other?  That'll free up serious cycles allowing you to eliminate CPU
saturation.

WRT CPU consumption, at these data rates, md/RAID0 is going to eat
massive cycles, even though it is not bound by a single thread as are
RAID1/10/5/6.  A linear concat will eat the same as RAID0.  The others
would simply peak one core and scale no further.  Both 0/linear are
fully threaded and simply pass an offset to the block layer, so using an
embedded CPU with more cores would help.  One with a faster clock would
as well obviously, but not as much as more cores.

Interesting topic Jason.

-- 
Stan


* Re: realtime section bugs still around
  2012-07-27  8:14 realtime section bugs still around Jason Newton
  2012-07-27  9:56 ` Stan Hoeppner
@ 2012-07-30  3:03 ` Dave Chinner
       [not found]   ` <CAGou9MheeBWxajd65szNfDB2L+VVoZ7SypEdUKj7np3L0H8fHA@mail.gmail.com>
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2012-07-30  3:03 UTC (permalink / raw)
  To: Jason Newton; +Cc: xfs

On Fri, Jul 27, 2012 at 01:14:17AM -0700, Jason Newton wrote:
> Hi,
> 
> I think the following bug is still around:
> 
> http://oss.sgi.com/archives/xfs/2011-11/msg00179.html
> 
> I get the same stack trace.

Not surprising, I doubt anyone has looked at it much. Indeed,
xfs/090 assert fails immediately in the rt allocator for me....

> There's another report out there somewhere
> with another similar stack trace.  I know the realtime code is not
> maintained so much but it seems to be a waste to let it fall out of
> maintenance when it's the only thing on linux that seems to fill the
> realtime io niche.

The XFS "realtime" device has nothing to do with "realtime IO".

If anything, it's probably much worse at "realtime IO" than the
normal data device, especially at scale, because it is bitmap rather
than btree based. And it is single threaded.

That's why it really isn't maintained - the data device is as good
or better in RT workloads as the "realtime" device....

> So this email is mainly about the null pointer deref on the spinlock in
> _xfs_buf_find on realtime files, but I figure I might also ask a few more
> questions.
> 
> What kind of differences should one expect between GRIO and realtime files?

Linux doesn't support GRIO. It's an Irix only thing, and that
required special hardware support for bandwidth reservation, special
frame schedulers in the IO path, etc. The XFS realtime device was
just one part of the whole GRIO framework. Anyway, if you don't have
15 year old SGI hardware you can't use GRIO.

If you are talking about GRIOv2, then, well, you aren't running
CXFS...

> What kind of write latencies should one expect for realtime files vs
> normal?

How long is a piece of string?

> raw video to disk (3 high res 10bit video streams, 5.7MB per frame, at 20hz
> so effectively 60fps total).   I use 2 512GB OCZ vertex 4 SSDs which
> support ~450MB/s each.  I've soft-raided them together (raid 0) with a 4k
> chunksize

There's your first problem. You are storing 5.7MB files, so why
would you use a 4k chunk size? You'd do better with something on the
order of 1MB chunk size (2MB stripe width) so that you are forming
as large IOs as possible with the minimum of software overhead (i.e
no merging of 4k IOs into larger IOs in the IO scheduler).
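Dave's suggestion translates to something like this (device names and the
array name are assumptions, mdadm's --chunk is in KB, and re-creating the
array destroys existing data):

```shell
# RAID0 across two SSDs with a 1MB chunk (2MB stripe width)
mdadm --create /dev/md2 --level=0 --raid-devices=2 --chunk=1024 \
      /dev/sda2 /dev/sdb2

# Tell mkfs.xfs the geometry: su = chunk size, sw = number of data disks
mkfs.xfs -d su=1024k,sw=2 /dev/md2
```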

Note that you are also writing hundreds of GB to the SSDs, which
will be triggering internal garbage collection, and that will have
significant impact on IO completion latency. It's not uncommon to
see 500ms IO latencies occur on consumer level SSDs when garbage
collect kicks in. If you are going to use SATA SSDs, then you're
going to have to design your application to be able to handle such
write latencies...

> and I get about 900MB/s avg in a benchmark program I wrote to
> simulate my videostream logging needs.  I only save one file per
> videostream (only 1 videostream modeled in simulation), which I append to
> in a loop with a single write call, which records the frame, over and over
> while keeping track of timing.

The typical format for high bandwidth video stream is file per
frame. That's exactly what the filestreams allocator is designed for
- ingest of multiple streams and keeping them in separate locations
(AGs) on disk. This means allocation remains concurrent and doesn't
serialise, which would otherwise cause excess, unpredictable latencies.

Indeed, if you use file per frame, and a RAID0 chunk size of 3MB
(6MB stripe width), then XFS will align the data in each file to the
same stripe unit boundary for all files. There will be 300kb of free
space between them, but having everything nicely aligned to the
underlying geometry tends to help maintain allocation determinism
until the filesystem is 5.7/6 * 100% = 95% full.....

> The frame is in memory and nonzero with
> some interesting pattern to defeat compression if its in the pipeline
> anywhere.  I get 180-300MB/s with O_DIRECT, so better performance without
> O_DIRECT (maybe because it's soft-raid?).

It sounds like you are using in line write(2) calls, which means the
IO is synchronous (i.e. occurs within the write syscall), which
means throughput is bound by IO completion latency. AIO+DIO solves
this problem as it implies application level frame buffering - this
is a common way of ensuring that IO latencies don't cause dropped
frames.

Using buffered IO means the write(2) operates at memory speed, but
you then have no control over allocation and writeback, and memory
allocation and reclaim becomes a major source of latency that direct
IO does not have. Doing buffered IO to the realtime device is, well,
even less well tested than the realtime device, as historically the
RT device only supported direct IO. It's supposed to work, but it's
never really been well tested, and I don't know anyone who uses it
in production....

> The problem is that I
> occasionally get hiccups in latency... there's nothing else using the
> (embedded system, no other pid's running + root is RO).  I use the deadline
> io scheduler on both my SSDs.

Yep, that'll be because you are using buffered IO. It'll be faster
than a naive Direct IO implementation, but you'll have latency
issues that cannot be avoided or predicted.

> xfs_info of my video raid:
> meta-data=/dev/md2               isize=256    agcount=32, agsize=7380047

Lots of little AGs - that will stress the freespace management of
the filesystem pretty quickly.....

> blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=236161504, imaxpct=25
>          =                       sunit=1      swidth=2 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=115313, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

And no realtime device. It doesn't look like you're testing what you
think you are testing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* realtime section bugs still around
       [not found]   ` <CAGou9MheeBWxajd65szNfDB2L+VVoZ7SypEdUKj7np3L0H8fHA@mail.gmail.com>
@ 2012-07-31 23:01     ` Jason Newton
  2012-07-31 23:46       ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Newton @ 2012-07-31 23:01 UTC (permalink / raw)
  To: Dave Chinner, xfs



On Sun, Jul 29, 2012 at 8:03 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Fri, Jul 27, 2012 at 01:14:17AM -0700, Jason Newton wrote:
> > Hi,
> >
> > I think the following bug is still around:
> >
> > http://oss.sgi.com/archives/xfs/2011-11/msg00179.html
> >
> > I get the same stack trace.
>
> Not surprising, I doubt anyone has looked at it much. Indeed,
> xfs/090 assert fails immediately in the rt allocator for me....
>
> > There's another report out there somewhere
> > with another similar stack trace.  I know the realtime code is not
> > maintained so much but it seems to be a waste to let it fall out of
> > maintenance when it's the only thing on linux that seems to fill the
> > realtime io niche.
>
> The XFS "realtime" device has nothing to do with "realtime IO".
>
> If anything, it's probably much worse at "realtime IO" than the
> normal data device, especially at scale, because it is bitmap rather
> than btree based. And it is single threaded.
>
> That's why it really isn't maintained - the data device is as good
> or better in RT workloads as the "realtime" device....
>

This wasn't expected, thanks for the clarifications.   What was the
original point of RT files?

>
> > So this email is mainly about the null pointer deref on the spinlock in
> > _xfs_buf_find on realtime files, but I figure I might also ask a few more
> > questions.
> >
> > What kind of differences should one expect between GRIO and realtime
> > files?
>
> Linux doesn't support GRIO. It's an Irix only thing, and that
> required special hardware support for bandwidth reservation, special
> frame schedulers in the IO path, etc. The XFS realtime device was
> just one part of the whole GRIO framework. Anyway, if you don't have
> 15 year old SGI hardware you can't use GRIO.
>
> If you are talking about GRIOv2, then, well, you aren't running
> CXFS...
>
> > What kind of write latencies should one expect for realtime files vs
> > normal?
>
> How long is a piece of string?
>
Well, I had meant with, say, one block of IO.


>
> > raw video to disk (3 high res 10bit video streams, 5.7MB per frame, at
> > 20hz
> > so effectively 60fps total).   I use 2 512GB OCZ vertex 4 SSDs which
> > support ~450MB/s each.  I've soft-raided them together (raid 0) with a 4k
> > chunksize
>
> There's your first problem. You are storing 5.7MB files, so why
> would you use a 4k chunk size? You'd do better with something on the
> order of 1MB chunk size (2MB stripe width) so that you are forming
> as large IOs as possible with the minimum of software overhead (i.e
> no merging of 4k IOs into larger IOs in the IO scheduler).
>
I went to the Intel builtin RAID0 and found that chunk sizes of 4k, 64k,
and 128k don't actually affect latency, throughput, or CPU much with the
simulation application I've written.  Even directly streaming to the raid
partition still gobbles 40% CPU (single thread, single stream @ 60fps,
higher avg latency than XFS).  XFS on any of these chunk sizes is 60-70%
CPU with 3 streams, 1 per thread.  For XFS single thread, single stream @
60fps it looked about the same as direct, maybe occasionally getting up
to 45-50% CPU. All these numbers are seemingly dependent on the
mood of the SSD, along with how often there were latency overruns
(sometimes none for 45 minutes, sometimes every second - perhaps there's a
pattern to the behavior).  I'd be interested in trying larger blocksizes
than 4k (I don't mean raid0 chunksize) but that doesn't seem possible with
x86_64 and linux...


> Note that you are also writing hundreds of GB to the SSDs, which
> will be triggering internal garbage collection, and that will have
> significant impact on IO completion latency. It's not uncommon to
> see 500ms IO latencies occur on consumer level SSDs when garbage
> collect kicks in. If you are going to use SATA SSDs, then you're
> going to have to design your application to be able to handle such
> write latencies...
>
500ms does look to be in the neighborhood of the garbage collection for
these drives.  Maybe 4-450 on average.  This neighborhood is an obvious
outlier in some tests.


> > and I get about 900MB/s avg in a benchmark program I wrote to
> > simulate my videostream logging needs.  I only save one file per
> > videostream (only 1 videostream modeled in simulation), which I append to
> > in a loop with a single write call, which records the frame, over and
> > over
> > while keeping track of timing.
>
> The typical format for high bandwidth video stream is file per
> frame. That's exactly what the filestreams allocator is designed for
> - ingest of multiple streams and keeping them in separate locations
> (AGs) on disk. This means allocation remains concurrent and doesn't
> serialise, causing excess, unpredicatble latencies.
>
Ah, that is interesting.  I used to save tiffs but I figured that would be
more variable in latency and CPU usage since it's opening and closing files
constantly.  However, you have a definite point that since it's not
serialized to one stream, there's some extra concurrency to exploit.  I'll
have to benchmark with multiple files again.


> Indeed, if you use file per frame, and a RAID0 chunk size of 3MB
> (6MB stripe width), then XFs will align the data in each file to the
> same stripe unit boundary for all files. There will be 300kb of free
> space between them, but having everything nicely aligned to the
> underlying geometry tends to help maintain allocation determinism
> until the filesystem is 5.7/6 * 100% = 95% full.....
>
>
> > The frame is in memory and nonzero with
> > some interesting pattern to defeat compression if it's in the pipeline
> > anywhere.  I get 180-300MB/s with O_DIRECT, so better performance without
> > O_DIRECT (maybe because it's soft-raid?).
>
> It sounds like you are using in line write(2) calls, which means the
> IO is synchronous (i.e. occurs within the write syscall), which
> means throughput is bound by IO completion latency. AIO+DIO solves
> this problem as it implies application level frame buffering - this
> is a common way of ensuring that IO latencies don't cause dropped
> frames
>
Yes, I don't really want to complicate the main program with AIO; it's
complex enough as is.


> Using buffered IO means the write(2) operates at memory speed, but
> you then have no control over allocation and writeback, and memory
> allocation and reclaim becomes a major source of latency that direct
> IO does not have. Doing buffered IO to the realtime device is, well,
> even less well tested than the realtime device, as historically the
> RT device only supported direct IO. It's supposed to work, but it's
> never really been well tested, and I don't know anyone who uses it
> in production....
>
> > The problem is that I
> > occasionally get hiccups in latency... there's nothing else using the
> > disk (embedded system, no other pid's running + root is RO).  I use the
> > deadline io scheduler on both my SSDs.
>
> Yep, that'll be because you are using buffered IO. It'll be faster
> than a naive Direct IO implementation, but you'll have latency
> issues that cannot be avoided or predicted.
>

Interesting, what constitutes a proper direct IO implementation?  AIO +
recording structures whose size is a multiple of, in this case, 4k?

>
> > xfs_info of my video raid:
> > meta-data=/dev/md2               isize=256    agcount=32, agsize=7380047
>
> Lots of little AGs - that will stress the freespace management of
> the filesystem pretty quickly.....
>
> > blks
> >          =                       sectsz=512   attr=2
> > data     =                       bsize=4096   blocks=236161504, imaxpct=25
> >          =                       sunit=1      swidth=2 blks
> > naming   =version 2              bsize=4096   ascii-ci=0
> > log      =internal               bsize=4096   blocks=115313, version=2
> >          =                       sectsz=512   sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> And no realtime device. It doesn't look like you're testing what you
> think you are testing....
>
Sorry, the topic quickly moved from something of a bug report / query to
an involved benchmark and testing.  This xfs_info was not from when I had
the realtime section; it was just for the 4k chunksize raid0.  After a few
crashes on the realtime section I moved on to other testing since I
doubted there was much that could be done.  I've since performed a lot of
testing (to be discussed hopefully in the next week, I'm getting to be
pretty short on time) and rewrote the framelogging component of the
application with average bandwidth in mind, decoupling the saving of
frame data from the framegrabber threads.  Basically I just have a
configurable circular buffer of up to 2 seconds of frames.  I think that
is the best answer for now since, from my naive point of view, it's some
combination of linux related (FS path was never RT) and SSD (garbage
collection was unplanned... who knows what else the firmware is doing).

I'm still interested in finding out why streaming a few hundred MB to disk
has so much overhead in comparison to the calculations I do in userspace,
though.  Straight copies of frames (in the real program, copied because of
limitations of the framegrabber driver's DMA engine) don't use as much CPU
as writing to a single SSD.  It takes a little over a millisecond to copy a
frame.  As for hardware, while it's an embedded system it's got a 2.2GHz
2-core i7 in it; the southbridge is BD82QM67-PCH.

-Jason



* Re: realtime section bugs still around
  2012-07-31 23:01     ` Jason Newton
@ 2012-07-31 23:46       ` Stan Hoeppner
       [not found]         ` <CAGou9MhneejOuhX4c8G06c3Zh7dxF-OtZ+=mT-7fho_u1Q3zWw@mail.gmail.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2012-07-31 23:46 UTC (permalink / raw)
  To: Jason Newton; +Cc: xfs

On 7/31/2012 6:01 PM, Jason Newton wrote:

> I'm still interested in finding out why streaming a few hundred MB to disk
> has so much overhead in comparison to the calculations I do in userspace,

1.  md eats a lot of cycles at high data rates
2.  ATA overhead
3.  IRQ/MSI overhead
4.  Etc.

All these small bits add up to more than negligible CPU overhead at high
data rates.

-- 
Stan


* Re: realtime section bugs still around
       [not found]         ` <CAGou9MhneejOuhX4c8G06c3Zh7dxF-OtZ+=mT-7fho_u1Q3zWw@mail.gmail.com>
@ 2012-08-01  3:55           ` Stan Hoeppner
  2012-08-01  5:55             ` Jason Newton
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2012-08-01  3:55 UTC (permalink / raw)
  To: Jason Newton; +Cc: xfs

On 7/31/2012 6:55 PM, Jason Newton wrote:
> On Tue, Jul 31, 2012 at 4:46 PM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
> 
>> On 7/31/2012 6:01 PM, Jason Newton wrote:
>>
>>> I'm still interested in finding out why streaming a few hundred MB to
>>> disk has so much overhead in comparison to the calculations I do in
>>> userspace,
>>
>> 1.  md eats a lot of cycles at high data rates
>>
> 
> md with Intel's raid0?  I stopped using linux/softraid, but I've read
> Intel's is a mix between hardware and software raid...

Intel Matrix RAID is fakeraid.  Designed for consumer workloads.  You're
shoving a decidedly non-consumer, high b/w IO stream through it.  Don't
expect much.  In fact I'm surprised you're using consumer grade gear for
this application.  You are designing this software/system for a
commercial use case, correct?  If so I'd get some better hardware.

CPU overhead for fakeraid will be similar to md/RAID, depending on the
vendor and implementation.  In some cases it may be much higher than md.

> 2.  ATA overhead
>> 3.  IRQ/MSI overhead
>> 4.  Etc.
>>
>> All these small bits add up to more than negligible CPU overhead at high
>> data rates.
>>
> 
> Regarding the others, how would I go about measuring their overhead...

To what end?

-- 
Stan


* Re: realtime section bugs still around
  2012-08-01  3:55           ` Stan Hoeppner
@ 2012-08-01  5:55             ` Jason Newton
  2012-08-02  0:39               ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Newton @ 2012-08-01  5:55 UTC (permalink / raw)
  To: stan; +Cc: xfs



On Tue, Jul 31, 2012 at 8:55 PM, Stan Hoeppner <stan@hardwarefreak.com>wrote:

>
> Intel Matrix RAID is fakeraid.  Designed for consumer workloads.  You're
> shoving a decidedly non consumer, high b/w IO stream through it.  Don't
> expect much.  In fact I'm surprised you're using consumer grade gear for
> this application.  You are designing this software/system for a
> commercial use case, correct?  If so I'd get some better hardware.
>
> CPU overhead for fakeraid will be similar to md/RAID, depending on the
> vendor and implementation.  In some cases it may be much higher than md.
>

I see.  It's important things stay COTS and small... things are sort of in
a prototyping phase with some size and power constraints.  We had problems
packaging what we already have, and consider that we already have some
specialized IO hardware we've had to account for.  There's just not much
if any room available anymore.  We're getting refined tasks in the future
and requirements will change as well... in particular this disk streaming
component is perhaps a one-off thing that we were notified of late in the
game.

I did read from Intel sources that Matrix Storage is really more of a
hybrid solution... after all, they make SATA controllers... and they
already have to handle 6Gb/s in hardware.  But maybe they save a penny on
the real estate, so maybe it's just fluff from Intel PR.  What kind of
hardware do you need in addition to make hardware RAID 0 or 1, though?

>
> > 2.  ATA overhead
> >> 3.  IRQ/MSI overhead
> >> 4.  Etc.
> >>
> >> All these small bits add up to more than negligible CPU overhead at high
> >> data rates.
> >>
> >
> > Regarding the others, how would I go about measuring their overhead...
>
> To what end?
>
Just to figure out for sure what the bottlenecks are and whether they can
be dealt with, rather than looking at it as an opaque system and assuming
nothing can be done.  Also as a learning experience.

--
> Stan
>
>



* Re: realtime section bugs still around
  2012-08-01  5:55             ` Jason Newton
@ 2012-08-02  0:39               ` Stan Hoeppner
  2012-08-02  2:38                 ` Jason Newton
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2012-08-02  0:39 UTC (permalink / raw)
  To: Jason Newton; +Cc: xfs

On 8/1/2012 12:55 AM, Jason Newton wrote:

> Just to figure out for sure what the bottlenecks are and whether they can
> be dealt with rather than looking at it as an opaque system and assuming
> nothing can be done.  Also as a learning experience.

Jason, have you considered something like this to solve your problems?

RAM is cheap.  Far cheaper than attacking this problem with any other
hardware type.  And you can't easily solve it by rewriting to use AIO,
given the effort involved with that.

You should be able to fit 32GB of RAM on the board.  Create a 24GB RAM
disk and use that for writing your 5.7MB frame files in real time.  This
eliminates any latency and stutter issues during capture.  Treat the RAM
disk as a FIFO, taking each new file and copying it out to SSD after
it's been closed, then delete the original.  This gives you in essence a
very fast buffer.  If my math is correct, 24,000MB / 300MB/s = roughly
80 seconds of buffer at a 300MB/s streaming capture rate, 40 seconds at
600MB/s.

This should be very easy to implement, and cheaper than all other
alternatives.  It should eliminate all possible latency issues, though
it will increase CPU cycles due to the data movement to/from the RAM
disk, though how much I can't guess at this point.  8GB RAM disk will
give you 26 seconds of buffering at 300MB/s, and a 4GB RAM disk will
give you 13 seconds of buffering.  If 13 seconds is sufficient, you can
implement this on a machine with only 8GB RAM, assuming you need no more
than 4GB for kernel/user space/application.

-- 
Stan



* Re: realtime section bugs still around
  2012-08-02  0:39               ` Stan Hoeppner
@ 2012-08-02  2:38                 ` Jason Newton
  2012-08-02 10:39                   ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Newton @ 2012-08-02  2:38 UTC (permalink / raw)
  To: stan; +Cc: xfs



On Wed, Aug 1, 2012 at 5:39 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:

> On 8/1/2012 12:55 AM, Jason Newton wrote:
>
> > Just to figure out for sure what the bottlenecks are and whether they can
> > be dealt with rather than looking at it as an opaque system and assuming
> > nothing can be done.  Also as a learning experience.
>
> Jason, have you considered something like this to solve your problems?
>
> RAM is cheap.  Far cheaper than attacking this problem with any other
> hardware type.  And you can't easily solve it by rewriting to use AIO,
> given the effort involved with that.
>
> You should be able to fit 32GB of RAM on the board.  Create a 24GB RAM
> disk and use that for writing your 5.7MB frame files in real time.  This
> eliminates any latency and stutter issues during capture.  Treat the RAM
> disk as a FIFO, taking each new file and copying it out to SSD after
> it's been closed, then delete the original.  This gives you in essence a
> very fast buffer.  If my math is correct, 24,000MB / 300MB/s = roughly
> 80 seconds of buffer at a 300MB/s streaming capture rate, 40 seconds at
> 600MB/s.
>

The system has a single SODIMM slot, with an 8GB DDR3 stick in it.
I've added a circular buffer for frames and limit it to some number of
frames (so far 2 seconds' worth; I haven't had time to experiment with it
yet).  The serialization thread is now separate and consumes the circular
buffer, so we're effectively talking about the same thing sans files.  This
solves the problem, but as I mentioned before, I still want to track down
the sources of the latency and CPU usage... I'm just not sure how to go
about it.
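For what it's worth, that arrangement might be sketched roughly like this
(the drop-on-full policy and all sizes are illustrative only; a bounded
`queue.Queue` stands in for the circular buffer):

```python
import queue
import threading

def capture_loop(frames, ring):
    """Producer side: push each captured frame into the ring without
    ever blocking the real-time thread; if the serializer has fallen
    behind and the ring is full, the frame is dropped instead."""
    dropped = 0
    for frame in frames:
        try:
            ring.put_nowait(frame)
        except queue.Full:
            dropped += 1  # never stall capture on a slow write
    return dropped

def serialize_loop(ring, out):
    """Consumer side: the serialization thread drains the ring and does
    the potentially high-latency writes, decoupled from capture."""
    while True:
        frame = ring.get()
        if frame is None:   # sentinel: capture finished
            break
        out.write(frame)
```

With, say, `ring = queue.Queue(maxsize=40)` (about 2 seconds at 20Hz), the
buffer's memory footprint stays bounded at roughly 40 x 5.7MB.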

>
> This should be very easy to implement, and cheaper than all other
> alternatives.  It should eliminate all possible latency issues, though
> it will increase CPU cycles due to the data movement to/from the RAM
> disk, though how much I can't guess at this point.  8GB RAM disk will
> give you 26 seconds of buffering at 300MB/s, and a 4GB RAM disk will
> give you 13 seconds of buffering.  If 13 seconds is sufficient, you can
> implement this on a machine with only 8GB RAM, assuming you need no more
> than 4GB for kernel/user space/application.
>
Agreed that it's the easiest and cheapest solution. Average performance
probably won't change, but bursts of CPU will, as it compensates for
high-latency writes in future cycles... this is undesirable but I think OK
(the important stuff is at high priority on SCHED_RR; these serialization
threads are high-priority SCHED_OTHER).  Again, I haven't had time to test
it as I've been putting out other fires.

-Jason




* Re: realtime section bugs still around
  2012-08-02  2:38                 ` Jason Newton
@ 2012-08-02 10:39                   ` Stan Hoeppner
  2012-08-03 11:28                     ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2012-08-02 10:39 UTC (permalink / raw)
  To: Jason Newton; +Cc: xfs

On 8/1/2012 9:38 PM, Jason Newton wrote:

> The system has a single SODIMM slot, with an 8GB DDR3 stick in it.
> I've added a circular buffer for frames and limit it to some number of
> frames (so far 2 seconds' worth; I haven't had time to experiment with it
> yet).  The serialization thread is now separate and consumes the circular
> buffer, so we're effectively talking about the same thing sans files.  This
> solves the problem, but as I mentioned before...

Same idea, but your solution is far more elegant I think.

> I do have a desire to seek out the
> sources of the latency and cpu usage... I'm not really sure of how to go
> about it though.

We already gave you the biggest cause of your latency, which is garbage
collection/wear leveling.  You can't see inside the SSDs, but you can
see the latency jump with either top (%wa) or iostat (await,
milliseconds).  Run

iostat -x -d 1 20

and you get 20 reports 1 second apart.  1s is minimum granularity.  This
should clearly show the latency spikes caused by the SSDs.  Maybe even
execute it for 60 seconds and pipe to a file.
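Once you have the log, a small filter can flag the spikes.  A sketch,
assuming the sysstat ~9.x `iostat -x` column layout where `await` is the
10th whitespace-separated field (newer sysstat versions split it into
`r_await`/`w_await`, so check your header line and adjust the index):

```python
def await_spikes(iostat_lines, threshold_ms=50.0):
    """Return (device, await_ms) pairs for lines whose await column
    exceeds the threshold; header and non-numeric lines are skipped."""
    spikes = []
    for line in iostat_lines:
        fields = line.split()
        if len(fields) < 10:
            continue
        try:
            await_ms = float(fields[9])  # 10th field: await, in ms
        except ValueError:
            continue  # skips the "Device: ... await svctm %util" header
        if await_ms > threshold_ms:
            spikes.append((fields[0], await_ms))
    return spikes
```

Feed it the lines of the captured log and it will tell you which device
spiked, and when, if you note the sample number.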

Regarding CPU burn, the quickest, and probably least exact, way to see
this is with top.  On Linux sorting top by CPU usage should be the
default.  If not just hit Shift+P to toggle to that sort method.  This
should be sufficient to find out who is eating the cycles.  I'd think
running top and iostat while pushing 3 streams should do the trick.  But
I'm sure you've already looked at top.  Which makes me wonder why you
were unable to see what's burning the cycles.

> Agreed that it's the easiest and cheapest solution. Average performance
> probably won't change, but bursts of CPU will, as it compensates for
> high-latency writes in future cycles... this is undesirable but I think OK
> (the important stuff is at high priority on SCHED_RR; these serialization
> threads are high-priority SCHED_OTHER).  Again, I haven't had time to test
> it as I've been putting out other fires.

Keep us posted.  BTW, do you mind sharing the make/model of that mobo,
and exactly which model that i7 is?

-- 
Stan



* Re: realtime section bugs still around
  2012-08-02 10:39                   ` Stan Hoeppner
@ 2012-08-03 11:28                     ` Stan Hoeppner
  0 siblings, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2012-08-03 11:28 UTC (permalink / raw)
  To: stan; +Cc: Jason Newton, xfs

On 8/2/2012 5:39 AM, Stan Hoeppner wrote:

> We already gave you the biggest cause of your latency, which is garbage
> collection/wear leveling.  You can't see inside the SSDs, but you can
> see the latency jump with either top (%wa) or iostat (await,
> milliseconds).  Run
> 
> iostat -x -d 1 20
> 
> and you get 20 reports 1 second apart.  1s is minimum granularity.  This
> should clearly show the latency spikes caused by the SSDs.  Maybe even
> execute it for 60 seconds and pipe to a file.

The above assumes Linux can see the individual devices.  I've never used
Intel's fakeraid.  If its driver presents a single device to the kernel
instead of both SSD devices, iostat won't show which SSD's garbage
collection is kicking in and/or when.  It would be most beneficial if
you could see the iostat data for both SSD devices as it would tell you
exactly when each drive's GC/leveling kicks in.  If the Intel fakeraid
doesn't allow you to see both devices, you'll need to switch to md/RAID.
I'm sure that will be problematic as you're very likely booting from
the Intel RAIDed SSD device.

-- 
Stan




end of thread, other threads:[~2012-08-03 11:28 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2012-07-27  8:14 realtime section bugs still around Jason Newton
2012-07-27  9:56 ` Stan Hoeppner
2012-07-30  3:03 ` Dave Chinner
     [not found]   ` <CAGou9MheeBWxajd65szNfDB2L+VVoZ7SypEdUKj7np3L0H8fHA@mail.gmail.com>
2012-07-31 23:01     ` Jason Newton
2012-07-31 23:46       ` Stan Hoeppner
     [not found]         ` <CAGou9MhneejOuhX4c8G06c3Zh7dxF-OtZ+=mT-7fho_u1Q3zWw@mail.gmail.com>
2012-08-01  3:55           ` Stan Hoeppner
2012-08-01  5:55             ` Jason Newton
2012-08-02  0:39               ` Stan Hoeppner
2012-08-02  2:38                 ` Jason Newton
2012-08-02 10:39                   ` Stan Hoeppner
2012-08-03 11:28                     ` Stan Hoeppner
