* Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
@ 2015-12-23 19:20 Doug Dumitru
  2015-12-23 20:28 ` Robert Kierski
  2015-12-25  7:58 ` NeilBrown
  0 siblings, 2 replies; 6+ messages in thread
From: Doug Dumitru @ 2015-12-23 19:20 UTC (permalink / raw)
  To: linux-raid

The issue:

The background thread in RAID-5 can wake up in the middle of a process
populating stripe cache entries with a long write.  If the long write
contains a complete stripe, the background thread "should" be able to
process the request without doing any reads.

Sometimes the background thread is too quick at starting up a write
and schedules a RMW (Read Modify Write) even though the needed blocks
will soon be available.

Seeing this happen:

You can see this happen by creating an MD set with a small chunk size
and then doing DIRECT_IO writes that are exactly aligned on a stripe.
For example, with 4 disks and 64K chunks, a full data stripe is 192K,
so write 192K blocks aligned on 192K boundaries.  You can do this from
C or with 'dd' or 'fio'.

If you have this running, you can then run iostat.  Ideally you should
see absolutely no read activity on the disks; any reads that do show up
come from the unwanted RMW.

The probability of this happening goes up when there are more disks.
It may also go up the faster the disks are.  My use case is 24 SSDs.
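
For illustration, a minimal C version of the aligned DIRECT_IO test
described above might look like the sketch below.  The device path
(/dev/md0), geometry, and loop count are assumptions, not taken from
the original message; adjust them to the array under test.

    /* Write full, aligned stripes to a RAID-5 made from 4 disks with a
     * 64K chunk, so each write covers exactly (4 - 1) * 64K = 192K of
     * data and no read-modify-write should be needed.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define STRIPE_DATA (192 * 1024)    /* (ndisks - 1) * chunk_size */

    int main(void)
    {
        void *buf;
        int fd, i;

        /* O_DIRECT buffers must be suitably aligned. */
        if (posix_memalign(&buf, 4096, STRIPE_DATA))
            return 1;
        memset(buf, 0xab, STRIPE_DATA);

        fd = open("/dev/md0", O_WRONLY | O_DIRECT);
        if (fd < 0)
            return 1;

        /* Each write starts on a stripe boundary and covers exactly
         * one full stripe of data.
         */
        for (i = 0; i < 1024; i++)
            if (pwrite(fd, buf, STRIPE_DATA,
                       (off_t)i * STRIPE_DATA) != STRIPE_DATA)
                return 1;

        close(fd);
        free(buf);
        return 0;
    }

While this runs, watching the member disks with iostat shows whether
any reads are being issued.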

The problem with this:

There are really three issues.

1)  The code does not need to work this way.  It is not "broken" but
just seems wrong.
2)  There is a performance penalty here.
3)  There is a Flash wear penalty here.

It is 3) that most interests me.

The fix:

Create a waitq- or semaphore-based lock so that if a write includes a
complete stripe, the background thread will wait for the write to
completely populate the stripe cache.

I would do this with a small array of locks.  When a write includes a
complete stripe, it sets a lock (stripe_number % sizeof_lock_array).
This lock is released as soon as the write finishes populating the
stripe cache.  The background thread checks this lock before it starts
a write.  If the lock is set, it waits until the stripe cache is
completely populated, which should eliminate the RMW.

If no writes are full stripes, then the lock never gets set, so most
code runs without any real overhead.
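
A minimal sketch of what such a lock array could look like on the
kernel side is below.  The names (stripe_lock_array, STRIPE_LOCKS, and
the helper functions) are purely illustrative and do not exist in
md/raid5; this is the shape of the idea, not a tested patch.

    #include <linux/semaphore.h>
    #include <linux/types.h>

    #define STRIPE_LOCKS 64   /* power of two, so '%' is a cheap mask */

    struct stripe_lock_array {
        struct semaphore locks[STRIPE_LOCKS];
    };

    static void stripe_locks_init(struct stripe_lock_array *sla)
    {
        int i;

        for (i = 0; i < STRIPE_LOCKS; i++)
            sema_init(&sla->locks[i], 1);
    }

    /* Writer side: hold the lock while a full-stripe write populates
     * the stripe cache entries.
     */
    static void stripe_lock(struct stripe_lock_array *sla, sector_t stripe_nr)
    {
        down(&sla->locks[stripe_nr % STRIPE_LOCKS]);
    }

    static void stripe_unlock(struct stripe_lock_array *sla, sector_t stripe_nr)
    {
        up(&sla->locks[stripe_nr % STRIPE_LOCKS]);
    }

    /* Background thread: take and immediately drop the lock, so it
     * only blocks while a full-stripe write is still filling the cache.
     */
    static void stripe_wait_for_population(struct stripe_lock_array *sla,
                                           sector_t stripe_nr)
    {
        down(&sla->locks[stripe_nr % STRIPE_LOCKS]);
        up(&sla->locks[stripe_nr % STRIPE_LOCKS]);
    }

The background thread's check is just a down()/up() pair, so when no
full-stripe write holds the lock it costs one uncontended semaphore
operation.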

Implementing this:

I am happy to implement this.  I have quite a bit of experience with
lock structures like this.  I can also test on x86 and x86_64, but
will need help with other arch's.

Then again, if this is too much of an "edge case", I will just keep my
patches in-house.

-- 
Doug Dumitru
WildFire Storage


* RE: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
  2015-12-23 19:20 Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes Doug Dumitru
@ 2015-12-23 20:28 ` Robert Kierski
  2015-12-23 20:37   ` Doug Dumitru
  2015-12-25  7:58 ` NeilBrown
  1 sibling, 1 reply; 6+ messages in thread
From: Robert Kierski @ 2015-12-23 20:28 UTC (permalink / raw)
  To: doug, linux-raid

Hey Doug,

I have a comment about the last line of your message -- 'Then again, if this is too much of an "edge case", I will just keep my patches in-house.'

As the use of SSDs in RAID configurations becomes more common, and SSDs become faster, the case you're talking about will be a major issue preventing customers from getting all that they expect from the equipment they purchased.  I would encourage you to continue pursuing better performance.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461



* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
  2015-12-23 20:28 ` Robert Kierski
@ 2015-12-23 20:37   ` Doug Dumitru
  0 siblings, 0 replies; 6+ messages in thread
From: Doug Dumitru @ 2015-12-23 20:37 UTC (permalink / raw)
  To: Robert Kierski; +Cc: linux-raid

Bob,

In thinking about this, I suspect this is somewhat less of an edge
case than just the full stripe test.  Even partial stripes probably
fire early, increasing the amount of IO.

My issue with SSD wear is that my particular application stack
generates writes that are "perfect", and even very subtle deviations
from this can scramble some SSDs' FTLs.  Workloads that are more
typical will already push the SSDs into heavier wear amplification, so
this issue might be in the noise.



On Wed, Dec 23, 2015 at 12:28 PM, Robert Kierski <rkierski@cray.com> wrote:
> Hey Doug,
>
> I have a comment about the last line of your message -- 'Then again, if this is too much of an "edge case", I will just keep my patches in-house.'
>
> As the use of SSDs in RAID configurations becomes more common, and SSDs become faster, the case you're talking about will be a major issue preventing customers from getting all that they expect from the equipment they purchased.  I would encourage you to continue pursuing better performance.
>
> Bob Kierski
> Senior Storage Performance Engineer
> Cray Inc.
> 380 Jackson Street
> Suite 210
> St. Paul, MN 55101
> Tele: 651-967-9590
> Fax:  651-605-9001
> Cell: 651-890-7461
>



-- 
Doug Dumitru
WildFire Storage


* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
  2015-12-23 19:20 Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes Doug Dumitru
  2015-12-23 20:28 ` Robert Kierski
@ 2015-12-25  7:58 ` NeilBrown
  2015-12-30 19:28   ` Doug Dumitru
  1 sibling, 1 reply; 6+ messages in thread
From: NeilBrown @ 2015-12-25  7:58 UTC (permalink / raw)
  To: doug, linux-raid


On Thu, Dec 24 2015, Doug Dumitru wrote:

> The issue:
>
> The background thread in RAID-5 can wake up in the middle of a process
> populating stripe cache entries with a long write.  If the long write
> contains a complete stripe, the background thread "should" be able to
> process the request without doing any reads.
>
> Sometimes the background thread is too quick at starting up a write
> and schedules a RMW (Read Modify Write) even though the needed blocks
> will soon be available.
>
> Seeing this happen:
>
> You can see this happen by creating an MD set with a small chunk size
> and then doing DIRECT_IO writes that are exactly aligned on a stripe.
> For example, with 4 disks and 64K chunks, a full data stripe is 192K,
> so write 192K blocks aligned on 192K boundaries.  You can do this from
> C or with 'dd' or 'fio'.
>
> If you have this running, you can then run iostat.  Ideally you should
> see absolutely no read activity on the disks; any reads that do show up
> come from the unwanted RMW.
>
> The probability of this happening goes up when there are more disks.
> It may also go up the faster the disks are.  My use case is 24 SSDs.
>
> The problem with this:
>
> There are really three issues.
>
> 1)  The code does not need to work this way.  It is not "broken" but
> just seems wrong.
> 2)  There is a performance penalty here.
> 3)  There is a Flash wear penalty here.
>
> It is 3) that most interests me.
>
> The fix:
>
> Create a waitq- or semaphore-based lock so that if a write includes a
> complete stripe, the background thread will wait for the write to
> completely populate the stripe cache.
>
> I would do this with a small array of locks.  When a write includes a
> complete stripe, it sets a lock (stripe_number % sizeof_lock_array).
> This lock is released as soon as the write finishes populating the
> stripe cache.  The background thread checks this lock before it starts
> a write.  If the lock is set, it waits until the stripe cache is
> completely populated, which should eliminate the RMW.
>
> If no writes are full stripes, then the lock never gets set, so most
> code runs without any real overhead.
>
> Implementing this:
>
> I am happy to implement this.  I have quite a bit of experience with
> lock structures like this.  I can also test on x86 and x86_64, but
> will need help with other arch's.
>
> Then again, if this is too much of an "edge case", I will just keep my
> patches in-house.

Hi,
 this is certainly something that needs fixing.  I can't really say if
 your approach would work or not without seeing it and testing it on a
 variety of workloads.

 Certainly if you do implement something, please post it for others to
 test and review.  If it makes measurable improvements without causing
 significant regressions, it will likely be included upstream.

Thanks,
NeilBrown



* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
  2015-12-25  7:58 ` NeilBrown
@ 2015-12-30 19:28   ` Doug Dumitru
  2016-01-11  7:44     ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Doug Dumitru @ 2015-12-30 19:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

A new lock for RAID-4/5/6 to minimize read/modify/write operations.

The Problem:  The background thread for raid can wake up
asynchronously and will sometimes start processing a write
before the writing thread has finished updating the stripe cache
blocks.  If the calling thread's write was a long write (longer than
the chunk size), then the background thread will configure a raid write
operation that is sub-optimal, resulting in extra IO operations, slower
performance, and higher wear on Flash storage.

The Easy Fix:  When the calling thread has a long write, it "locks"
the stripe number with a semaphore.  When the background thread wakes
up and starts working on a stripe, it locks the same lock, and then
immediately releases it.  This way the background thread will wait for
the write to fully populate the stripe caches before it starts to build
a write request.

The Lock Structure:  ... is just a small, power-of-two-sized array of
semaphores.  The lock is picked by "stripe_number %
number_of_semaphores".  There will be collisions, but they should be
short, and since a lot of IO happens for each operation, a semaphore
is cheap enough.

Memory Usage:  With 64 semaphores, this adds about 1.5K to the raid
control block (a plain struct semaphore is 24 bytes on x86_64 without
lock debugging, so 64 * 24 = 1536 bytes).

A More Comprehensive Fix:  This Easy Fix assumes that updates are
contained within a single BIO request.  A better fix is to lock linear
operations that span BIOs by looking back on the queue.  I don't have
much experience with the queueing code, but this is probably workable
with only a little more complexity, provided that you don't try to do
this across cores and queues; then again, I may be totally missing the
structures here.

The Really High Performance Fix:  If the application is well enough
behaved to write complete, perfect stripes contained in a single BIO
request, then the whole stripe cache logic can be bypassed.  This lets
you submit the member disk IO operations directly from the calling
thread.  I have this running in a patch in the field and it works
well, but the use case is very limited and something probably breaks
with more "normal" IO patterns.  I have hit 11GB/sec with RAID-5 and
8GB/sec with RAID-6 this way with 24 SSDs.

Tweak-ability:  All of these changes can be exposed in /sys to allow
sysadmins to tune their system, possibly enabling or disabling
features.  Most useful for early code that might have broken use
cases.  Then again, too many knobs sometimes just increase confusion.
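
As a first pass, even a plain module parameter would give a simple
on/off knob.  A hypothetical sketch (assuming the code lives in the
raid456 module; the parameter name is made up) would be:

    #include <linux/module.h>

    /* Hypothetical tunable: enable/disable the full-stripe lock. */
    static bool stripe_lock_enabled = true;
    module_param(stripe_lock_enabled, bool, 0644);
    MODULE_PARM_DESC(stripe_lock_enabled,
                     "Wait for full-stripe writes to populate the stripe cache");

That would surface as
/sys/module/raid456/parameters/stripe_lock_enabled and could later be
promoted to a per-array attribute under /sys/block/mdX/md/ if it
proves useful.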

Asking for Feedback:  I am happy to write "all of the above" and
submit it and work with the group to get it tested etc.  If this
interests you, please comment on how far you think I should go.  Also,
if there are any notes on "submission style", how and where to post
patches, which kernel version to patch/develop against, documentation
style, sign-off requirements, etc. please point me at them.

Thanks in advance,

Doug Dumitru
WildFire Storage


* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
  2015-12-30 19:28   ` Doug Dumitru
@ 2016-01-11  7:44     ` NeilBrown
  0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2016-01-11  7:44 UTC (permalink / raw)
  To: doug; +Cc: linux-raid


On Thu, Dec 31 2015, Doug Dumitru wrote:

> A new lock for RAID-4/5/6 to minimize read/modify/write operations.
>
> The Problem:  The background thread for raid can wake up
> asynchronously and will sometimes start processing a write
> before the writing thread has finished updating the stripe cache
> blocks.  If the calling thread's write was a long write (longer than
> the chunk size), then the background thread will configure a raid write
> operation that is sub-optimal, resulting in extra IO operations, slower
> performance, and higher wear on Flash storage.
>
> The Easy Fix:  When the calling thread has a long write, it "locks"
> the stripe number with a semaphore.  When the background thread wakes
> up and starts working on a stripe, it locks the same lock, and then
> immediately releases it.  This way the background thread will wait for
> the write to fully populate the stripe caches before it starts to build
> a write request.

The code does something a lot like this already.
When the filesystem starts a write, it calls "blk_start_plug()".
When it finishes, it calls "blk_finish_plug()".

md/raid5 detects this plugging, and when it gets a bio and attaches it to
a "stripe_head", the stripe_head is queued on a delayed-list which is
not processed until blk_finish_plug() is called.

So we should already not start processing a request until we have the
whole request.  But this sometimes doesn't work.  Understanding why it
doesn't work and what actually happens would be an important first step
to fixing the problem.

The plugging doesn't guarantee that a request will be delayed - doing
that can too easily lead to deadlocks.  Rather it just discourages early
processing.  If a memory shortage appears and RAID5 could free up some
memory by processing a request earlier, it is better to do that than to
wait indefinitely and possibly deadlock.
It is possible that this safety-valve code is triggering too early in
some cases.
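
For reference, the plugging pattern described above looks roughly like
this on the submitting side (a simplified sketch; the raid5 delayed-list
handling itself is not shown, and the exact bio submission helper varies
between kernel versions):

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /* Batch a set of already-prepared bios under one plug.  md/raid5
     * notices that the submitting task is plugged and defers stripe
     * processing until blk_finish_plug() runs.
     */
    static void submit_plugged(struct bio **bios, int nr)
    {
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);              /* start batching on this task */
        for (i = 0; i < nr; i++)
            generic_make_request(bios[i]);  /* queued, not yet processed */
        blk_finish_plug(&plug);             /* unplug: deferred work may run */
    }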

>
> The Really High Performance Fix:  If the application is well enough
> behaved to write complete, perfect stripes contained in a single BIO
> request, then the whole stripe cache logic can be bypassed.  This lets
> you submit the member disk IO operations directly from the calling
> thread.  I have this running in a patch in the field and it works
> well, but the use case is very limited and something probably breaks
> with more "normal" IO patterns.  I have hit 11GB/sec with RAID-5 and
> 8GB/sec with RAID-6 this way with 24 SSDs.

It would certainly be interesting to find a general solution which
allowed full stripes to be submitted without a context switch.  It
should be possible.  There is already code which avoids copying from the
filesystem buffer into the stripe cache.  Detecting complete stripes and
submitting them immediately should be possible.  Combining several
stripe_heads into a whole stripe should be possible in many cases using
the new stripe-batching.


>
> Tweak-ability:  All of these changes can be exposed in /sys to allow
> sysadmins to tune their system, possibly enabling or disabling
> features.  Most useful for early code that might have broken use
> cases.  Then again, too many knobs sometimes just increase confusion.
>
> Asking for Feedback:  I am happy to write "all of the above" and
> submit it and work with the group to get it tested etc.  If this
> interests you, please comment on how far you think I should go.  Also,
> if there are any notes on "submission style", how and where to post
> patches, which kernel version to patch/develop against, documentation
> style, sign-off requirements, etc. please point me at them.
>

Patches should be sent to linux-raid@vger.kernel.org
Documentation/SubmittingPatches should describe all the required style.

I think a really important first step is to make sure you understand how
the current code is supposed to work, and why it fails sometimes.  The
above short notes should get you started in following what should
happen.

I really don't think locks as you describe them would be part of a
solution.  Flag bits in the stripe_heads, different queues of
stripe_heads and different queuing disciplines might be.

Thanks,
NeilBrown



