* RFC: use TRIM data from filesystems to speed up array rebuild?

From: Chris Friesen
Date: 2012-09-04 18:06 UTC
To: Neil Brown, linux-raid

Hi,

I'm not really a filesystem guy so this may be a really dumb question.

We currently have an issue where we have a ~1TB RAID1 array that is mostly
given over to LVM. If we swap one of the disks it will rebuild everything,
even though we may only be using a small fraction of the space.

This got me thinking. Has anyone given thought to using the TRIM
information from filesystems to allow the RAID code to maintain a bitmask
of used disk blocks and only sync the ones that are actually used?

Presumably this bitmask would itself need to be stored on the disk.

Thanks,
Chris

--
Chris Friesen
Software Designer
3500 Carling Avenue
Ottawa, Ontario K2H 8E9
www.genband.com
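[Editor's note: the proposal above can be sketched as a coarse region bitmap that is set on writes, cleared by TRIM/discard, and consulted at rebuild time. This is a hypothetical illustration only, not md's actual bitmap code; `REGION_SIZE`, `UsedRegionBitmap` and the method names are invented for the sketch.]

```python
# Sketch of the proposed idea: track which regions of the array hold live
# data, clear bits on TRIM/discard, and rebuild only the marked regions.

REGION_SIZE = 1 << 20  # 1 MiB granularity, as discussed later in the thread

class UsedRegionBitmap:
    def __init__(self, device_size):
        self.nregions = (device_size + REGION_SIZE - 1) // REGION_SIZE
        self.used = [False] * self.nregions

    def _regions(self, offset, length):
        # All regions overlapped by [offset, offset + length)
        return range(offset // REGION_SIZE,
                     (offset + length - 1) // REGION_SIZE + 1)

    def on_write(self, offset, length):
        for r in self._regions(offset, length):
            self.used[r] = True

    def on_discard(self, offset, length):
        # Only clear regions wholly covered by the discard
        first = -(-offset // REGION_SIZE)        # round start up
        last = (offset + length) // REGION_SIZE  # round end down (exclusive)
        for r in range(first, last):
            self.used[r] = False

    def regions_to_rebuild(self):
        # A replacement disk only needs these regions copied
        return [r for r, u in enumerate(self.used) if u]
```

With a mostly empty 1TB array, `regions_to_rebuild()` would cover only the small used fraction, which is exactly the saving the question is after.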
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: Ric Wheeler
Date: 2012-09-04 19:11 UTC
To: Chris Friesen; +Cc: Neil Brown, linux-raid

On 09/04/2012 02:06 PM, Chris Friesen wrote:
> This got me thinking. Has anyone given thought to using the TRIM
> information from filesystems to allow the RAID code to maintain a bitmask
> of used disk blocks and only sync the ones that are actually used?
>
> Presumably this bitmask would itself need to be stored on the disk.

Device mapper has a "thin" target now that tracks blocks that are allocated
or free (and works with discard).

That might be a basis for doing a focused RAID rebuild.

Ric
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: NeilBrown
Date: 2012-09-04 20:24 UTC
To: Ric Wheeler; +Cc: Chris Friesen, linux-raid

On Tue, 04 Sep 2012 15:11:26 -0400 Ric Wheeler <ricwheeler@gmail.com> wrote:
> Device mapper has a "thin" target now that tracks blocks that are
> allocated or free (and works with discard).
>
> That might be a basis for doing a focused RAID rebuild.

I wonder how....

Maybe the block-layer interface could grow something equivalent to
"SEEK_HOLE" and friends so that the upper level can find "holes" and
"allocated space" in the underlying device.

I wonder if it is time to discard the 'block device' abstraction and just
use files everywhere ... but I seriously doubt it.

NeilBrown
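[Editor's note: the file-level precedent for the interface Neil describes already exists: lseek(2) supports SEEK_HOLE and SEEK_DATA on regular files on Linux, so a rebuild-style scan can enumerate just the allocated extents of a sparse file. A minimal sketch of that enumeration; `allocated_extents` is a name invented here.]

```python
import os

def allocated_extents(path):
    """Enumerate (offset, length) pairs of allocated data in a sparse file
    using lseek(2) with SEEK_DATA/SEEK_HOLE (Linux, Python 3.3+).

    On filesystems without hole reporting, the whole file is returned as
    one data extent, which is still a correct (if pessimistic) answer.
    """
    extents = []
    with open(path, "rb") as f:
        fd = f.fileno()
        end = os.fstat(fd).st_size
        pos = 0
        while pos < end:
            try:
                data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:   # ENXIO: no more data beyond pos
                break
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            extents.append((data, hole - data))
            pos = hole
    return extents
```

A block device offers no such call today; Neil's point is that growing an equivalent in the block layer (or querying dm-thin's metadata) would let md skip the holes during resync.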
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: Ric Wheeler
Date: 2012-09-04 22:59 UTC
To: NeilBrown; +Cc: Chris Friesen, linux-raid, Joe Thornber, device-mapper development

On 09/04/2012 04:24 PM, NeilBrown wrote:
> Maybe the block-layer interface could grow something equivalent to
> "SEEK_HOLE" and friends so that the upper level can find "holes" and
> "allocated space" in the underlying device.
> I wonder if it is time to discard the 'block device' abstraction and just
> use files everywhere ... but I seriously doubt it.

I don't think that we have to go to that extreme, but I think it would be
very useful to see if the device mapper people have ideas on how the thin
target might be used in combination with MD :)

ric
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: Benjamin ESTRABAUD
Date: 2012-09-06 17:17 UTC
To: NeilBrown; +Cc: Ric Wheeler, Chris Friesen, linux-raid

On 04/09/12 21:24, NeilBrown wrote:
> Maybe the block-layer interface could grow something equivalent to
> "SEEK_HOLE" and friends so that the upper level can find "holes" and
> "allocated space" in the underlying device.

Hi,

I've got a brief question about this feature, which seems extremely
promising. You mentioned on your blog:

"A 'write' to a non-in-sync region should cause that region to be
resynced. Writing zeros would in some sense be ideal, but to do that we
would have to block the write, which would be unfortunate."

So, if we had a write on a "non-in-sync" region (let's imagine the bitmap
allows for 1M granularity), we would compute the parity of every stripe
that this write "touches" and update it? And is the point of zeroing to
save the time spent reading and writing stripe data to compute parity,
including for stripes in this "non-in-sync" region that the write wouldn't
affect, allowing us to then flip the entire region to "clean"?

Would this open the door to some "thin provisioned" MD RAID, where one
could grow the underlying devices (in the case of a RAID built on top of,
say, LVM devices) and mark the new "space" as "non-in-sync" without
disrupting (slowing) operations on the array with a sync?

In any case, this seems like a great feature.

Regards,
Ben.
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: David Brown
Date: 2012-09-06 18:42 UTC
To: Benjamin ESTRABAUD; +Cc: NeilBrown, Ric Wheeler, Chris Friesen, linux-raid

On 06/09/12 19:17, Benjamin ESTRABAUD wrote:
> So, if we had a write on a "non-in-sync" region (let's imagine the bitmap
> allows for 1M granularity), we would compute the parity of every stripe
> that this write "touches" and update it? And is the point of zeroing to
> save the time spent reading and writing stripe data to compute parity
> ... allowing us to then flip the entire region to "clean"?

That would, I think, be correct. All zeros are the easiest case to
calculate: the parities (raid5 and raid6) are all zeros too. It is also
the ideal pattern to write to SSDs; many SSDs these days implement
transparent compression, and you don't get more compressible than zeros!

> Would this open the door to some "thin provisioned" MD RAID, where one
> could grow the underlying devices (in the case of a RAID built on top of,
> say, LVM devices) and mark the new "space" as "non-in-sync" without
> disrupting (slowing) operations on the array with a sync?

Yes, that would work. More importantly (because it would affect more
people), it means that the creation of an md raid array on top of disks or
partitions would immediately be "in sync", and there would be no need for a
long and effectively useless re-sync process at creation.

> In any case, this seems like a great feature.

Yes indeed.
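[Editor's note: David's claim about zeros is easy to check. RAID5's P parity is a byte-wise XOR of the data chunks, and RAID6's Q syndrome is a weighted XOR-sum in GF(2^8) with generator 2 (the scheme the Linux md driver uses); both map all-zero data to all-zero parity. A small sketch with helper names invented for illustration.]

```python
from functools import reduce

def raid5_parity(chunks):
    """P parity: byte-wise XOR across the data chunks of one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def gf_mul(a, b):
    """Multiply in GF(2^8) with the RAID6 polynomial x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def raid6_q(chunks):
    """Q syndrome: XOR-sum of g^i * D_i over GF(2^8), generator g = 2."""
    q = bytearray(len(chunks[0]))
    coeff = 1
    for chunk in chunks:
        for j, byte in enumerate(chunk):
            q[j] ^= gf_mul(coeff, byte)
        coeff = gf_mul(coeff, 2)
    return bytes(q)

zeros = [bytes(16)] * 4                  # one stripe: four all-zero data chunks
assert raid5_parity(zeros) == bytes(16)  # P of zeros is zeros
assert raid6_q(zeros) == bytes(16)       # Q of zeros is zeros
```

Because zeroed data needs zeroed parity, a region can be marked in-sync by zero-filling it, with no read-modify-write of existing stripe contents.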
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: Benjamin ESTRABAUD
Date: 2012-09-07 9:23 UTC
To: David Brown; +Cc: NeilBrown, Ric Wheeler, Chris Friesen, linux-raid

On 06/09/12 19:42, David Brown wrote:
> Yes, that would work. More importantly (because it would affect more
> people), it means that the creation of an md raid array on top of disks
> or partitions would immediately be "in sync", and there would be no need
> for a long and effectively useless re-sync process at creation.

Thank you very much for your reply!

Regards,
Ben.
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: NeilBrown
Date: 2012-09-04 20:21 UTC
To: Chris Friesen; +Cc: linux-raid

On Tue, 04 Sep 2012 12:06:26 -0600 Chris Friesen <chris.friesen@genband.com> wrote:
> This got me thinking. Has anyone given thought to using the TRIM
> information from filesystems to allow the RAID code to maintain a
> bitmask of used disk blocks and only sync the ones that are actually
> used?
>
> Presumably this bitmask would itself need to be stored on the disk.

Something like this?
http://neil.brown.name/blog/20110216044002#5

NeilBrown
* Re: RFC: use TRIM data from filesystems to speed up array rebuild?

From: Chris Friesen
Date: 2012-09-04 20:28 UTC
To: NeilBrown; +Cc: linux-raid

On 09/04/2012 02:21 PM, NeilBrown wrote:
> Something like this?
> http://neil.brown.name/blog/20110216044002#5

Something like that would indeed cover the use-case that triggered this.

Chris
Thread overview: 9+ messages

2012-09-04 18:06 RFC: use TRIM data from filesystems to speed up array rebuild? Chris Friesen
2012-09-04 19:11 ` Ric Wheeler
2012-09-04 20:24   ` NeilBrown
2012-09-04 22:59     ` Ric Wheeler
2012-09-06 17:17     ` Benjamin ESTRABAUD
2012-09-06 18:42       ` David Brown
2012-09-07  9:23         ` Benjamin ESTRABAUD
2012-09-04 20:21 ` NeilBrown
2012-09-04 20:28   ` Chris Friesen