* Hot data tracking / hybrid storage
@ 2016-05-15 12:12 Ferry Toth
  2016-05-15 21:11 ` Duncan
  2016-05-16 11:25 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 26+ messages in thread
From: Ferry Toth @ 2016-05-15 12:12 UTC (permalink / raw)
  To: linux-btrfs

Is there anything going on in this area?

We have had btrfs in RAID10 on 4 HDD's for many years now, with a rotating 
scheme of snapshots for easy backup. Less than 10% of the files (by bytes) 
change between the oldest snapshot and the current state.

However, the filesystem seems to become very slow, probably due to the 
RAID10 and the snapshots.

It would be fantastic if we could just add 4 SSD's to the pool and btrfs 
would just magically prefer to put often accessed files there and move 
older or less popular files to the HDD's.

In my simple mind this cannot be done easily using bcache, as that would 
require completely rebuilding the file system on top of bcache (you cannot 
just add a few SSD's to the pool), while implementing a cache inside btrfs 
is probably a complex thing with lots of overhead.

Simply telling the allocator to prefer new files to go to the ssd and 
move away unpopular stuff to hdd during balance should do the trick, or am 
I wrong?

Are none of the big users looking into this?

Ferry



* Re: Hot data tracking / hybrid storage
  2016-05-15 12:12 Hot data tracking / hybrid storage Ferry Toth
@ 2016-05-15 21:11 ` Duncan
  2016-05-15 23:05   ` Kai Krakow
  2016-05-16 11:25 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 26+ messages in thread
From: Duncan @ 2016-05-15 21:11 UTC (permalink / raw)
  To: linux-btrfs

Ferry Toth posted on Sun, 15 May 2016 12:12:09 +0000 as excerpted:

> Is there anything going on in this area?
> 
> We have btrfs in RAID10 using 4 HDD's for many years now with a rotating
> scheme of snapshots for easy backup. <10% files (bytes) change between
> oldest snapshot and the current state.
> 
> However, the filesystem seems to become very slow, probably due to the
> RAID10 and the snapshots.
> 
> It would be fantastic if we could just add 4 SSD's to the pool and btrfs
> would just magically prefer to put often accessed files there and move
> older or less popular files to the HDD's.
> 
> In my simple mind this can not be done easily using bcache as that would
> require completely rebuilding the file system on top of bcache (can not
> just add a few SSD's to the pool), while implementing a cache inside
> btrfs is probably a complex thing with lots of overhead.
> 
> Simply telling the allocator to prefer new files to go to the ssd and
> move away unpopular stuff to hdd during balance should do the trick, or
> am I wrong?
> 
> Are none of the big users looking into this?

Hot data tracking remains on the list of requested features, but at this 
point there are far more features on that list than developers working on 
them, so unless it's a developer's (or their employer/sponsor's) high 
priority, it's unlikely to see the light of day for some time yet, possibly 
years.

And given the availability of two hybrid solutions in the form of bcache 
and a device-mapper solution (the name of which I can't recall ATM), 
priority for a btrfs-builtin solution isn't going to be as high as it 
might be otherwise, so...


The good news for the dmapper solution is that AFAIK, it doesn't require 
reformatting like bcache does.

The bad news for it is that while we have list regulars using btrfs on 
bcache, so that's a (relatively) well known and tested solution, we're 
lacking any regulars known to be using btrfs on the dmapper solution.  
Additionally, some posters looking at the dmapper choice have reported 
that it's not as mature as bcache and not really ready for use with 
btrfs, which is itself still stabilizing and maturing; they weren't 
ready to deal with the complexities and reliability issues of two still 
stabilizing and maturing subsystems stacked one on top of the other.

Of course, that does give you the opportunity of being that list regular 
using btrfs on top of that dmapper solution, should you be willing to 
undertake that task. =:^)


Meanwhile, you did mention backups.  Of course, as btrfs /is/ still 
maturing, use without backups (and snapshots aren't backups) ready if 
needed is highly discouraged in any case, so you do have the option of 
simply blowing away the existing filesystem and either redoing it as-is, 
which will likely speed it up dramatically for a few more years, or 
throwing in those ssds and redoing it with bcache.

It's also worth noting that if you can add 4 ssds to the existing set, 
you obviously have the hookups available for four more devices.  With 
hdds as cheap as they are compared to ssds, if necessary you should be 
able to throw four more hdds in there, formatting them with bcache or not 
first as desired, create a new btrfs on them, and copy everything over.  
After that you could yank the old ones for use as spares or whatever, and 
replace them with ssds, which could be set up with bcache as well and 
then activated.  Given the cost of a single ssd, the total cost of four 
of them plus four hdds should still be below the cost of five ssds, and 
you're still not using more than the 8 total hookups you had already 
mentioned, so it should be quite reasonable to do it that way.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Hot data tracking / hybrid storage
  2016-05-15 21:11 ` Duncan
@ 2016-05-15 23:05   ` Kai Krakow
  2016-05-17  6:27     ` Ferry Toth
  0 siblings, 1 reply; 26+ messages in thread
From: Kai Krakow @ 2016-05-15 23:05 UTC (permalink / raw)
  To: linux-btrfs

On Sun, 15 May 2016 21:11:11 +0000 (UTC),
Duncan <1i5t5.duncan@cox.net> wrote:

> Ferry Toth posted on Sun, 15 May 2016 12:12:09 +0000 as excerpted:
> 
> > Is there anything going on in this area?
> > 
> > We have btrfs in RAID10 using 4 HDD's for many years now with a
> > rotating scheme of snapshots for easy backup. <10% files (bytes)
> > change between oldest snapshot and the current state.
> > 
> > However, the filesystem seems to become very slow, probably due to
> > the RAID10 and the snapshots.
> > 
> > It would be fantastic if we could just add 4 SSD's to the pool and
> > btrfs would just magically prefer to put often accessed files there
> > and move older or less popular files to the HDD's.
> > 
> > In my simple mind this can not be done easily using bcache as that
> > would require completely rebuilding the file system on top of
> > bcache (can not just add a few SSD's to the pool), while
> > implementing a cache inside btrfs is probably a complex thing with
> > lots of overhead.
> > 
> > Simply telling the allocator to prefer new files to go to the ssd
> > and move away unpopular stuff to hdd during balance should do the
> > trick, or am I wrong?
> > 
> > Are none of the big users looking into this?  
> 
> Hot data tracking remains on the list of requested features, but at
> this point there's far more features on that list than developers
> working on them, so unless it's a developer's (or their
> employer/sponsor's) high priority, it's unlikely to see the light of
> day for some time, years, yet.
> 
> And given the availability of two hybrid solutions in the form of
> bcache and a device-mapper solution (the name of which I can't recall
> ATM), priority for a btrfs-builtin solution isn't going to be as high
> as it might be otherwise, so...
> 
> 
> The good news for the dmapper solution is that AFAIK, it doesn't
> require reformatting like bcache does.
> 
> The bad news for it is that while we have list regulars using btrfs
> on bcache so it's a (relatively) well known and tested solution,
> we're lacking any regulars known to be using btrfs on the dmapper
> solution. Additionally, some posters looking at the dmapper choice
> have reported that it's not as mature as bcache and not really ready
> for use with btrfs, which is itself still stabilizing and maturing,
> and they weren't ready to deal with the complexities and reliability
> issues of two still stabilizing and maturing subsystems one on top of
> the other.
> 
> Of course, that does give you the opportunity of being that list
> regular using btrfs on top of that dmapper solution, should you be
> willing to undertake that task. =:^)
> 
> 
> Meanwhile, you did mention backups, and of course as btrfs /is/ still 
> maturing, use without backups (and snapshots aren't backups) ready if 
> needed is highly discouraged in any case, so you do have the option
> of simply blowing away the existing filesystem and either redoing it
> as-is, which will likely speed it up dramatically, for a few more
> years, or throwing in those ssds and redoing it with bcache.
> 
> It's also worth noting that if you can add 4 ssds to the existing
> set, you obviously have the hookups available for four more devices,
> and with hdds cheap as they are compared to ssds, if necessary you
> should be able to throw four more hdds in there, formatting them with
> bcache or not first as desired, and creating a new btrfs on them,
> then copying everything over.  After that you could yank the old ones
> for use as spares or whatever, and replace them with ssds, which
> could be setup with bcache as well and then activated.  Given the
> cost of a single ssd, the total cost of four of them plus four hdds
> should still be below the cost of five ssds, and you're still not
> using more than the 8 total hookups you had already mentioned, so it
> should be quite reasonable to do it that way.

You can get there with only one additional HDD as temporary storage.
Just connect it, format it as a bcache backing device, then do a "btrfs
dev replace". Now wipe the HDD that was freed up (use wipefs), format it
as bcache, then... well, you get the point. At the last step, remove the
remaining HDD. Now add your SSDs, format them as caching devices, and
attach each individual bcache backing HDD to an SSD caching device.
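
Roughly, the sequence would look like this (device names, the mount
point and the UUID are just placeholders - double-check the bcache and
btrfs man pages before doing this for real):

  make-bcache -B /dev/sde                  # extra HDD gets a bcache superblock
  echo /dev/sde > /sys/fs/bcache/register  # if udev didn't register it already
  btrfs replace start /dev/sda /dev/bcache0 /mnt/pool
  wipefs -a /dev/sda                       # wipe the freed disk...
  make-bcache -B /dev/sda                  # ...and repeat for the next one
  ...
  make-bcache -C /dev/sdf                  # finally, the SSD as caching device
  bcache-super-show /dev/sdf | grep cset.uuid
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach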

Devices don't need to be formatted and created at the same time. I'd
also recommend adding the SSDs only in the last step, so they don't wear
early from all the writes during device replacement.

If you want, you can add one additional step to get the temporary hard
disk back. But why not simply replace the oldest hard disk with the
newest one? Take a look at smartctl to see which is the best candidate.
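
For example (attribute names vary a bit between vendors):

  smartctl -A /dev/sda | grep -Ei 'power_on_hours|reallocated|pending'

High age and reallocated/pending sectors are good hints which disk
should go first.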

I went a similar route but without one extra HDD. I had three HDDs in
mraid1/draid0 and enough spare space. I just removed one HDD, prepared
it for bcache, then added it back and removed the next.


-- 
Regards,
Kai

Replies to list-only preferred.



* Re: Hot data tracking / hybrid storage
  2016-05-15 12:12 Hot data tracking / hybrid storage Ferry Toth
  2016-05-15 21:11 ` Duncan
@ 2016-05-16 11:25 ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-16 11:25 UTC (permalink / raw)
  To: Ferry Toth, linux-btrfs

On 2016-05-15 08:12, Ferry Toth wrote:
> Is there anything going on in this area?
>
> We have btrfs in RAID10 using 4 HDD's for many years now with a rotating
> scheme of snapshots for easy backup. <10% files (bytes) change between
> oldest snapshot and the current state.
>
> However, the filesystem seems to become very slow, probably due to the
> RAID10 and the snapshots.
While it's not exactly what you're thinking of, have you tried running 
BTRFS in raid1 mode on top of two DM/MD RAID0 volumes?  This provides 
the same degree of data integrity that BTRFS raid10 does, but gets 
measurably better performance.
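
A rough sketch (device names are examples, adjust to your layout):

   mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
   mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
   mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
   mount /dev/md0 /mnt/pool

BTRFS then mirrors data and metadata across the two RAID0 stripes.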
>
> It would be fantastic if we could just add 4 SSD's to the pool and btrfs
> would just magically prefer to put often accessed files there and move
> older or less popular files to the HDD's.
>
> In my simple mind this can not be done easily using bcache as that would
> require completely rebuilding the file system on top of bcache (can not
> just add a few SSD's to the pool), while implementing a cache inside btrfs
> is probably a complex thing with lots of overhead.
You may want to look into dm-cache, as that doesn't require reformatting 
the source device.  It doesn't quite get the same performance as bcache, 
but for me at least, the lower performance is a reasonable trade-off for 
being able to easily convert a device to use it, and being able to 
easily convert away from it if need be.
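
If the data already sits on LVM, the lvmcache wrapper is the usual way 
to drive dm-cache; something along these lines (VG/LV names and the 
cache size are just examples):

   pvcreate /dev/sdf1
   vgextend vg0 /dev/sdf1
   lvcreate --type cache-pool -L 100G -n datacache vg0 /dev/sdf1
   lvconvert --type cache --cachepool vg0/datacache vg0/data
   # and to drop the cache again later:
   # lvconvert --uncache vg0/data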
>
> Simply telling the allocator to prefer new files to go to the ssd and
> move away unpopular stuff to hdd during balance should do the trick, or am
> I wrong?
In theory this would work as a first implementation, but it would need 
to have automatic data migration as an option to be considered 
practical, and that's not as easy to do correctly.



* Re: Hot data tracking / hybrid storage
  2016-05-15 23:05   ` Kai Krakow
@ 2016-05-17  6:27     ` Ferry Toth
  2016-05-17 11:32       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: Ferry Toth @ 2016-05-17  6:27 UTC (permalink / raw)
  To: linux-btrfs

On Mon, 16 May 2016 01:05:24 +0200, Kai Krakow wrote:

> Am Sun, 15 May 2016 21:11:11 +0000 (UTC)
> schrieb Duncan <1i5t5.duncan@cox.net>:
> 
>> Ferry Toth posted on Sun, 15 May 2016 12:12:09 +0000 as excerpted:
>> 
<snip>
> 
> You can go there with only one additional HDD as temporary storage. Just
> connect it, format as bcache, then do a "btrfs dev replace". Now wipe
> that "free" HDD (use wipefs), format as bcache, then... well, you get
> the point. At the last step, remove the remaining HDD. Now add your
> SSDs, format as caching device, and attach each individual HDD backing
> bcache to each SSD caching bcache.
> 
> Devices don't need to be formatted and created at the same time. I'd
> also recommend to add all SSDs only in the last step to not wear them
> early with writes during device replacement.
> 
> If you want, you can add one additional step to get the temporary hard
> disk back. But why not simply replace the oldest hard disk with the
> newest. Take a look at smartctl to see which is the best candidate.
> 
> I went a similar route but without one extra HDD. I had three HDDs in
> mraid1/draid0 and enough spare space. I just removed one HDD, prepared
> it for bcache, then added it back and removed the next.
> 
That's what I mean, a lot of work. And it's still a cache, with 
unnecessary copying from the ssd to the hdd.

And what happens when either a hdd or ssd starts failing?

> --
> Regards,
> Kai
> 
> Replies to list-only preferred.




* Re: Hot data tracking / hybrid storage
  2016-05-17  6:27     ` Ferry Toth
@ 2016-05-17 11:32       ` Austin S. Hemmelgarn
  2016-05-17 18:33         ` Kai Krakow
  0 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-17 11:32 UTC (permalink / raw)
  To: Ferry Toth, linux-btrfs

On 2016-05-17 02:27, Ferry Toth wrote:
> Op Mon, 16 May 2016 01:05:24 +0200, schreef Kai Krakow:
>
>> Am Sun, 15 May 2016 21:11:11 +0000 (UTC)
>> schrieb Duncan <1i5t5.duncan@cox.net>:
>>
>>> Ferry Toth posted on Sun, 15 May 2016 12:12:09 +0000 as excerpted:
>>>
> <snip>
>>
>> You can go there with only one additional HDD as temporary storage. Just
>> connect it, format as bcache, then do a "btrfs dev replace". Now wipe
>> that "free" HDD (use wipefs), format as bcache, then... well, you get
>> the point. At the last step, remove the remaining HDD. Now add your
>> SSDs, format as caching device, and attach each individual HDD backing
>> bcache to each SSD caching bcache.
>>
>> Devices don't need to be formatted and created at the same time. I'd
>> also recommend to add all SSDs only in the last step to not wear them
>> early with writes during device replacement.
>>
>> If you want, you can add one additional step to get the temporary hard
>> disk back. But why not simply replace the oldest hard disk with the
>> newest. Take a look at smartctl to see which is the best candidate.
>>
>> I went a similar route but without one extra HDD. I had three HDDs in
>> mraid1/draid0 and enough spare space. I just removed one HDD, prepared
>> it for bcache, then added it back and removed the next.
>>
> That's what I mean, a lot of work. And it's still a cache, with
> unnecessary copying from the ssd to the hdd.
On the other hand, it's actually possible to do this all online with 
BTRFS because of the reshaping and device replacement tools.

In fact, I've done even more complex reprovisioning online before (for 
example, my home server system has 2 SSD's and 4 HDD's running BTRFS on 
top of LVM, and I've at least twice completely recreated the LVM layer 
online without any data loss and with minimal performance degradation).
>
> And what happens when either a hdd or ssd starts failing?
I have absolutely no idea how bcache handles this, but I doubt it's any 
better than BTRFS.


* Re: Hot data tracking / hybrid storage
  2016-05-17 11:32       ` Austin S. Hemmelgarn
@ 2016-05-17 18:33         ` Kai Krakow
  2016-05-18 22:44           ` Ferry Toth
  0 siblings, 1 reply; 26+ messages in thread
From: Kai Krakow @ 2016-05-17 18:33 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 17 May 2016 07:32:11 -0400,
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> On 2016-05-17 02:27, Ferry Toth wrote:
> > Op Mon, 16 May 2016 01:05:24 +0200, schreef Kai Krakow:
> >  
> >> Am Sun, 15 May 2016 21:11:11 +0000 (UTC)
> >> schrieb Duncan <1i5t5.duncan@cox.net>:
> >>  
>  [...]  
> > <snip>  
> >>
> >> You can go there with only one additional HDD as temporary
> >> storage. Just connect it, format as bcache, then do a "btrfs dev
> >> replace". Now wipe that "free" HDD (use wipefs), format as bcache,
> >> then... well, you get the point. At the last step, remove the
> >> remaining HDD. Now add your SSDs, format as caching device, and
> >> attach each individual HDD backing bcache to each SSD caching
> >> bcache.
> >>
> >> Devices don't need to be formatted and created at the same time.
> >> I'd also recommend to add all SSDs only in the last step to not
> >> wear them early with writes during device replacement.
> >>
> >> If you want, you can add one additional step to get the temporary
> >> hard disk back. But why not simply replace the oldest hard disk
> >> with the newest. Take a look at smartctl to see which is the best
> >> candidate.
> >>
> >> I went a similar route but without one extra HDD. I had three HDDs
> >> in mraid1/draid0 and enough spare space. I just removed one HDD,
> >> prepared it for bcache, then added it back and removed the next.
> >>  
> > That's what I mean, a lot of work. And it's still a cache, with
> > unnecessary copying from the ssd to the hdd.  
> On the other hand, it's actually possible to do this all online with 
> BTRFS because of the reshaping and device replacement tools.
> 
> In fact, I've done even more complex reprovisioning online before
> (for example, my home server system has 2 SSD's and 4 HDD's, running
> BTRFS on top of LVM, I've at least twice completely recreated the LVM
> layer online without any data loss and minimal performance
> degradation).
> >
> > And what happens when either a hdd or ssd starts failing?  
> I have absolutely no idea how bcache handles this, but I doubt it's
> any better than BTRFS.

Bcache should in theory fall back to write-through as soon as an error
counter exceeds a threshold. This is adjustable with the sysfs
attributes io_error_halflife and io_error_limit. Tho I never tried what
actually happens when either the HDD (in bcache writeback mode) or the
SSD fails. Actually, btrfs should be able to handle this (tho, according
to list reports, it doesn't handle errors very well at this point).

BTW: Unnecessary copying from SSD to HDD doesn't take place in bcache's
default mode: it only copies from SSD to HDD in writeback mode (data is
written to the cache first, then persisted to HDD in the background).
You can also use "write through" (data is written to SSD and persisted
to HDD at the same time, reporting persistence to the application only
when both copies were written) and "write around" mode (data is written
to HDD only, and only reads are written to the SSD cache device).

If you want bcache to behave as a huge IO scheduler for writes, use
writeback mode. If you have write-intensive applications, you may want
to choose write-around to not wear out the SSDs early. If you want
writes to be cached for later reads, you can choose write-through mode.
The latter two modes will ensure written data is always persisted to
HDD with the same guarantees you had without bcache. Write-through is
the default and should not change the behavior of btrfs if the HDD
fails, and if the SSD fails bcache would simply turn off and fall back
to the HDD.
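
The mode can be inspected and changed at runtime through sysfs, e.g.
(bcache0 being the cached device):

  cat /sys/block/bcache0/bcache/cache_mode
  # prints something like: [writethrough] writeback writearound none
  echo writearound > /sys/block/bcache0/bcache/cache_mode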

-- 
Regards,
Kai

Replies to list-only preferred.



* Re: Hot data tracking / hybrid storage
  2016-05-17 18:33         ` Kai Krakow
@ 2016-05-18 22:44           ` Ferry Toth
  2016-05-19 18:09             ` Kai Krakow
  0 siblings, 1 reply; 26+ messages in thread
From: Ferry Toth @ 2016-05-18 22:44 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 17 May 2016 20:33:35 +0200, Kai Krakow wrote:

> Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
> <ahferroin7@gmail.com>:
> 
>> On 2016-05-17 02:27, Ferry Toth wrote:
>> > Op Mon, 16 May 2016 01:05:24 +0200, schreef Kai Krakow:
>> >  
>> >> Am Sun, 15 May 2016 21:11:11 +0000 (UTC)
>> >> schrieb Duncan <1i5t5.duncan@cox.net>:
>> >>  
>>  [...]
>> > <snip>
>> >>
>> >> You can go there with only one additional HDD as temporary storage.
>> >> Just connect it, format as bcache, then do a "btrfs dev replace".
>> >> Now wipe that "free" HDD (use wipefs), format as bcache,
>> >> then... well, you get the point. At the last step, remove the
>> >> remaining HDD. Now add your SSDs, format as caching device, and
>> >> attach each individual HDD backing bcache to each SSD caching
>> >> bcache.
>> >>
>> >> Devices don't need to be formatted and created at the same time. I'd
>> >> also recommend to add all SSDs only in the last step to not wear
>> >> them early with writes during device replacement.
>> >>
>> >> If you want, you can add one additional step to get the temporary
>> >> hard disk back. But why not simply replace the oldest hard disk with
>> >> the newest. Take a look at smartctl to see which is the best
>> >> candidate.
>> >>
>> >> I went a similar route but without one extra HDD. I had three HDDs
>> >> in mraid1/draid0 and enough spare space. I just removed one HDD,
>> >> prepared it for bcache, then added it back and removed the next.
>> >>  
>> > That's what I mean, a lot of work. And it's still a cache, with
>> > unnecessary copying from the ssd to the hdd.
>> On the other hand, it's actually possible to do this all online with
>> BTRFS because of the reshaping and device replacement tools.
>> 
>> In fact, I've done even more complex reprovisioning online before (for
>> example, my home server system has 2 SSD's and 4 HDD's, running BTRFS
>> on top of LVM, I've at least twice completely recreated the LVM layer
>> online without any data loss and minimal performance degradation).
>> >
>> > And what happens when either a hdd or ssd starts failing?
>> I have absolutely no idea how bcache handles this, but I doubt it's any
>> better than BTRFS.
> 
> Bcache should in theory fall back to write-through as soon as an error
> counter exceeds a threshold. This is adjustable with sysfs
> io_error_halftime and io_error_limit. Tho I never tried what actually
> happens when either the HDD (in bcache writeback-mode) or the SSD fails.
> Actually, btrfs should be able to handle this (tho, according to list
> reports, it doesn't handle errors very well at this point).
> 
> BTW: Unnecessary copying from SSD to HDD doesn't take place in bcache
> default mode: It only copies from HDD to SSD in writeback mode (data is
> written to the cache first, then persisted to HDD in the background).
> You can also use "write through" (data is written to SSD and persisted
> to HDD at the same time, reporting persistence to the application only
> when both copies were written) and "write around" mode (data is written
> to HDD only, and only reads are written to the SSD cache device).
> 
> If you want bcache behave as a huge IO scheduler for writes, use
> writeback mode. If you have write-intensive applications, you may want
> to choose write-around to not wear out the SSDs early. If you want
> writes to be cached for later reads, you can choose write-through mode.
> The latter two modes will ensure written data is always persisted to HDD
> with the same guaranties you had without bcache. The last mode is
> default and should not change behavior of btrfs if the HDD fails, and if
> the SSD fails bcache would simply turn off and fall back to HDD.
> 

Hello Kai,

Yeah, lots of modes. So that means, none works well for all cases?

Our server has lots of old files on smb (various sizes), imap (10000's 
of small files, 1000's of large ones), a postgresql server, virtualbox 
images (large) and 50 or so snapshots, and running synaptic for system 
upgrades is painfully slow.

We expect the slowness to be caused by fsyncs, which appear to be much 
worse on a raid10 with snapshots. Presumably the whole thing would be 
fast enough with ssd's, but that would not be very cost efficient.

All the overhead of the cache layer could be avoided if btrfs would just 
prefer to write small, hot files to the ssd in the first place and clean 
up while balancing. A combination of 2 ssd's and 4 hdd's would be very 
nice (the mobo has 6 x sata, which is pretty common).

Moreover, increasing the ssd size in the future would then be just as 
simple as replacing a disk with a larger one.

I think many would sign up for such a low maintenance, efficient setup 
that doesn't require a PhD in IT to think out and configure.

Even at home, I would just throw in a low cost ssd next to the hdd if it 
were as simple as a device add. But I wouldn't want to store my 
photo/video collection on just ssd - too expensive.

> Regards,
> Kai
> 
> Replies to list-only preferred.




* Re: Hot data tracking / hybrid storage
  2016-05-18 22:44           ` Ferry Toth
@ 2016-05-19 18:09             ` Kai Krakow
  2016-05-19 18:51               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: Kai Krakow @ 2016-05-19 18:09 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 18 May 2016 22:44:55 +0000 (UTC),
Ferry Toth <ftoth@exalondelft.nl> wrote:

> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
> 
> > Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
> > <ahferroin7@gmail.com>:
> >   
> >> On 2016-05-17 02:27, Ferry Toth wrote:  
>  [...]  
>  [...]  
> >>  [...]  
>  [...]  
>  [...]  
>  [...]  
> >> On the other hand, it's actually possible to do this all online
> >> with BTRFS because of the reshaping and device replacement tools.
> >> 
> >> In fact, I've done even more complex reprovisioning online before
> >> (for example, my home server system has 2 SSD's and 4 HDD's,
> >> running BTRFS on top of LVM, I've at least twice completely
> >> recreated the LVM layer online without any data loss and minimal
> >> performance degradation).  
>  [...]  
> >> I have absolutely no idea how bcache handles this, but I doubt
> >> it's any better than BTRFS.  
> > 
> > Bcache should in theory fall back to write-through as soon as an
> > error counter exceeds a threshold. This is adjustable with sysfs
> > io_error_halftime and io_error_limit. Tho I never tried what
> > actually happens when either the HDD (in bcache writeback-mode) or
> > the SSD fails. Actually, btrfs should be able to handle this (tho,
> > according to list reports, it doesn't handle errors very well at
> > this point).
> > 
> > BTW: Unnecessary copying from SSD to HDD doesn't take place in
> > bcache default mode: It only copies from HDD to SSD in writeback
> > mode (data is written to the cache first, then persisted to HDD in
> > the background). You can also use "write through" (data is written
> > to SSD and persisted to HDD at the same time, reporting persistence
> > to the application only when both copies were written) and "write
> > around" mode (data is written to HDD only, and only reads are
> > written to the SSD cache device).
> > 
> > If you want bcache behave as a huge IO scheduler for writes, use
> > writeback mode. If you have write-intensive applications, you may
> > want to choose write-around to not wear out the SSDs early. If you
> > want writes to be cached for later reads, you can choose
> > write-through mode. The latter two modes will ensure written data
> > is always persisted to HDD with the same guaranties you had without
> > bcache. The last mode is default and should not change behavior of
> > btrfs if the HDD fails, and if the SSD fails bcache would simply
> > turn off and fall back to HDD. 
> 
> Hello Kai,
> 
> Yeah, lots of modes. So that means, none works well for all cases?

Just three, and they all work well. It's just a trade-off between wear
and performance/safety. Depending on your workload you might benefit
more or less from write-back caching - that's when you want to turn the
knob. Everything else works out of the box. In case of an SSD failure,
write-back is just less safe, while the other two modes should keep your
FS intact in that case.

> Our server has lots of old files, on smb (various size), imap
> (10000's small, 1000's large), postgresql server, virtualbox images
> (large), 50 or so snapshots and running synaptics for system upgrades
> is painfully slow. 

I don't think that bcache even cares to cache imap accesses to mail
bodies - it won't help performance. Network is usually much slower than
SSD access. But it will cache fs meta data which will improve imap
performance a lot.

> We are expecting slowness to be caused by fsyncs which appear to be
> much worse on a raid10 with snapshots. Presumably the whole thing
> would be fast enough with ssd's but that would be not very cost
> efficient.
> 
> All the overhead of the cache layer could be avoided if btrfs would
> just prefer to write small, hot, files to the ssd in the first place
> and clean up while balancing. A combination of 2 ssd's and 4 hdd's
> would be very nice (the mobo has 6 x sata, which is pretty common)

Well, I don't want to advertise bcache. But there's nothing you
couldn't do with it in your particular case:

Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
here, you can use 1:n where n is the number of backing devices. There's
no need to clean up using balancing because bcache will track hot data
by default. You just have to decide what balance between SSD wear and
performance you prefer. If slow fsyncs are your primary concern, I'd
go with write-back caching. The small file contents are probably not
your performance problem anyway, but rather the metadata management
btrfs has to do in the background. Bcache will help a lot here,
especially in write-back mode. I'd recommend against running balance
too often and too intensively (don't use too big usage% filters); it
will invalidate your block cache and probably also invalidate bcache if
bcache is too small. It will hurt performance more than you gain. You
may want to increase nr_requests in the IO scheduler for your situation.
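
For completeness, attaching both backing devices to the same cache set
and bumping the queue depth would look roughly like this (UUID and
device names are placeholders):

  bcache-super-show /dev/sdf | grep cset.uuid
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach
  echo 512 > /sys/block/sda/queue/nr_requests   # per backing HDD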

> Moreover increasing the ssd's size in the future would then be just
> as simple as replacing a disk by a larger one.

It's as simple as detaching the HDDs from the caching SSD, replacing
it, and reattaching them. It can be done online without a reboot. SATA
is usually hotpluggable nowadays.

> I think many would sign up for such a low maintenance, efficient
> setup that doesn't require a PhD in IT to think out and configure.

Bcache is actually low maintenance, no knobs to turn. Converting to
bcache protective superblocks is a one-time procedure which can be done
online. The bcache devices act as normal HDDs if not attached to a
caching SSD. It's really less pain than you may think. And it's a
solution available now. Converting back later is easy: just detach the
HDDs from the SSDs and use them for some other purpose if you feel like
it later. Having the bcache protective superblock still in place doesn't
hurt then. Bcache is a no-op without a caching device attached.

> Even at home, I would just throw in a low cost ssd next to the hdd if
> it was as simple as device add. But I wouldn't want to store my
> photo/video collection on just ssd, too expensive.

Bcache won't store your photos if you just copied them: large copy
operations (like backups) and sequential access are detected and
bypassed by bcache. It won't invalidate your valuable "hot data" in the
cache. It works really well.
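
If you want to see or tune what bcache treats as "sequential", that's
the sequential_cutoff knob (about 4 MB by default):

  cat /sys/block/bcache0/bcache/sequential_cutoff
  echo 0 > /sys/block/bcache0/bcache/sequential_cutoff   # 0 = cache everything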

I'd even recommend formatting filesystems with the bcache protective
superblock (i.e. formatting the backing device) even if you're not gonna
use caching and not gonna insert an SSD now, just to have the option
easily available in the future without much hassle.

I don't think native hot data tracking will land in btrfs anytime soon
(read: in the next 5 years). Bcache is a general purpose solution for
all filesystems that works now (and properly works).

You maybe want to clone your current system and try to integrate bcache
to see the benefits. There's actually a really big impact on
performance from my testing (home machine, 3x 1TB HDD btrfs mraid1
draid0, 1x 500GB SSD as cache, hit rate >90%, cache utilization ~70%,
boot time improvement ~400%, application startup times almost instant,
workload: MariaDB development server, git usage, 3 nspawn containers,
VirtualBox Windows 7 + XP VMs, Steam gaming, daily rsync backups, btrfs
60% filled).

I'd recommend not using too small an SSD because it wears out very fast
when used as a cache (I think that generally applies and is not bcache
specific). My old 120GB SSD was specified for 85TB of writes, and it was
worn out after 12 months of bcache usage, which included 2 complete
backup restores, multiple scrubs (which relocate and rewrite every data
block), and weekly balances with relatime enabled. I've since switched
to noatime+nossd, completely stopped using balance and haven't used
scrub since, with the result of vastly reduced write accesses to the
caching SSD. This setup is able to write bursts of 800MB/s to the disks
and read up to 800MB/s from them (if btrfs can properly distribute
reads to all disks). Bootchart shows up to 600 MB/s during cold booting
(with a warmed SSD cache). My nspawn containers boot in 1-2 seconds and
do not add to the normal boot time at all (they are autostarted during
boot: 1x MySQL, 1x ElasticSearch, 1x idle/spare/testing container).
This is really impressive for a home machine, and c'mon: 3x 1TB HDD +
1x 500GB SSD is not that expensive nowadays. If you still prefer a
low-end SSD, I'd recommend using write-around only, from my own
experience.

The cache usage of the 120GB SSD was 100% with a 70-80% hit rate, which
means it was constantly rewriting stuff. The 500GB one (which I use now)
is a little underutilized, but almost no writes happen after warming up,
so it's mostly a hot-data read cache (although I configured it as
write-back). Plus, bigger SSDs are usually faster - especially for write
ops.

Conclusion: btrfs + bcache make a very good pair. Btrfs is not really
optimized for low latency and that's where bcache comes in. Operating
noise from the HDDs also drops a lot as soon as bcache is warmed up.

BTW: If deployed, keep an eye on your SSD wearing (using smartctl). But
given you are using btrfs, you keep backups anyways. ;-)
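
The relevant smartctl output is vendor specific, but something along
these lines usually shows the wear figures (Total_LBAs_Written is
typically counted in 512-byte units, so multiply by 512 for bytes):

  smartctl -A /dev/sdf | grep -Ei 'wear|lifetime|percent|lbas_written'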

-- 
Regards,
Kai

Replies to list-only preferred.



* Re: Hot data tracking / hybrid storage
  2016-05-19 18:09             ` Kai Krakow
@ 2016-05-19 18:51               ` Austin S. Hemmelgarn
  2016-05-19 21:01                 ` Kai Krakow
  2016-05-19 23:23                 ` Henk Slager
  0 siblings, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-19 18:51 UTC (permalink / raw)
  To: linux-btrfs

On 2016-05-19 14:09, Kai Krakow wrote:
> Am Wed, 18 May 2016 22:44:55 +0000 (UTC)
> schrieb Ferry Toth <ftoth@exalondelft.nl>:
>
>> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
>>
>>> Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
>>> <ahferroin7@gmail.com>:
>>>
>>>> On 2016-05-17 02:27, Ferry Toth wrote:
>>  [...]
>>  [...]
>>>>  [...]
>>  [...]
>>  [...]
>>  [...]
>>>> On the other hand, it's actually possible to do this all online
>>>> with BTRFS because of the reshaping and device replacement tools.
>>>>
>>>> In fact, I've done even more complex reprovisioning online before
>>>> (for example, my home server system has 2 SSD's and 4 HDD's,
>>>> running BTRFS on top of LVM, I've at least twice completely
>>>> recreated the LVM layer online without any data loss and minimal
>>>> performance degradation).
>>  [...]
>>>> I have absolutely no idea how bcache handles this, but I doubt
>>>> it's any better than BTRFS.
>>>
>>> Bcache should in theory fall back to write-through as soon as an
>>> error counter exceeds a threshold. This is adjustable with sysfs
>>> io_error_halftime and io_error_limit. Tho I never tried what
>>> actually happens when either the HDD (in bcache writeback-mode) or
>>> the SSD fails. Actually, btrfs should be able to handle this (tho,
>>> according to list reports, it doesn't handle errors very well at
>>> this point).
>>>
>>> BTW: Unnecessary copying from SSD to HDD doesn't take place in
>>> bcache default mode: It only copies from HDD to SSD in writeback
>>> mode (data is written to the cache first, then persisted to HDD in
>>> the background). You can also use "write through" (data is written
>>> to SSD and persisted to HDD at the same time, reporting persistence
>>> to the application only when both copies were written) and "write
>>> around" mode (data is written to HDD only, and only reads are
>>> written to the SSD cache device).
>>>
>>> If you want bcache behave as a huge IO scheduler for writes, use
>>> writeback mode. If you have write-intensive applications, you may
>>> want to choose write-around to not wear out the SSDs early. If you
>>> want writes to be cached for later reads, you can choose
>>> write-through mode. The latter two modes will ensure written data
>>> is always persisted to HDD with the same guaranties you had without
>>> bcache. The last mode is default and should not change behavior of
>>> btrfs if the HDD fails, and if the SSD fails bcache would simply
>>> turn off and fall back to HDD.
>>
>> Hello Kai,
>>
>> Yeah, lots of modes. So that means, none works well for all cases?
>
> Just three, and they all work well. It's just a decision wearing vs.
> performance/safety. Depending on your workload you might benefit more or
> less from write-behind caching - that's when you want to turn the knob.
> Everything else works out of the box. In case of an SSD failure,
> write-back is just less safe while the other two modes should keep your
> FS intact in that case.
>
>> Our server has lots of old files, on smb (various size), imap
>> (10000's small, 1000's large), postgresql server, virtualbox images
>> (large), 50 or so snapshots and running synaptics for system upgrades
>> is painfully slow.
>
> I don't think that bcache even cares to cache imap accesses to mail
> bodies - it won't help performance. Network is usually much slower than
> SSD access. But it will cache fs meta data which will improve imap
> performance a lot.
Bcache caches anything that falls within its heuristics as a candidate 
for caching.  It pays no attention to what type of data you're 
accessing, just the access patterns.  This is also the case for 
dm-cache, and for Windows ReadyBoost (or whatever the hell they're 
calling it these days).  Unless you're shifting very big e-mails, it's 
pretty likely that ones that get accessed more than once in a short 
period of time will end up being cached.
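
You can check how well that works out from the statistics bcache 
exposes per cached device, for example:

   cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio
   cat /sys/block/bcache0/bcache/stats_total/bypassed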
>
>> We are expecting slowness to be caused by fsyncs which appear to be
>> much worse on a raid10 with snapshots. Presumably the whole thing
>> would be fast enough with ssd's but that would be not very cost
>> efficient.
>>
>> All the overhead of the cache layer could be avoided if btrfs would
>> just prefer to write small, hot, files to the ssd in the first place
>> and clean up while balancing. A combination of 2 ssd's and 4 hdd's
>> would be very nice (the mobo has 6 x sata, which is pretty common)
>
> Well, I don't want to advertise bcache. But there's nothing you
> couldn't do with it in your particular case:
>
> Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
> here, you can use 1:n where n is the backing devices. There's no need
> to clean up using balancing because bcache will track hot data by
> default. You just have to decide which balance between wearing the SSD
> vs. performance you prefer. If slow fsyncs are you primary concern, I'd
> go with write-back caching. The small file contents are propably not
> your performance problem anyways but the meta data management btrfs has
> to do in the background. Bcache will help a lot here, especially in
> write-back mode. I'd recommend against using balance too often and too
> intensive (don't use too big usage% filters), it will invalidate your
> block cache and probably also invalidate bcache if bcache is too small.
> It will hurt performance more than you gain. You may want to increase
> nr_requests in the IO scheduler for your situation.
This may not perform as well as you would think, depending on your 
configuration.  If things are in raid1 (or raid10) mode on the BTRFS 
side, then you can end up caching duplicate data (and on some workloads, 
you're almost guaranteed to cache duplicate data), which is a bigger 
issue when you're sharing a cache between devices, because it means they 
are competing for cache space.
>
>> Moreover increasing the ssd's size in the future would then be just
>> as simple as replacing a disk by a larger one.
>
> It's as simple as detaching the HDDs from the caching SSD, replace it,
> reattach it. It can be done online without reboot. SATA is usually
> hotpluggable nowadays.
>
>> I think many would sign up for such a low maintenance, efficient
>> setup that doesn't require a PhD in IT to think out and configure.
>
> Bcache is actually low maintenance, no knobs to turn. Converting to
> bcache protective superblocks is a one-time procedure which can be done
> online. The bcache devices act as normal HDD if not attached to a
> caching SSD. It's really less pain than you may think. And it's a
> solution available now. Converting back later is easy: Just detach the
> HDDs from the SSDs and use them for some other purpose if you feel so
> later. Having the bcache protective superblock still in place doesn't
> hurt then. Bcache is a no-op without caching device attached.
No, bcache is _almost_ a no-op without a caching device.  From a 
userspace perspective, it does nothing, but it is still another layer of 
indirection in the kernel, which does have a small impact on 
performance.  The same is true of using LVM with a single volume taking 
up the entire partition: it looks almost no different from just using 
the partition, but it will perform worse than using the partition 
values for the overhead, and while bcache with no cache device is not as 
bad as the LVM example, it can still be a roughly 0.5-2% slowdown (it 
gets more noticeable the faster your backing storage is).

You also lose the ability to mount that filesystem directly on a kernel 
without bcache support (this may or may not be an issue for you).
>
>> Even at home, I would just throw in a low cost ssd next to the hdd if
>> it was as simple as device add. But I wouldn't want to store my
>> photo/video collection on just ssd, too expensive.
>
> Bcache won't store your photos if you copied them: Large copy
> operations (like backups) and sequential access is detected and bypassed
> by bcache. It won't invalidate your valuable "hot data" in the cache.
> It works really well.
>
> I'd even recommend to format filesystems with bcache protective
> superblock (aka format backing device) even if you not gonna use
> caching and not gonna insert an SSD now, just to have the option for
> the future easily and without much hassle.
>
> I don't think native hot data tracking will land in btrfs anytime soon
> (read: in the next 5 years). Bcache is a general purpose solution for
> all filesystems that works now (and properly works).
>
> You maybe want to clone your current system and try to integrate bcache
> to see the benefits. There's actually a really big impact on
> performance from my testing (home machine, 3x 1TB HDD btrfs mraid1
> draid0, 1x 500GB SSD as cache, hit rate >90%, cache utilization ~70%,
> boot time improvement ~400%, application startup times almost instant,
> workload: MariaDB development server, git usage, 3 nspawn containers,
> VirtualBox Windows 7 + XP VMs, Steam gaming, daily rsync backups, btrfs
> 60% filled).
>
> I'd recommend to not use a too small SSD because it wears out very fast
> when used as cache (I think that generally applies and is not bcache
> specific). My old 120GB SSD was specified for 85TB write performance,
> and it was worn out after 12 months of bcache usage, which included 2
> complete backup restores, multiple scrubs (which relocates and rewrites
> every data block), and weekly balances with relatime enabled. I've
> since used noatime+nossd, completely stopped using balance and never
> used scrub yet, with the result of vastly reduced write accesses to the
> caching SSD. This setup is able to write bursts of 800MB/s to the disk
> and read up to 800MB/s from disk (if btrfs can properly distribute
> reads to all disks). Bootchart shows up to 600 MB/s during cold booting
> (with warmed SSD cache). My nspawn containers boot in 1-2 seconds and
> do not add to the normal boot time at all (they are autostarted during
> boot, 1x MySQL, 1x ElasticSearch, 1x idle/spare/testing container).
> This is really impressive for a home machine, and c'mon: 3x 1TB HDD +
> 1x 500GB SSD is not that expensive nowadays. If you still prefer a
> low-end SSD I'd recommend to use write-around only from my own
> experience.
>
> The cache usage of the 120GB of 100% with 70-80% hit rate, which means
> it was constantly rewriting stuff. 500GB (which I use now) is a little
> underutilized now but almost no writes happen after warming up, so it's
> mostly a hot-data read cache (although I configured it as write-back).
> Plus, bigger SSDs are usually faster - especially for write ops.
>
> Conclusion: Btrfs + bcache make a very good pair. Btrfs is not really
> optimized for good latency and that's where bcache comes in. Operating
> noise from HDD reduces a lot as soon as bcache is warmed up.
>
> BTW: If deployed, keep an eye on your SSD wearing (using smartctl). But
> given you are using btrfs, you keep backups anyways. ;-)
Any decent SSD (read as 'any SSD of a major brand other than OCZ that 
you bought from a reputable source') will still take years to wear out 
unless you're constantly re-writing things and not using discard/trim 
support (and bcache does use discard).   Even if you're not using 
discard/trim, the typical wear-out point is well over 100x the size of 
the SSD for the good consumer devices.  For a point of reference, I've 
got a pair of 250GB Crucial MX100's (they cost less than 0.50 USD per GB 
when I got them and provide essentially the same power-loss protections 
that the high end Intel SSD's do) which have seen more than 2.5TB of 
data writes over their lifetime, combined from at least three different 
filesystem formats (BTRFS, FAT32, and ext4), swap space, and LVM 
management, and the wear-leveling indicator on each still says they have 
100% life remaining.  The similar 500GB one I just recently upgraded in 
my laptop had seen over 50TB of writes and was still saying 95% life 
remaining (and had been for months).


* Re: Hot data tracking / hybrid storage
  2016-05-19 18:51               ` Austin S. Hemmelgarn
@ 2016-05-19 21:01                 ` Kai Krakow
  2016-05-20 11:46                   ` Austin S. Hemmelgarn
  2016-05-19 23:23                 ` Henk Slager
  1 sibling, 1 reply; 26+ messages in thread
From: Kai Krakow @ 2016-05-19 21:01 UTC (permalink / raw)
  To: linux-btrfs

On Thu, 19 May 2016 14:51:01 -0400,
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> For a point of reference, I've 
> got a pair of 250GB Crucial MX100's (they cost less than 0.50 USD per
> GB when I got them and provide essentially the same power-loss
> protections that the high end Intel SSD's do) which have seen more
> than 2.5TB of data writes over their lifetime, combined from at least
> three different filesystem formats (BTRFS, FAT32, and ext4), swap
> space, and LVM management, and the wear-leveling indicator on each
> still says they have 100% life remaining, and the similar 500GB one I
> just recently upgraded in my laptop had seen over 50TB of writes and
> was still saying 95% life remaining (and had been for months).

The smaller Crucials are much worse at that: the MX100 128GB version I
had was specified for 85TB of writes, which I hit after about 12 months
(97% lifetime used according to smartctl) due to excessive write
patterns. I'm not sure how long it would have lasted, but I decided to
swap it for a 500GB Samsung drive and reconfigure my system for far
fewer writes.

What can I say: I liked the Crucial more. First, it has an easy lifetime
counter in smartctl, which the Samsung doesn't. And it had power-loss
protection, which Samsung doesn't explicitly mention (tho I think it has
it).

At least, according to endurance tests, my Samsung SSD should take
about 1 PB of writes. I've already written 7 TB if I can trust the
smartctl raw value.

But I think you cannot compare specification values to a real endurance
test... I think it says 150TBW for 500GB 850 EVO.

-- 
Regards,
Kai

Replies to list-only preferred.



* Re: Hot data tracking / hybrid storage
  2016-05-19 18:51               ` Austin S. Hemmelgarn
  2016-05-19 21:01                 ` Kai Krakow
@ 2016-05-19 23:23                 ` Henk Slager
  2016-05-20 12:03                   ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 26+ messages in thread
From: Henk Slager @ 2016-05-19 23:23 UTC (permalink / raw)
  To: linux-btrfs

On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-05-19 14:09, Kai Krakow wrote:
>>
>> Am Wed, 18 May 2016 22:44:55 +0000 (UTC)
>> schrieb Ferry Toth <ftoth@exalondelft.nl>:
>>
>>> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
>>>
>>>> Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
>>>> <ahferroin7@gmail.com>:
>>>>
>>>>> On 2016-05-17 02:27, Ferry Toth wrote:
>>>
>>>  [...]
>>>  [...]
>>>>>
>>>>>  [...]
>>>
>>>  [...]
>>>  [...]
>>>  [...]
>>>>>
>>>>> On the other hand, it's actually possible to do this all online
>>>>> with BTRFS because of the reshaping and device replacement tools.
>>>>>
>>>>> In fact, I've done even more complex reprovisioning online before
>>>>> (for example, my home server system has 2 SSD's and 4 HDD's,
>>>>> running BTRFS on top of LVM, I've at least twice completely
>>>>> recreated the LVM layer online without any data loss and minimal
>>>>> performance degradation).
>>>
>>>  [...]
>>>>>
>>>>> I have absolutely no idea how bcache handles this, but I doubt
>>>>> it's any better than BTRFS.
>>>>
>>>>
>>>> Bcache should in theory fall back to write-through as soon as an
>>>> error counter exceeds a threshold. This is adjustable with sysfs
>>>> io_error_halftime and io_error_limit. Tho I never tried what
>>>> actually happens when either the HDD (in bcache writeback-mode) or
>>>> the SSD fails. Actually, btrfs should be able to handle this (tho,
>>>> according to list reports, it doesn't handle errors very well at
>>>> this point).
>>>>
>>>> BTW: Unnecessary copying from SSD to HDD doesn't take place in
>>>> bcache default mode: It only copies from HDD to SSD in writeback
>>>> mode (data is written to the cache first, then persisted to HDD in
>>>> the background). You can also use "write through" (data is written
>>>> to SSD and persisted to HDD at the same time, reporting persistence
>>>> to the application only when both copies were written) and "write
>>>> around" mode (data is written to HDD only, and only reads are
>>>> written to the SSD cache device).
>>>>
>>>> If you want bcache behave as a huge IO scheduler for writes, use
>>>> writeback mode. If you have write-intensive applications, you may
>>>> want to choose write-around to not wear out the SSDs early. If you
>>>> want writes to be cached for later reads, you can choose
>>>> write-through mode. The latter two modes will ensure written data
>>>> is always persisted to HDD with the same guaranties you had without
>>>> bcache. The last mode is default and should not change behavior of
>>>> btrfs if the HDD fails, and if the SSD fails bcache would simply
>>>> turn off and fall back to HDD.
>>>
>>>
>>> Hello Kai,
>>>
>>> Yeah, lots of modes. So that means, none works well for all cases?
>>
>>
>> Just three, and they all work well. It's just a decision wearing vs.
>> performance/safety. Depending on your workload you might benefit more or
>> less from write-behind caching - that's when you want to turn the knob.
>> Everything else works out of the box. In case of an SSD failure,
>> write-back is just less safe while the other two modes should keep your
>> FS intact in that case.
>>
>>> Our server has lots of old files, on smb (various size), imap
>>> (10000's small, 1000's large), postgresql server, virtualbox images
>>> (large), 50 or so snapshots and running synaptics for system upgrades
>>> is painfully slow.
>>
>>
>> I don't think that bcache even cares to cache imap accesses to mail
>> bodies - it won't help performance. Network is usually much slower than
>> SSD access. But it will cache fs meta data which will improve imap
>> performance a lot.
>
> Bcache caches anything that falls within it's heuristics as candidates for
> caching.  It pays no attention to what type of data you're accessing, just
> the access patterns.  This is also the case for dm-cache, and for Windows
> ReadyBoost (or whatever the hell they're calling it these days).  Unless
> you're shifting very big e-mails, it's pretty likely that ones that get
> accessed more than once in a short period of time will end up being cached.
>>
>>
>>> We are expecting slowness to be caused by fsyncs which appear to be
>>> much worse on a raid10 with snapshots. Presumably the whole thing
>>> would be fast enough with ssd's but that would be not very cost
>>> efficient.
>>>
>>> All the overhead of the cache layer could be avoided if btrfs would
>>> just prefer to write small, hot, files to the ssd in the first place
>>> and clean up while balancing. A combination of 2 ssd's and 4 hdd's
>>> would be very nice (the mobo has 6 x sata, which is pretty common)
>>
>>
>> Well, I don't want to advertise bcache. But there's nothing you
>> couldn't do with it in your particular case:
>>
>> Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
>> here, you can use 1:n where n is the backing devices. There's no need
>> to clean up using balancing because bcache will track hot data by
>> default. You just have to decide which balance between wearing the SSD
>> vs. performance you prefer. If slow fsyncs are you primary concern, I'd
>> go with write-back caching. The small file contents are propably not
>> your performance problem anyways but the meta data management btrfs has
>> to do in the background. Bcache will help a lot here, especially in
>> write-back mode. I'd recommend against using balance too often and too
>> intensive (don't use too big usage% filters), it will invalidate your
>> block cache and probably also invalidate bcache if bcache is too small.
>> It will hurt performance more than you gain. You may want to increase
>> nr_requests in the IO scheduler for your situation.
>
> This may not perform as well as you would think, depending on your
> configuration.  If things are in raid1 (or raid10) mode on the BTRFS side,
> then you can end up caching duplicate data (and on some workloads, you're
> almost guaranteed to cache duplicate data), which is a bigger issue when
> you're sharing a cache between devices, because it means they are competing
> for cache space.
>>
>>
>>> Moreover increasing the ssd's size in the future would then be just
>>> as simple as replacing a disk by a larger one.
>>
>>
>> It's as simple as detaching the HDDs from the caching SSD, replace it,
>> reattach it. It can be done online without reboot. SATA is usually
>> hotpluggable nowadays.
>>
>>> I think many would sign up for such a low maintenance, efficient
>>> setup that doesn't require a PhD in IT to think out and configure.
>>
>>
>> Bcache is actually low maintenance, no knobs to turn. Converting to
>> bcache protective superblocks is a one-time procedure which can be done
>> online. The bcache devices act as normal HDD if not attached to a
>> caching SSD. It's really less pain than you may think. And it's a
>> solution available now. Converting back later is easy: Just detach the
>> HDDs from the SSDs and use them for some other purpose if you feel so
>> later. Having the bcache protective superblock still in place doesn't
>> hurt then. Bcache is a no-op without caching device attached.
>
> No, bcache is _almost_ a no-op without a caching device.  From a userspace
> perspective, it does nothing, but it is still another layer of indirection
> in the kernel, which does have a small impact on performance.  The same is
> true of using LVM with a single volume taking up the entire partition, it
> looks almost no different from just using the partition, but it will perform
> worse than using the partition directly.  I've actually done profiling of
> both to figure out base values for the overhead, and while bcache with no
> cache device is not as bad as the LVM example, it can still be a roughly
> 0.5-2% slowdown (it gets more noticeable the faster your backing storage
> is).
>
> You also lose the ability to mount that filesystem directly on a kernel
> without bcache support (this may or may not be an issue for you).

The bcache (protective) superblock is in an 8KiB block in front of the
file system device. In case the current, non-bcached HDD's use modern
partitioning, you can do a 5-minute remove or add of bcache, without
moving/copying filesystem data. So in case you have a bcache-formatted
HDD that had just 1 primary partition (512 byte logical sectors), the
partition start is at sector 2048 and the filesystem start is at 2064.
Hard removing bcache (so making sure the module is not
needed/loaded/used on the next boot) can be done by changing the
start-sector of the partition from 2048 to 2064. In gdisk one has to
change the alignment to 16 first, otherwise it refuses. And of
course, first flush+stop+de-register bcache for the HDD.
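[A hedged sketch of that flush/stop/de-register step, assuming the device
is /dev/bcache0; <cset-uuid> is a placeholder, the real one is listed under
/sys/fs/bcache/:

  echo 1 > /sys/block/bcache0/bcache/detach        # detach from the cache set, flushing dirty data first
  echo 1 > /sys/block/bcache0/bcache/stop          # stop the bcache device
  echo 1 > /sys/fs/bcache/<cset-uuid>/unregister   # shut down the cache set itself
  # then move the partition start from 2048 to 2064 in gdisk as described above
]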

The other way around is also possible, i.e. changing the start-sector
from 2048 to 2032. So that makes adding bcache to an existing
filesystem a 5 minute action and not a GBs- or TBs copy action. It is
not online of course, but just one reboot is needed (or just umount,
gdisk, partprobe, add bcache etc).
For RAID setups, one could just do 1 HDD first.

There is also a tool doing the conversion in-place (I haven't used it
myself, my python(s) had trouble; I could do the partition table edit
much faster/easier):
https://github.com/g2p/blocks#bcache-conversion

>>> Even at home, I would just throw in a low cost ssd next to the hdd if
>>> it was as simple as device add. But I wouldn't want to store my
>>> photo/video collection on just ssd, too expensive.
>>
>>
>> Bcache won't store your photos if you copied them: Large copy
>> operations (like backups) and sequential access are detected and bypassed
>> by bcache. It won't invalidate your valuable "hot data" in the cache.
>> It works really well.
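[The sequential bypass mentioned here is also tunable; a small sketch,
assuming /dev/bcache0 and the stock sysfs layout:

  # requests belonging to a sequential stream larger than this are not cached
  cat /sys/block/bcache0/bcache/sequential_cutoff        # typically 4.0M by default
  echo 16M > /sys/block/bcache0/bcache/sequential_cutoff # raise it if large copies still pollute the cache
]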
>>
>> I'd even recommend to format filesystems with bcache protective
>> superblock (aka format backing device) even if you not gonna use
>> caching and not gonna insert an SSD now, just to have the option for
>> the future easily and without much hassle.
>>
>> I don't think native hot data tracking will land in btrfs anytime soon
>> (read: in the next 5 years). Bcache is a general purpose solution for
>> all filesystems that works now (and properly works).
>>
>> You maybe want to clone your current system and try to integrate bcache
>> to see the benefits. There's actually a really big impact on
>> performance from my testing (home machine, 3x 1TB HDD btrfs mraid1
>> draid0, 1x 500GB SSD as cache, hit rate >90%, cache utilization ~70%,
>> boot time improvement ~400%, application startup times almost instant,
>> workload: MariaDB development server, git usage, 3 nspawn containers,
>> VirtualBox Windows 7 + XP VMs, Steam gaming, daily rsync backups, btrfs
>> 60% filled).
>>
>> I'd recommend not using too small an SSD because it wears out very fast
>> when used as a cache (I think that generally applies and is not bcache
>> specific). My old 120GB SSD was specified for 85TB of writes,
>> and it was worn out after 12 months of bcache usage, which included 2
>> complete backup restores, multiple scrubs (which relocates and rewrites
>> every data block), and weekly balances with relatime enabled. I've
>> since used noatime+nossd, completely stopped using balance and never
>> used scrub yet, with the result of vastly reduced write accesses to the
>> caching SSD. This setup is able to write bursts of 800MB/s to the disk
>> and read up to 800MB/s from disk (if btrfs can properly distribute
>> reads to all disks). Bootchart shows up to 600 MB/s during cold booting
>> (with warmed SSD cache). My nspawn containers boot in 1-2 seconds and
>> do not add to the normal boot time at all (they are autostarted during
>> boot, 1x MySQL, 1x ElasticSearch, 1x idle/spare/testing container).
>> This is really impressive for a home machine, and c'mon: 3x 1TB HDD +
>> 1x 500GB SSD is not that expensive nowadays. If you still prefer a
>> low-end SSD I'd recommend to use write-around only from my own
>> experience.
>>
>> The cache usage of the 120GB was 100% with a 70-80% hit rate, which means
>> it was constantly rewriting stuff. 500GB (which I use now) is a little
>> underutilized now but almost no writes happen after warming up, so it's
>> mostly a hot-data read cache (although I configured it as write-back).
>> Plus, bigger SSDs are usually faster - especially for write ops.
>>
>> Conclusion: Btrfs + bcache make a very good pair. Btrfs is not really
>> optimized for good latency and that's where bcache comes in. Operating
>> noise from HDD reduces a lot as soon as bcache is warmed up.
>>
>> BTW: If deployed, keep an eye on your SSD wearing (using smartctl). But
>> given you are using btrfs, you keep backups anyways. ;-)
>
> Any decent SSD (read as 'any SSD of a major brand other than OCZ that you
> bought from a reputable source') will still take years to wear out unless
> you're constantly re-writing things and not using discard/trim support (and
> bcache does use discard).   Even if you're not using discard/trim, the
> typical wear-out point is well over 100x the size of the SSD for the good
> consumer devices.  For a point of reference, I've got a pair of 250GB
> Crucial MX100's (they cost less than 0.50 USD per GB when I got them and
> provide essentially the same power-loss protections that the high end Intel
> SSD's do) which have seen more than 2.5TB of data writes over their
> lifetime, combined from at least three different filesystem formats (BTRFS,
> FAT32, and ext4), swap space, and LVM management, and the wear-leveling
> indicator on each still says they have 100% life remaining, and the similar
> 500GB one I just recently upgraded in my laptop had seen over 50TB of writes
> and was still saying 95% life remaining (and had been for months).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-19 21:01                 ` Kai Krakow
@ 2016-05-20 11:46                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-20 11:46 UTC (permalink / raw)
  To: linux-btrfs

On 2016-05-19 17:01, Kai Krakow wrote:
> Am Thu, 19 May 2016 14:51:01 -0400
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>> For a point of reference, I've
>> got a pair of 250GB Crucial MX100's (they cost less than 0.50 USD per
>> GB when I got them and provide essentially the same power-loss
>> protections that the high end Intel SSD's do) which have seen more
>> than 2.5TB of data writes over their lifetime, combined from at least
>> three different filesystem formats (BTRFS, FAT32, and ext4), swap
>> space, and LVM management, and the wear-leveling indicator on each
>> still says they have 100% life remaining, and the similar 500GB one I
>> just recently upgraded in my laptop had seen over 50TB of writes and
>> was still saying 95% life remaining (and had been for months).
Correction: I hadn't checked in several months, and the 250G ones have 
actually seen about 6.336TB of writes, report 90% remaining life, and 
have about 240 days of power-on time.  This works out to roughly 775MB 
of writes per hour.  Assuming similar write rates for the rest of the 
SSD's life, I can still expect roughly 9 more years of service from 
these, which means about 10 years of life total given my usage.  That is 
well beyond what I typically get from a traditional hard disk at the same 
price, and far exceeds the typical usable life of most desktops, laptops, 
and even some workstation computers.
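[For anyone wanting to check their own drives, these counters come
straight from SMART; a quick sketch (attribute names vary by vendor,
the ones below are typical for Crucial/Micron and Samsung drives):

  smartctl -A /dev/sda | grep -Ei 'percent_lifetime|wear_level|total_lbas_written'
]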

And you have to also keep in mind, this 775MB/hour of writes is coming 
from a system that is running:
* BOINC distributed computing applications (regularly downloading big 
files, and almost constantly writing data)
* Dropbox
* Software builds for almost a dozen different systems (I use Gentoo, so 
_everything_ is built locally)
* Regression testing for BTRFS
* Basic network services (DHCP, DNS, and similar things)
* A tor entry node
* A local mail server (store and forward only, I just use it for 
monitoring messages)
And all of that (except the BTRFS regression testing) is running 24/7, 
and that's just the local VM's, and doesn't include the file sharing or 
SAN services.  Root filesystems for all of these VM's are all on the 
SSD's, as is the host's root filesystem and swap partition, and many of 
the data partitions.  And I haven't really done any write optimization, 
and it's still less than 1GB/hour of writes to the SSD.  The typical 
user (including many types of server systems) will be writing much less 
than that most of the time.
>
> The smaller Crucials are much worse at that: The MX100 128GB version I
> had was specified for 85TB writes which I hit after about 12 months (97%
> lifetime used according to smartctl) due to excessive write patterns.
> I'm not sure how long it would have lasted but I decided to swap it for
> a Samsung 500GB drive, and reconfigure my system for much less write
> patterns.
>
> What should I say: I liked the Crucial more, first: It has an easy
> lifetime counter in smartctl, Samsung doesn't. And it had powerloss
> protection which Samsung doesn't explicitly mention (tho I think it has
> it).
>
> At least, according to endurance tests, my Samsung SSD should take
> about 1 PB of writes. I've already written 7 TB if I can trust the
> smartctl raw value.
>
> But I think you cannot compare specification values to a real endurance
> test... I think it says 150TBW for 500GB 850 EVO.
>
The point was more that wear out is less of an issue for a lot of people 
than many individuals make it out to be, not me trying to make Crucial 
sound like an amazing brand.  Yes, one of the Crucial MX100's may not 
last as long as a Samsung EVO in a busy mail server or something similar, 
but for a majority of people, they will probably outlast the usefulness 
of the computer.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-19 23:23                 ` Henk Slager
@ 2016-05-20 12:03                   ` Austin S. Hemmelgarn
  2016-05-20 17:02                     ` Ferry Toth
  2016-05-20 22:26                     ` Henk Slager
  0 siblings, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-20 12:03 UTC (permalink / raw)
  To: Henk Slager, linux-btrfs

On 2016-05-19 19:23, Henk Slager wrote:
> On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-05-19 14:09, Kai Krakow wrote:
>>>
>>> Am Wed, 18 May 2016 22:44:55 +0000 (UTC)
>>> schrieb Ferry Toth <ftoth@exalondelft.nl>:
>>>
>>>> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
>>> Bcache is actually low maintenance, no knobs to turn. Converting to
>>> bcache protective superblocks is a one-time procedure which can be done
>>> online. The bcache devices act as normal HDD if not attached to a
>>> caching SSD. It's really less pain than you may think. And it's a
>>> solution available now. Converting back later is easy: Just detach the
>>> HDDs from the SSDs and use them for some other purpose if you feel so
>>> later. Having the bcache protective superblock still in place doesn't
>>> hurt then. Bcache is a no-op without caching device attached.
>>
>> No, bcache is _almost_ a no-op without a caching device.  From a userspace
>> perspective, it does nothing, but it is still another layer of indirection
>> in the kernel, which does have a small impact on performance.  The same is
>> true of using LVM with a single volume taking up the entire partition, it
>> looks almost no different from just using the partition, but it will perform
>> worse than using the partition directly.  I've actually done profiling of
>> both to figure out base values for the overhead, and while bcache with no
>> cache device is not as bad as the LVM example, it can still be a roughly
>> 0.5-2% slowdown (it gets more noticeable the faster your backing storage
>> is).
>>
>> You also lose the ability to mount that filesystem directly on a kernel
>> without bcache support (this may or may not be an issue for you).
>
> The bcache (protective) superblock is in an 8KiB block in front of the
> file system device. In case the current, non-bcached HDD's use modern
> partitioning, you can do a 5-minute remove or add of bcache, without
> moving/copying filesystem data. So in case you have a bcache-formatted
> HDD that had just 1 primary partition (512 byte logical sectors), the
> partition start is at sector 2048 and the filesystem start is at 2064.
> Hard removing bcache (so making sure the module is not
> needed/loaded/used on the next boot) can be done by changing the
> start-sector of the partition from 2048 to 2064. In gdisk one has to
> change the alignment to 16 first, otherwise it refuses. And of
> course, first flush+stop+de-register bcache for the HDD.
>
> The other way around is also possible, i.e. changing the start-sector
> from 2048 to 2032. So that makes adding bcache to an existing
> filesystem a 5 minute action and not a GBs- or TBs copy action. It is
> not online of course, but just one reboot is needed (or just umount,
> gdisk, partprobe, add bcache etc).
> For RAID setups, one could just do 1 HDD first.
My argument about the overhead was not about the superblock, it was 
about the bcache layer itself.  It isn't practical to just access the 
data directly if you plan on adding a cache device, because then you 
couldn't do so online unless you're going through bcache.  This extra 
layer of indirection in the kernel does add overhead, regardless of the 
on-disk format.

Secondarily, having a HDD with just one partition is not a typical use 
case, and that argument about the slack space resulting from the 1M 
alignment only holds true if you're using an MBR instead of a GPT layout 
(or for that matter, almost any other partition table format), and 
you're not booting from that disk (because GRUB embeds itself there). 
It's also fully possible to have an MBR formatted disk which doesn't 
have any spare space there too (which is how most flash drives get 
formatted).

This also doesn't change the fact that without careful initial 
formatting (it is possible on some filesystems to embed the bcache SB at 
the beginning of the FS itself, many of them have some reserved space at 
the beginning of the partition for bootloaders, and this space doesn't 
have to exist when mounting the FS) or manual alteration of the 
partition, it's not possible to mount the FS on a system without bcache 
support.
>
> There is also a tool doing the conversion in-place (I haven't used it
> myself, my python(s) had trouble; I could do the partition table edit
> much faster/easier):
> https://github.com/g2p/blocks#bcache-conversion
>
I actually hadn't known about this tool, thanks for mentioning it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-20 12:03                   ` Austin S. Hemmelgarn
@ 2016-05-20 17:02                     ` Ferry Toth
  2016-05-20 17:59                       ` Austin S. Hemmelgarn
  2016-05-20 22:26                     ` Henk Slager
  1 sibling, 1 reply; 26+ messages in thread
From: Ferry Toth @ 2016-05-20 17:02 UTC (permalink / raw)
  To: linux-btrfs

Op Fri, 20 May 2016 08:03:12 -0400, schreef Austin S. Hemmelgarn:

> On 2016-05-19 19:23, Henk Slager wrote:
>> On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>> On 2016-05-19 14:09, Kai Krakow wrote:
>>>>
>>>> Am Wed, 18 May 2016 22:44:55 +0000 (UTC)
>>>> schrieb Ferry Toth <ftoth@exalondelft.nl>:
>>>>
>>>>> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
>>>> Bcache is actually low maintenance, no knobs to turn. Converting to
>>>> bcache protective superblocks is a one-time procedure which can be
>>>> done online. The bcache devices act as normal HDD if not attached to
>>>> a caching SSD. It's really less pain than you may think. And it's a
>>>> solution available now. Converting back later is easy: Just detach
>>>> the HDDs from the SSDs and use them for some other purpose if you
>>>> feel so later. Having the bcache protective superblock still in place
>>>> doesn't hurt then. Bcache is a no-op without caching device attached.
>>>
>>> No, bcache is _almost_ a no-op without a caching device.  From a
>>> userspace perspective, it does nothing, but it is still another layer
>>> of indirection in the kernel, which does have a small impact on
>>> performance.  The same is true of using LVM with a single volume
>>> taking up the entire partition, it looks almost no different from just
>>> using the partition, but it will perform worse than using the
>>> partition directly.  I've actually done profiling of both to figure
>>> out base values for the overhead, and while bcache with no cache
>>> device is not as bad as the LVM example, it can still be a roughly
>>> 0.5-2% slowdown (it gets more noticeable the faster your backing
>>> storage is).
>>>
>>> You also lose the ability to mount that filesystem directly on a
>>> kernel without bcache support (this may or may not be an issue for
>>> you).
>>
>> The bcache (protective) superblock is in an 8KiB block in front of the
>> file system device. In case the current, non-bcached HDD's use modern
>> partitioning, you can do a 5-minute remove or add of bcache, without
>> moving/copying filesystem data. So in case you have a bcache-formatted
>> HDD that had just 1 primary partition (512 byte logical sectors), the
>> partition start is at sector 2048 and the filesystem start is at 2064.
>> Hard removing bcache (so making sure the module is not
>> needed/loaded/used on the next boot) can be done by changing the
>> start-sector of the partition from 2048 to 2064. In gdisk one has to
>> change the alignment to 16 first, otherwise it refuses. And of
>> course, first flush+stop+de-register bcache for the HDD.
>>
>> The other way around is also possible, i.e. changing the start-sector
>> from 2048 to 2032. So that makes adding bcache to an existing
>> filesystem a 5 minute action and not a GBs- or TBs copy action. It is
>> not online of course, but just one reboot is needed (or just umount,
>> gdisk, partprobe, add bcache etc).
>> For RAID setups, one could just do 1 HDD first.
> My argument about the overhead was not about the superblock, it was
> about the bcache layer itself.  It isn't practical to just access the
> data directly if you plan on adding a cache device, because then you
> couldn't do so online unless you're going through bcache.  This extra
> layer of indirection in the kernel does add overhead, regardless of the
> on-disk format.
> 
> Secondarily, having a HDD with just one partition is not a typical use
> case, and that argument about the slack space resulting from the 1M
> alignment only holds true if you're using an MBR instead of a GPT layout
> (or for that matter, almost any other partition table format), and
> you're not booting from that disk (because GRUB embeds itself there).
> It's also fully possible to have an MBR formatted disk which doesn't
> have any spare space there too (which is how most flash drives get
> formatted).

We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4, 
then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs 
partitions are in the same pool, which is in btrfs RAID10 format. /boot 
is in subvolume @boot.

In this configuration nothing would beat btrfs if I could just add 2 
SSD's to the pool that would be clever enough to be paired in RAID1 and 
would be preferred for small (<1GB) file writes. Then balance should be 
able to move not often used files to the HDD.
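[For what it's worth, the device-adding half of that wish already works
today; a sketch, with placeholder device paths:

  btrfs device add /dev/sde /dev/sdf /mnt   # the SSDs join the pool like any other disk
  btrfs balance start /mnt                  # restripe existing raid10 chunks across all six devices

What is missing is exactly the part asked for here: the allocator treats
the SSDs the same as the HDDs, with no hot-data preference.]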

None of the methods mentioned here sound easy or quick to do, or even 
well tested.

> This also doesn't change the fact that without careful initial
> formatting (it is possible on some filesystems to embed the bcache SB at
> the beginning of the FS itself, many of them have some reserved space at
> the beginning of the partition for bootloaders, and this space doesn't
> have to exist when mounting the FS) or manual alteration of the
> partition, it's not possible to mount the FS on a system without bcache
> support.
>>
>> There is also a tool doing the conversion in-place (I haven't used it
>> myself, my python(s) had trouble; I could do the partition table edit
>> much faster/easier):
>> https://github.com/g2p/blocks#bcache-conversion
>>
> I actually hadn't known about this tool, thanks for mentioning it.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-20 17:02                     ` Ferry Toth
@ 2016-05-20 17:59                       ` Austin S. Hemmelgarn
  2016-05-20 21:31                         ` Henk Slager
  2016-05-29  6:23                         ` Andrei Borzenkov
  0 siblings, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-20 17:59 UTC (permalink / raw)
  To: Ferry Toth, linux-btrfs

On 2016-05-20 13:02, Ferry Toth wrote:
> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
> partitions are in the same pool, which is in btrfs RAID10 format. /boot
> is in subvolume @boot.
If you have GRUB installed on all 4, then you don't actually have the 
full 2047 sectors between the MBR and the partition free, as GRUB is 
embedded in that space.  I forget exactly how much space it takes up, 
but I know it's not the whole 1023.5K.  I would not suggest risking usage 
of the final 8k there though.  You could however convert to raid1 
temporarily, and then for each device, delete it, reformat for bcache, 
then re-add it to the FS.  This may take a while, but should be safe (of 
course, it's only an option if you're already using a kernel with bcache 
support).
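[A rough sketch of one cycle of that approach, with placeholder device
names (sda3 being the btrfs partition on the disk currently being
converted):

  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt   # raid1 tolerates running on 3 of the 4 disks
  btrfs device delete /dev/sda3 /mnt    # shrink the array onto the remaining disks
  make-bcache -B /dev/sda3              # give the freed partition a bcache backing superblock
  btrfs device add /dev/bcache0 /mnt    # re-add it through the bcache layer (udev registers /dev/bcacheN)
  # repeat for the other disks, then convert back:
  # btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt
]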
> In this configuration nothing would beat btrfs if I could just add 2
> SSD's to the pool that would be clever enough to be paired in RAID1 and
> would be preferred for small (<1GB) file writes. Then balance should be
> able to move not often used files to the HDD.
>
> None of the methods mentioned here sound easy or quick to do, or even
> well tested.
It really depends on what you're used to.  I would consider most of the 
options easy, but one of the areas I'm strongest with is storage 
management, and I've repaired damaged filesystems and partition tables 
by hand with a hex editor before, so I'm not necessarily a typical user. 
  If I was going to suggest something specifically, it would be 
dm-cache, because it requires no modification to the backing store at 
all, but that would require running on LVM if you want it to be easy to 
set up (it's possible to do it without LVM, but you need something to 
call dmsetup before mounting the filesystem, which is not easy to 
configure correctly), and if you're on an enterprise distro, it may not 
be supported.
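[The LVM route to dm-cache is roughly the following once the filesystem
already sits on a logical volume; VG/LV names and sizes are placeholders,
see lvmcache(7) for the authoritative steps:

  pvcreate /dev/sdc1                    # the SSD, or a partition on it
  vgextend vg0 /dev/sdc1                # add it to the existing volume group
  lvcreate --type cache-pool -L 100G -n cpool vg0 /dev/sdc1
  lvconvert --type cache --cachepool vg0/cpool vg0/data   # 'data' being the LV under the btrfs
]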

If you wanted to, it's possible, and not all that difficult, to convert 
a BTRFS system to BTRFS on top of LVM online, but you would probably 
have to split out the boot subvolume to a separate partition (depends on 
which distro you're on, some have working LVM support in GRUB, some 
don't).  If you're on a distro which does have LVM support in GRUB, the 
procedure would be:
1. Convert the BTRFS array to raid1. This lets you run with only 3 disks 
instead of 4.
2. Delete one of the disks from the array.
3. Convert the disk you deleted from the array to a LVM PV and add it to 
a VG.
4. Create a new logical volume occupying almost all of the PV you just 
added (having a little slack space is usually a good thing).
5. Use btrfs replace to add the LV to the BTRFS array while deleting 
one of the others.
6. Repeat from step 3-5 for each disk, but stop at step 4 when you have 
exactly one disk that isn't on LVM (so for four disks, stop at step four 
when you have 2 with BTRFS+LVM, one with just the LVM logical volume, 
and one with just BTRFS).
7. Reinstall GRUB (it should pull in LVM support now).
8. Use BTRFS replace to move the final BTRFS disk to the empty LVM volume.
9. Convert the now empty final disk to LVM using steps 3-4
10. Add the LV to the BTRFS array and rebalance to raid10.
11. Reinstall GRUB again (just to be certain).
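
[Assuming the four-disk layout described earlier, the first pass through
steps 1-5 might look roughly like this; device, VG and LV names are
placeholders:

  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt   # step 1
  btrfs device delete /dev/sdd3 /mnt                         # step 2
  pvcreate /dev/sdd3                                         # step 3
  vgcreate vg0 /dev/sdd3
  lvcreate -l 95%FREE -n btrfs0 vg0                          # step 4, leaving a little slack
  btrfs replace start /dev/sdc3 /dev/vg0/btrfs0 /mnt         # step 5, migrate another member onto LVM
]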

I've done essentially the same thing on numerous occasions when 
reprovisioning for various reasons, and it's actually one of the things 
outside of the xfstests that I check with my regression testing 
(including simulating a couple of the common failure modes).  It takes a 
while (especially for big arrays with lots of data), but it works, and 
is relatively safe (you are guaranteed to be able to rebuild a raid1 
array of 3 disks from just 2, so losing the disk in the process of 
copying it will not result in data loss unless you hit a kernel bug).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-20 17:59                       ` Austin S. Hemmelgarn
@ 2016-05-20 21:31                         ` Henk Slager
  2016-05-29  6:23                         ` Andrei Borzenkov
  1 sibling, 0 replies; 26+ messages in thread
From: Henk Slager @ 2016-05-20 21:31 UTC (permalink / raw)
  To: linux-btrfs

On Fri, May 20, 2016 at 7:59 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-05-20 13:02, Ferry Toth wrote:
>>
>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>> is in subvolume @boot.
>
> If you have GRUB installed on all 4, then you don't actually have the full
> 2047 sectors between the MBR and the partition free, as GRUB is embedded in
> that space.  I forget exactly how much space it takes up, but I know it's
> not the whole 1023.5K  I would not suggest risking usage of the final 8k
> there though.  You could however convert to raid1 temporarily, and then for
> each device, delete it, reformat for bcache, then re-add it to the FS.  This
> may take a while, but should be safe (of course, it's only an option if
> you're already using a kernel with bcache support).

There is more than enough space in that 2047-sector area for
inserting a bcache SB, but initially I also found it risky and was not
so sure. I don't want GRUB in the MBR anyhow, but in the filesystem/OS
partition that it should boot; otherwise multi-OS setups on the same
SSD or HDD get into trouble.

For the described system, assuming a few minutes offline or
'maintenance' mode is acceptable, I personally would just shrink the
swap by 8KiB, lower its end-sector by 16 and also lower the
start-sector of the btrfs partition by 16 and then add bcache. The
location of GRUB should not matter actually.
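
[A hedged sketch of that shuffle for one of the MBR disks; device names
and partition numbers are only illustrative, double-check the dump before
writing it back, and note that re-running mkswap changes the swap UUID
referenced in fstab:

  swapoff /dev/sda2
  sfdisk -d /dev/sda > sda.layout     # dump the current partition table
  # edit sda.layout: reduce the swap partition's size= by 16 sectors, then
  # lower the btrfs partition's start= by 16 and grow its size= by 16 so its end stays put
  sfdisk --no-reread /dev/sda < sda.layout && partprobe /dev/sda
  mkswap /dev/sda2 && swapon /dev/sda2
  make-bcache -B /dev/sda3            # default data offset is 8KiB, so the existing FS should line up;
                                      # verify with bcache-super-show before mounting
]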

>> In this configuration nothing would beat btrfs if I could just add 2
>> SSD's to the pool that would be clever enough to be paired in RAID1 and
>> would be preferred for small (<1GB) file writes. Then balance should be
>> able to move not often used files to the HDD.
>>
>> None of the methods mentioned here sound easy or quick to do, or even
>> well tested.

I agree that all the methods are actually quite complicated,
especially compared to ZFS and its tools. Adding an L2ARC cache device
there is as simple and easy as what you want and describe.

The statement I wanted to make is that adding bcache for a (btrfs)
file-system can be done without touching the FS itself, provided that
one can allow some offline time for the FS.

> It really depends on what you're used to.  I would consider most of the
> options easy, but one of the areas I'm strongest with is storage management,
> and I've repaired damaged filesystems and partition tables by hand with a
> hex editor before, so I'm not necessarily a typical user.  If I was going to
> suggest something specifically, it would be dm-cache, because it requires no
> modification to the backing store at all, but that would require running on
> LVM if you want it to be easy to set up (it's possible to do it without LVM,
> but you need something to call dmsetup before mounting the filesystem, which
> is not easy to configure correctly), and if you're on an enterprise distro,
> it may not be supported.
>
> If you wanted to, it's possible, and not all that difficult, to convert a
> BTRFS system to BTRFS on top of LVM online, but you would probably have to
> split out the boot subvolume to a separate partition (depends on which
> distro you're on, some have working LVM support in GRUB, some don't).  If
> you're on a distro which does have LVM support in GRUB, the procedure would
> be:
> 1. Convert the BTRFS array to raid1. This lets you run with only 3 disks
> instead of 4.
> 2. Delete one of the disks from the array.
> 3. Convert the disk you deleted from the array to a LVM PV and add it to a
> VG.
> 4. Create a new logical volume occupying almost all of the PV you just added
> (having a little slack space is usually a good thing).
> 5. Use btrfs replace to add the LV to the BTRFS array while deleting one
> of the others.
> 6. Repeat from step 3-5 for each disk, but stop at step 4 when you have
> exactly one disk that isn't on LVM (so for four disks, stop at step four
> when you have 2 with BTRFS+LVM, one with just the LVM logical volume, and
> one with just BTRFS).
> 7. Reinstall GRUB (it should pull in LVM support now).
> 8. Use BTRFS replace to move the final BTRFS disk to the empty LVM volume.
> 9. Convert the now empty final disk to LVM using steps 3-4
> 10. Add the LV to the BTRFS array and rebalance to raid10.
> 11. Reinstall GRUB again (just to be certain).
>
> I've done essentially the same thing on numerous occasions when
> reprovisioning for various reasons, and it's actually one of the things
> outside of the xfstests that I check with my regression testing (including
> simulating a couple of the common failure modes).  It takes a while
> (especially for big arrays with lots of data), but it works, and is
> relatively safe (you are guaranteed to be able to rebuild a raid1 array of 3
> disks from just 2, so losing the disk in the process of copying it will not
> result in data loss unless you hit a kernel bug).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-20 12:03                   ` Austin S. Hemmelgarn
  2016-05-20 17:02                     ` Ferry Toth
@ 2016-05-20 22:26                     ` Henk Slager
  2016-05-23 11:32                       ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 26+ messages in thread
From: Henk Slager @ 2016-05-20 22:26 UTC (permalink / raw)
  To: linux-btrfs

>>>> bcache protective superblocks is a one-time procedure which can be done
>>>> online. The bcache devices act as normal HDD if not attached to a
>>>> caching SSD. It's really less pain than you may think. And it's a
>>>> solution available now. Converting back later is easy: Just detach the
>>>> HDDs from the SSDs and use them for some other purpose if you feel so
>>>> later. Having the bcache protective superblock still in place doesn't
>>>> hurt then. Bcache is a no-op without caching device attached.
>>>
>>>
>>> No, bcache is _almost_ a no-op without a caching device.  From a
>>> userspace
>>> perspective, it does nothing, but it is still another layer of
>>> indirection
>>> in the kernel, which does have a small impact on performance.  The same
>>> is
>>> true of using LVM with a single volume taking up the entire partition, it
>>> looks almost no different from just using the partition, but it will
>>> perform
>>> worse than using the partition directly.  I've actually done profiling of
>>> both to figure out base values for the overhead, and while bcache with no
>>> cache device is not as bad as the LVM example, it can still be a roughly
>>> 0.5-2% slowdown (it gets more noticeable the faster your backing storage
>>> is).
>>>
>>> You also lose the ability to mount that filesystem directly on a kernel
>>> without bcache support (this may or may not be an issue for you).
>>
>>
>> The bcache (protective) superblock is in an 8KiB block in front of the
>> file system device. In case the current, non-bcached HDD's use modern
>> partitioning, you can do a 5-minute remove or add of bcache, without
>> moving/copying filesystem data. So in case you have a bcache-formatted
>> HDD that had just 1 primary partition (512 byte logical sectors), the
>> partition start is at sector 2048 and the filesystem start is at 2064.
>> Hard removing bcache (so making sure the module is not
>> needed/loaded/used on the next boot) can be done by changing the
>> start-sector of the partition from 2048 to 2064. In gdisk one has to
>> change the alignment to 16 first, otherwise it refuses. And of
>> course, first flush+stop+de-register bcache for the HDD.
>>
>> The other way around is also possible, i.e. changing the start-sector
>> from 2048 to 2032. So that makes adding bcache to an existing
>> filesystem a 5 minute action and not a GBs- or TBs copy action. It is
>> not online of course, but just one reboot is needed (or just umount,
>> gdisk, partprobe, add bcache etc).
>> For RAID setups, one could just do 1 HDD first.
>
> My argument about the overhead was not about the superblock, it was about
> the bcache layer itself.  It isn't practical to just access the data
> directly if you plan on adding a cache device, because then you couldn't do
> so online unless you're going through bcache.  This extra layer of
> indirection in the kernel does add overhead, regardless of the on-disk
> format.

Yes, sorry, I took some shortcut in the discussion and jumped to a
method for avoiding this 0.5-2% slowdown that you mention. (Or a
kernel crashing in bcache code due to corrupt SB on a backing device
or corrupted caching device contents).
I am actually a bit surprised that there is a measurable slowdown,
considering that it is basically just one 8KiB offset at a certain
layer in the kernel stack, but I haven't looked at that code.

> Secondarily, having a HDD with just one partition is not a typical use case,
> and that argument about the slack space resulting from the 1M alignment only
> holds true if you're using an MBR instead of a GPT layout (or for that
> matter, almost any other partition table format), and you're not booting
> from that disk (because GRUB embeds itself there). It's also fully possible
> to have an MBR formatted disk which doesn't have any spare space there too
> (which is how most flash drives get formatted).

I don't know tables other than MBR and GPT, but this bcache SB
'insertion' works with both. Indeed, if GRUB is involved, it can get
complicated; I have avoided that. If there is less than 8KiB of slack
space on a HDD, I would worry about alignment/performance first, as
there is then likely a reason to fully rewrite the HDD with a standard
1M alignment.
If there are more partitions and one of them sits in front of the one
you would like to be bcached, I would personally shrink it by 8KiB
(whether it is NTFS or swap or ext4) if that saves me terabytes of
data transfers.

> This also doesn't change the fact that without careful initial formatting
> (it is possible on some filesystems to embed the bcache SB at the beginning
> of the FS itself, many of them have some reserved space at the beginning of
> the partition for bootloaders, and this space doesn't have to exist when
> mounting the FS) or manual alteration of the partition, it's not possible to
> mount the FS on a system without bcache support.

If we consider a non-bootable single-HDD btrfs FS, are you then
suggesting that the bcache SB could be placed in the first 64KiB, where
GRUB also stores its code if the FS needs to be bootable?
That would be interesting: it would mean that for btrfs on a raw
device (and also multi-device) no extra exclusive 8KiB of space is
needed in front.
Has someone got this working? I think it would lead to issues
at the block layer, but I currently have no clue about that.

>> There is also a tool doing the conversion in-place (I haven't used it
>> myself, my python(s) had trouble; I could do the partition table edit
>> much faster/easier):
>> https://github.com/g2p/blocks#bcache-conversion
>>
> I actually hadn't known about this tool, thanks for mentioning it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-20 22:26                     ` Henk Slager
@ 2016-05-23 11:32                       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-23 11:32 UTC (permalink / raw)
  To: Henk Slager, linux-btrfs

On 2016-05-20 18:26, Henk Slager wrote:
> Yes, sorry, I took some shortcut in the discussion and jumped to a
> method for avoiding this 0.5-2% slowdown that you mention. (Or a
> kernel crashing in bcache code due to corrupt SB on a backing device
> or corrupted caching device contents).
> I am actually bit surprised that there is a measurable slowdown,
> considering that it is basically just one 8KiB offset on a certain
> layer in the kernel stack, but I haven't looked at that code.
There's still a layer of indirection in the kernel code, even in the 
pass-through mode with no cache, and that's probably where the slowdown 
comes from.  My testing was also in a VM with its backing device on an 
SSD though, so you may get different results on other hardware.
> I don't know other tables than MBR and GPT, but this bcache SB
> 'insertion' works with both. Indeed, if GRUB is involved, it can get
> complicated, I have avoided that. If there is less than 8KiB slack
> space on a HDD, I would worry about alignment/performance first, then
> there is likely a reason to fully rewrite the HDD with a standard 1M
> alingment.
The 'alignment' thing is mostly bogus these days.  It originated when 
1M was a full track on the disk, and you wanted your filesystem to start 
on the beginning of a track for performance reasons.  On most modern 
disks though, this is not a full track, but it got kept because a number 
of bootloaders (GRUB included) used to use the slack space this caused 
to embed themselves before the filesystem.  The only case where 1M 
alignment actually makes sense is on SSD's with a 1M erase block size 
(which are rare, most consumer devices have a 4M erase block).  As far 
as partition tables, you're not likely to see any other formats these 
days (the only ones I've dealt with other than MBR and GPT are APM (the 
old pre-OSX Apple format), RDB (the Amiga format, which is kind of neat 
because it can embed drivers), and the old Sun disk labels (from before 
SunOS became Solaris)), and I had actually forgotten that a GPT is only 
32k, hence my comment about it potentially being an issue.
> If there is more partitions and the partition in front of the one you
> would like to be bcached, I personally would shrink it by 8KiB (like
> NTFS or swap or ext4 ) if that saves me TeraBytes of datatransfers.
Definitely, although depending on how the system is set up, this will 
almost certainly need down time.
>
>> This also doesn't change the fact that without careful initial formatting
>> (it is possible on some filesystems to embed the bcache SB at the beginning
>> of the FS itself, many of them have some reserved space at the beginning of
>> the partition for bootloaders, and this space doesn't have to exist when
>> mounting the FS) or manual alteration of the partition, it's not possible to
>> mount the FS on a system without bcache support.
>
> If we consider a non-bootable single HDD btrfs FS, are you then
> suggesting that the bcache SB could be placed in the first 64KiB where
> also GRUB stores its code if the FS would need booting ?
> That would be interesting, it would mean that also for btrfs on raw
> device (and also multi-device) there is no extra exclusive 8KiB space
> needed in front.
> Is there someone who has this working? I think it would lead to issues
> on the blocklayer, but I have currently no clue about that.
I don't think it would work on BTRFS: we expect the SB at a fixed 
location on the device, and it wouldn't be there on the bcache device. 
  It might work on ext4 though, but I'm not certain about that.  I do 
know of at least one person who got it working with a FAT32 filesystem 
as a proof of concept though.  Trying to do that even if it would work 
on BTRFS would be _really_ risky though, because the kernel would 
potentially see both devices, and you would probably have the same 
issues that you do with block level copies.
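[For context on the fixed location: btrfs keeps its primary superblock
64KiB into the block device, so anything that shifts the data, such as an
8KiB bcache header at the front, means the kernel no longer finds it on
the raw device. A quick way to look at it (output varies by btrfs-progs
version; the device path is a placeholder):

  btrfs inspect-internal dump-super /dev/sda3 | head
  # the magic sits 64 bytes into that superblock, i.e. at byte 65600 of the device:
  dd if=/dev/sda3 bs=1 skip=65600 count=8 2>/dev/null | od -An -c    # -> _ B H R f S _ M
]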

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-20 17:59                       ` Austin S. Hemmelgarn
  2016-05-20 21:31                         ` Henk Slager
@ 2016-05-29  6:23                         ` Andrei Borzenkov
  2016-05-29 17:53                           ` Chris Murphy
  1 sibling, 1 reply; 26+ messages in thread
From: Andrei Borzenkov @ 2016-05-29  6:23 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Ferry Toth, linux-btrfs

20.05.2016 20:59, Austin S. Hemmelgarn пишет:
> On 2016-05-20 13:02, Ferry Toth wrote:
>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>> is in subvolume @boot.
> If you have GRUB installed on all 4, then you don't actually have the
> full 2047 sectors between the MBR and the partition free, as GRUB is
> embedded in that space.  I forget exactly how much space it takes up,
> but I know it's not the whole 1023.5K  I would not suggest risking usage
> of the final 8k there though.

If you mean grub2, required space is variable and depends on where
/boot/grub is located (i.e. which drivers it needs to access it).
Assuming plain btrfs on legacy BIOS MBR, required space is around 40-50KB.

Note that grub2 detects some post-MBR gap software signatures and skips
over them (space need not be contiguous). It is entirely possible to add
bcache detection if enough demand exists.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-29  6:23                         ` Andrei Borzenkov
@ 2016-05-29 17:53                           ` Chris Murphy
  2016-05-29 18:03                             ` Holger Hoffstätte
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2016-05-29 17:53 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Austin S. Hemmelgarn, Ferry Toth, Btrfs BTRFS

On Sun, May 29, 2016 at 12:23 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> 20.05.2016 20:59, Austin S. Hemmelgarn пишет:
>> On 2016-05-20 13:02, Ferry Toth wrote:
>>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>>> is in subvolume @boot.
>> If you have GRUB installed on all 4, then you don't actually have the
>> full 2047 sectors between the MBR and the partition free, as GRUB is
>> embedded in that space.  I forget exactly how much space it takes up,
>> but I know it's not the whole 1023.5K  I would not suggest risking usage
>> of the final 8k there though.
>
> If you mean grub2, required space is variable and depends on where
> /boot/grub is located (i.e. which drivers it needs to access it).
> Assuming plain btrfs on legacy BIOS MBR, required space is around 40-50KB.
>
> Note that grub2 detects some post-MBR gap software signatures and skips
> over them (space need not be contiguous). It is entirely possible to add
> bcache detection if enough demand exists.

Might not be a bad idea, just to avoid it getting stepped on and
causing later confusion. If it is stepped on I don't think there's
data loss except possibly in the case where there's an unclean
shutdown where the SSD has bcache data that hasn't been committed to
the HDD?

But I'm skeptical of bcache using a hidden area historically for the
bootloader, to put its device metadata. I didn't realize that was the
case. Imagine if LVM were to stuff metadata into the MBR gap, or
mdadm. Egads.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-29 17:53                           ` Chris Murphy
@ 2016-05-29 18:03                             ` Holger Hoffstätte
  2016-05-29 18:33                               ` Chris Murphy
  0 siblings, 1 reply; 26+ messages in thread
From: Holger Hoffstätte @ 2016-05-29 18:03 UTC (permalink / raw)
  To: linux-btrfs

On 05/29/16 19:53, Chris Murphy wrote:
> But I'm skeptical of bcache using a hidden area historically for the
> bootloader, to put its device metadata. I didn't realize that was the
> case. Imagine if LVM were to stuff metadata into the MBR gap, or
> mdadm. Egads.

On the matter of bcache in general this seems noteworthy:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667

bummer..

Holger


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-29 18:03                             ` Holger Hoffstätte
@ 2016-05-29 18:33                               ` Chris Murphy
  2016-05-29 20:45                                 ` Ferry Toth
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2016-05-29 18:33 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-btrfs

On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte
<holger@applied-asynchrony.com> wrote:
> On 05/29/16 19:53, Chris Murphy wrote:
>> But I'm skeptical of bcache using a hidden area historically for the
>> bootloader, to put its device metadata. I didn't realize that was the
>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>> mdadm. Egads.
>
> On the matter of bcache in general this seems noteworthy:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>
> bummer..

Well it doesn't mean no one will take it, just that no one has taken
it yet. But the future of SSD caching may only be with LVM.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-29 18:33                               ` Chris Murphy
@ 2016-05-29 20:45                                 ` Ferry Toth
  2016-05-31 12:21                                   ` Austin S. Hemmelgarn
  2016-06-01 10:45                                   ` Dmitry Katsubo
  0 siblings, 2 replies; 26+ messages in thread
From: Ferry Toth @ 2016-05-29 20:45 UTC (permalink / raw)
  To: linux-btrfs

Op Sun, 29 May 2016 12:33:06 -0600, schreef Chris Murphy:

> On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte
> <holger@applied-asynchrony.com> wrote:
>> On 05/29/16 19:53, Chris Murphy wrote:
>>> But I'm skeptical of bcache using a hidden area historically for the
>>> bootloader, to put its device metadata. I didn't realize that was the
>>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>>> mdadm. Egads.
>>
>> On the matter of bcache in general this seems noteworthy:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/
commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>>
>> bummer..
> 
> Well it doesn't mean no one will take it, just that no one has taken it
> yet. But the future of SSD caching may only be with LVM.
> 
> --
> Chris Murphy

I think all the above posts underline exactly my point: 

Instead of using a ssd cache (be it bcache or dm-cache) it would be much 
better to have the btrfs allocator be aware of ssd's in the pool and 
prioritize allocations to the ssd to maximize performance.

This will allow to easily add more ssd's or replace worn out ones, 
without the mentioned headaches. After all adding/replacing drives to a 
pool is one of btrfs's biggest advantages. 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-29 20:45                                 ` Ferry Toth
@ 2016-05-31 12:21                                   ` Austin S. Hemmelgarn
  2016-06-01 10:45                                   ` Dmitry Katsubo
  1 sibling, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-31 12:21 UTC (permalink / raw)
  To: Ferry Toth, linux-btrfs

On 2016-05-29 16:45, Ferry Toth wrote:
> Op Sun, 29 May 2016 12:33:06 -0600, schreef Chris Murphy:
>
>> On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte
>> <holger@applied-asynchrony.com> wrote:
>>> On 05/29/16 19:53, Chris Murphy wrote:
>>>> But I'm skeptical of bcache using a hidden area historically for the
>>>> bootloader, to put its device metadata. I didn't realize that was the
>>>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>>>> mdadm. Egads.
>>>
>>> On the matter of bcache in general this seems noteworthy:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/
> commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>>>
>>> bummer..
>>
>> Well it doesn't mean no one will take it, just that no one has taken it
>> yet. But the future of SSD caching may only be with LVM.
>>
>> --
>> Chris Murphy
>
> I think all the above posts underline exactly my point:
>
> Instead of using a ssd cache (be it bcache or dm-cache) it would be much
> better to have the btrfs allocator be aware of ssd's in the pool and
> prioritize allocations to the ssd to maximize performance.
>
> This will allow to easily add more ssd's or replace worn out ones,
> without the mentioned headaches. After all adding/replacing drives to a
> pool is one of btrfs's biggest advantages.
It would still need to be pretty configurable, and even then would still 
be a niche use case.  It would also need automatic migration to be 
practical beyond a certain point, most people using regular computers 
outside of corporate environments don't have that same 'access frequency 
decreases over time' pattern that the manual migration scheme you 
suggested would be good for.

I think overall the most useful way of doing it would be something like 
the L2ARC on ZFS, which is essentially swap space for the page-cache, 
put on an SSD.
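
[For comparison, this is roughly what the equivalent operation looks like
on ZFS today (pool name is a placeholder); the cache device is simply
added to the pool and the ARC/L2ARC machinery does the rest:

  zpool add tank cache /dev/sdf
  zpool iostat -v tank     # the cache vdev shows up with its own capacity and I/O stats
]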


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Hot data tracking / hybrid storage
  2016-05-29 20:45                                 ` Ferry Toth
  2016-05-31 12:21                                   ` Austin S. Hemmelgarn
@ 2016-06-01 10:45                                   ` Dmitry Katsubo
  1 sibling, 0 replies; 26+ messages in thread
From: Dmitry Katsubo @ 2016-06-01 10:45 UTC (permalink / raw)
  To: linux-btrfs, linux-btrfs-owner

On 2016-05-29 22:45, Ferry Toth wrote:
> Op Sun, 29 May 2016 12:33:06 -0600, schreef Chris Murphy:
> 
>> On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte
>> <holger@applied-asynchrony.com> wrote:
>>> On 05/29/16 19:53, Chris Murphy wrote:
>>>> But I'm skeptical of bcache using a hidden area historically for the
>>>> bootloader, to put its device metadata. I didn't realize that was 
>>>> the
>>>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>>>> mdadm. Egads.
>>> 
>>> On the matter of bcache in general this seems noteworthy:
>>> 
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>>> 
>>> bummer..
>> 
>> Well it doesn't mean no one will take it, just that no one has taken 
>> it
>> yet. But the future of SSD caching may only be with LVM.
>> 
> 
> I think all the above posts underline exactly my point:
> 
> Instead of using a ssd cache (be it bcache or dm-cache) it would be 
> much
> better to have the btrfs allocator be aware of ssd's in the pool and
> prioritize allocations to the ssd to maximize performance.
> 
> This will allow to easily add more ssd's or replace worn out ones,
> without the mentioned headaches. After all adding/replacing drives to a
> pool is one of btrfs's biggest advantages.

I would certainly vote for this feature. If I understand correctly, the
mirror is selected based on the PID of the btrfs worker thread [1], which
is simple but not the most effective approach. I would suggest keeping a
queue of read operations per physical device (perhaps reads and writes
should share the same queue). If a device is fast (and for an SSD that is
the case), its queue empties more quickly, which means it can be loaded
more heavily. The allocation logic would then simply put the next request
on the shortest queue. I think this would guarantee that most operations
are served by the SSD (or any other even faster technology that appears
in the future).

[1] 
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Better_data_balancing_over_multiple_devices_for_raid1.2F10_.28read.29
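
[As a userspace illustration of the shortest-queue idea (not how a kernel
implementation would look), the per-device in-flight counters the block
layer already exposes can be used to pick the least busy member:

  # print each device with its in-flight read/write counts, least busy first
  for d in sda sdb sdc sdd; do
    echo "$d $(cat /sys/block/$d/inflight)"
  done | sort -k2 -n | head -1
]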


-- 
With best regards,
Dmitry

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-06-01 10:45 UTC | newest]

Thread overview: 26+ messages
2016-05-15 12:12 Hot data tracking / hybrid storage Ferry Toth
2016-05-15 21:11 ` Duncan
2016-05-15 23:05   ` Kai Krakow
2016-05-17  6:27     ` Ferry Toth
2016-05-17 11:32       ` Austin S. Hemmelgarn
2016-05-17 18:33         ` Kai Krakow
2016-05-18 22:44           ` Ferry Toth
2016-05-19 18:09             ` Kai Krakow
2016-05-19 18:51               ` Austin S. Hemmelgarn
2016-05-19 21:01                 ` Kai Krakow
2016-05-20 11:46                   ` Austin S. Hemmelgarn
2016-05-19 23:23                 ` Henk Slager
2016-05-20 12:03                   ` Austin S. Hemmelgarn
2016-05-20 17:02                     ` Ferry Toth
2016-05-20 17:59                       ` Austin S. Hemmelgarn
2016-05-20 21:31                         ` Henk Slager
2016-05-29  6:23                         ` Andrei Borzenkov
2016-05-29 17:53                           ` Chris Murphy
2016-05-29 18:03                             ` Holger Hoffstätte
2016-05-29 18:33                               ` Chris Murphy
2016-05-29 20:45                                 ` Ferry Toth
2016-05-31 12:21                                   ` Austin S. Hemmelgarn
2016-06-01 10:45                                   ` Dmitry Katsubo
2016-05-20 22:26                     ` Henk Slager
2016-05-23 11:32                       ` Austin S. Hemmelgarn
2016-05-16 11:25 ` Austin S. Hemmelgarn
