linux-bcache.vger.kernel.org archive mirror
* Re: [GIT] Bcache version 12
       [not found] <1280519620.12031317482084581.JavaMail.root@shiva>
@ 2011-10-01 15:19 ` LuVar
  0 siblings, 0 replies; 25+ messages in thread
From: LuVar @ 2011-10-01 15:19 UTC (permalink / raw)
  To: Dan J Williams
  Cc: NeilBrown, Andreas Dilger, linux-bcache, linux-kernel,
	linux-fsdevel, rdunlap, axboe, akpm, Kent Overstreet

Hi there.

----- "Dan J Williams" <dan.j.williams@intel.com> wrote:

> On Fri, Sep 30, 2011 at 12:14 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> >> > Cache devices have a basically identical superblock as backing
> devices
> >> > though, and some of the registration code is shared, but cache
> devices
> >> > don't correspond to any block devices.
> >>
> >> Just like a raid0 is a virtual creation from two block devices?
>  Or
> >> some other meaning of "don't correspond"?
> >
> > No.
> >
> > Remember, you can hang multiple backing devices off a cache.
> >
> > Each backing device shows up as a new block device - i.e. if
> you're
> > caching /dev/sdb, you now use it as /dev/bcache0.
> >
> > But the SSD doesn't belong to any of those /dev/bcacheN devices.
> 
> So to clarify I read that as "it belongs to all of them".  The ssd
> (/dev/sda, for example) can cache the contents of N block devices,
> and
> to get to the cached version of each of those you go through
> /dev/bcache[0..N].  The problem you perceive is that an md device
> requires a 1:1 mapping of member devices to md devices.  So if we had
> /dev/sda and /dev/sdb in a cache configuration (/dev/md0) your
> concern
> is that if we simultaneously wanted a /dev/md1 that caches /dev/sda
> and /dev/sdc that md would not be able to handle it.
> 
> Is that the right interpretation?
> 
> I assume /dev/sda in the example would have some bcache-logical
> partitions to delineate the /dev/sdb and /dev/sdc cache data?  Which
> sounds similar to the logical partitions md handles now for external
> metadata.  I'm not proposing that cache-state metadata could be
> handled in userspace; it's too integral to the i/o path, just pointing
> out that having /dev/sda be a member of both /dev/md0 and /dev/md1 is
> possible.
> 
> >> > A cache set is a set of cache devices - i.e. SSDs. The primary
> >> > motivation for cache sets (as distinct from just caches) is to
> have
> >> > the ability to mirror only dirty data, and not clean data.
> >> >
> >> > i.e. if you're doing writeback caching of a raid6, your ssd is
> now a
> >> > single point of failure. You could use raid1 SSDs, but most of
> the data
> >> > in the cache is clean, so you don't need to mirror that... just
> the
> >> > dirty data.
> >>
> >> ...but you only incur that "mirror clean data" penalty once, and
> then
> >> it's just a normal raid1 mirroring writes, right?
> >
> > No idea what you mean...
> 
> /dev/md1 is a slow raid5 and /dev/md0 is a raid1 of two ssds.  Once
> /dev/md0 is synced the only mirror traffic is for incoming
> cache-dirtying writes and cache-clean read allocations.  We agree
> about incoming dirty-data, but you are saying you don't want to
> mirror
> read allocations?

Just one visualization of my understanding of a bcache set that mirrors only dirty data: http://147.175.167.212/~luvar/bcache/bcacheSSDset.png . If I am not wrong, the green and blue data are read allocations, for example. The dirty allocations are the red ones, and those should be mirrored across all SSDs in the mirror set to protect against an SSD failure.

On the other hand, the green, blue... data are already backed up on the raid6, so there is no need to mirror them across the SSD set. They only need to be on one SSD to provide a read speedup.

Hmmm (sci-fi): if read allocations (not dirty data) were mirrored in the SSD set, they could be used to improve cache read speed, sacrificing some SSD space. It would be great if the cache algorithm could mark really hot data to be mirrored for faster reads...
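In rough C, the policy I am describing could look like this (just an illustrative sketch with invented names, not actual bcache code):

enum alloc_kind {
	ALLOC_CLEAN,	/* read allocation: the data also lives on the backing raid6 */
	ALLOC_DIRTY,	/* writeback data: the only copy is in the cache set */
};

/* Dirty data must survive the loss of an SSD, so it is written to every
 * cache device in the set; clean data can always be re-read from the
 * backing device, so a single copy is enough. */
static unsigned int replicas_needed(enum alloc_kind kind, unsigned int nr_cache_devs)
{
	return kind == ALLOC_DIRTY ? nr_cache_devs : 1;
}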

> 
> >> See, if these things were just md devices, multiple cache device support
> would
> >> already be "done", or at least on its way by just stacking md
> devices.
> >>  Where "done" is probably an oversimplification.
> >
> > No, it really wouldn't save us anything. If all we wanted to do was
> > mirror everything, there'd be no point in implementing multiple
> cache
> > device support, and you'd just use bcache on top of md. We're
> > implementing something completely new!
> >
> > You read what I said about only mirroring dirty data... right?
> 
> I did but I guess I did not fully grok it.
> 
> >> >> In any case it certainly could be modelled in md - and if the
> modelling were
> >> >> not elegant (e.g. even device numbers for backing devices, odd
> device numbers
> >> >> for cache devices) we could "fix" md to make it more elegant.
> >> >
> >> > But we've no reason to create block devices for caches or have a
> 1:1
> >> > mapping - that'd be a serious step backwards in functionality.
> >>
> >> I don't follow that...  there's nothing that prevents having
> multiple
> >> superblocks per cache array.
> >
> > Multiple... superblocks? Do you mean partitioning up the cache, or
> do
> > you mean creating multiple block devices for a cache? Either way
> it's a
> > silly hack.
> >
> >> A couple reasons I'm probing the md angle.
> >>
> >> 1/ Since the backing devices are md devices it would be nice if
> all
> >> the user space assembly logic that has seeped into udev and dracut
> >> could be re-used for assembling bcache devices.  As it stands it
> seems
> >> bcache relies on in-kernel auto-assembly, which md has discouraged
> >> with the v1 superblock.
> >
> > md was doing in kernel probing, which bcache does not do. What
> bcache is
> > doing is centralizing all the code that touches the on disk
> > superblock/metadata. You want to change something in the superblock
> -
> > you just have to tell the kernel to do it for you. Otherwise not
> only
> > would there be duplication of code, it'd be impossible to do safely
> > without races or the userspace code screwing something up; only the
> > kernel knows and controls the state of everything.
> 
> Makes sense but there is a difference between the metadata that
> specifies the configuration and the metadata that tracks the state of
> the cache.  If that distinction is made then userspace can tell the
> kernel to run a block cache of blockdevA and blockdevB and the kernel
> only needs to handle the cache state metadata.
> 
> > Or do you expect the ext4 superblock to be managed in normal
> operation
> > by userspace tools?
> 
> No.
> 
> >> We even have nascent GUI support in
> >> gnome-disk-utility; it would be nice to harness some of that
> enabling
> >> momentum for this.
> >
> > I've got nothing against standardizing the userspace interfaces to
> make
> > life easier for things like gnome-disk-utility. Tell me what you
> want
> > and if it's sane I'll see about implementing it.
> 
> That's the point, userspace has some knowledge of how to interrogate
> and manage md devices.  A bcache device is brand new... maybe for
> good
> reason but that's what I'm trying to understand.
> 
> >> 2/ md supports multiple superblock formats and if you Google "ssd
> >> caching" you'll see that there may be other superblock formats
> that
> >> the Linux block-caching driver could be asked to support down the
> >> road.  And wouldn't it be nice if bcache had at least the option
> to
> >> support the on-disk format of whatever dm-cache is doing?
> >
> > That's pure fantasy. That's like expecting the ext4 code to mount a
> ntfs
> > filesystem!
> 
> No, there's portions of what bcache does that are similar to what md
> does.  Do we need to invent new multiple-device handling
> infrastructure for a block device driver?  But we are quickly
> approaching the "show me the code" portion of this discussion, so I
> need to go do more reading of bcache.
> 
> > There's a lot more to bcache's metadata than a superblock, there's
> a
> > journal and a full b-tree. A cache is going to need an index of
> some
> > kind.
> 
> Yes, but that can be independent of the configuration metadata.
> 
> --
> Dan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-10-06 17:58             ` Pavel Machek
@ 2011-10-10 12:35               ` LuVar
  0 siblings, 0 replies; 25+ messages in thread
From: LuVar @ 2011-10-10 12:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Pekka Enberg, linux-bcache, linux-kernel, linux-fsdevel, rdunlap,
	axboe, akpm, neilb, Kent Overstreet

Hi,

----- "Pavel Machek" <pavel@ucw.cz> wrote:

> Hi!
> 
> > It can cache filesystem metadata - it can cache anything.
> > 
> > Because bcache has its own superblock (much like md), it can
> guarantee
> > that bcache devices are consistent; this is particularly important
> if
> > you want to do writeback caching. You really don't want to
> accidentally
> > mount a filesystem that you were doing writeback caching on without
> the
> cache - bcache makes it impossible to do so accidentally.
> > 
> > Is any of that useful?
> 
> I guess some kind of benchmark would be nice....? I don't know what
> a fair workload for this is. System bootup? Kernel compile after
> reboot?

I guess a fair benchmark is ordinary work... For example, my PC often runs with uptimes of more than 10 days, so boot time is not critical for me. What is critical is the startup of my Eclipse with some workspace (work, school, personal...). For a more standard comparison, I suggest some "standard" (average user) workload for testing. Somewhere I have seen something like:
1. system bootup
2. startup of some programs (firefox, gimp, hugin)
3. do some work in the opened programs (driven by macros)
4. save all work and shut down the PC

Imho, measuring the time for something like this could be a relevant measure of general speedup => user-experience speedup. This workload could probably be simulated by running the phoronix test suite.

Measuring the speedup in some other quantity (kbps, iops...) probably runs into the problem of workload type (database, iozone...). A purely random benchmark will just measure how much bcache slows down random operations. To measure a speedup there has to be some repetitive IO reading from disk (starting the same programs...).

PS: I want to set aside some time to install bcache and try some simple benchmarks... I hope I will find time for this around Christmas...

> 
> 									Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures)
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-21  5:57           ` Kent Overstreet
@ 2011-10-06 17:58             ` Pavel Machek
  2011-10-10 12:35               ` LuVar
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2011-10-06 17:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Pekka Enberg, linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, neilb-l3A5Bk7waGM

Hi!

> It can cache filesystem metadata - it can cache anything.
> 
> Because bcache has its own superblock (much like md), it can guarantee
> that bcache devices are consistent; this is particularly important if
> you want to do writeback caching. You really don't want to accidentally
> mount a filesystem that you were doing writeback caching on without the
> cache - bcache makes it impossible to do so accidentally.
> 
> Is any of that useful?

I guess some kind of benchmark would be nice....? I don't know what
a fair workload for this is. System bootup? Kernel compile after reboot?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-30  7:14                           ` Kent Overstreet
@ 2011-09-30 19:47                             ` Williams, Dan J
  0 siblings, 0 replies; 25+ messages in thread
From: Williams, Dan J @ 2011-09-30 19:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: NeilBrown, Andreas Dilger, linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 30, 2011 at 12:14 AM, Kent Overstreet
<kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > Cache devices have a basically identical superblock as backing devices
>> > though, and some of the registration code is shared, but cache devices
>> > don't correspond to any block devices.
>>
>> Just like a raid0 is a virtual creation from two block devices?  Or
>> some other meaning of "don't correspond"?
>
> No.
>
> Remember, you can hang multiple backing devices off a cache.
>
> Each backing device shows up as a new block device - i.e. if you're
> caching /dev/sdb, you now use it as /dev/bcache0.
>
> But the SSD doesn't belong to any of those /dev/bcacheN devices.

So to clarify I read that as "it belongs to all of them".  The ssd
(/dev/sda, for example) can cache the contents of N block devices, and
to get to the cached version of each of those you go through
/dev/bcache[0..N].  The problem you perceive is that an md device
requires a 1:1 mapping of member devices to md devices.  So if we had
/dev/sda and /dev/sdb in a cache configuration (/dev/md0) your concern
is that if we simultaneously wanted a /dev/md1 that caches /dev/sda
and /dev/sdc that md would not be able to handle it.

Is that the right interpretation?

I assume /dev/sda in the example would have some bcache-logical
partitions to delineate the /dev/sdb and /dev/sdc cache data?  Which
sounds similar to the logical partitions md handles now for external
metadata.  I'm not proposing that cache-state metadata could be
handled in userspace; it's too integral to the i/o path, just pointing
out that having /dev/sda be a member of both /dev/md0 and /dev/md1 is
possible.

>> > A cache set is a set of cache devices - i.e. SSDs. The primary
>> > motivation for cache sets (as distinct from just caches) is to have
>> > the ability to mirror only dirty data, and not clean data.
>> >
>> > i.e. if you're doing writeback caching of a raid6, your ssd is now a
>> > single point of failure. You could use raid1 SSDs, but most of the data
>> > in the cache is clean, so you don't need to mirror that... just the
>> > dirty data.
>>
>> ...but you only incur that "mirror clean data" penalty once, and then
>> it's just a normal raid1 mirroring writes, right?
>
> No idea what you mean...

/dev/md1 is a slow raid5 and /dev/md0 is a raid1 of two ssds.  Once
/dev/md0 is synced the only mirror traffic is for incoming
cache-dirtying writes and cache-clean read allocations.  We agree
about incoming dirty-data, but you are saying you don't want to mirror
read allocations?

>> See, if these things were just md devices, multiple cache device support would
>> already be "done", or at least on its way by just stacking md devices.
>>  Where "done" is probably an oversimplification.
>
> No, it really wouldn't save us anything. If all we wanted to do was
> mirror everything, there'd be no point in implementing multiple cache
> device support, and you'd just use bcache on top of md. We're
> implementing something completely new!
>
> You read what I said about only mirroring dirty data... right?

I did but I guess I did not fully grok it.

>> >> In any case it certainly could be modelled in md - and if the modelling were
>> >> not elegant (e.g. even device numbers for backing devices, odd device numbers
>> >> for cache devices) we could "fix" md to make it more elegant.
>> >
>> > But we've no reason to create block devices for caches or have a 1:1
>> > mapping - that'd be a serious step backwards in functionality.
>>
>> I don't follow that...  there's nothing that prevents having multiple
>> superblocks per cache array.
>
> Multiple... superblocks? Do you mean partitioning up the cache, or do
> you mean creating multiple block devices for a cache? Either way it's a
> silly hack.
>
>> A couple reasons I'm probing the md angle.
>>
>> 1/ Since the backing devices are md devices it would be nice if all
>> the user space assembly logic that has seeped into udev and dracut
>> could be re-used for assembling bcache devices.  As it stands it seems
>> bcache relies on in-kernel auto-assembly, which md has discouraged
>> with the v1 superblock.
>
> md was doing in kernel probing, which bcache does not do. What bcache is
> doing is centralizing all the code that touches the on disk
> superblock/metadata. You want to change something in the superblock -
> you just have to tell the kernel to do it for you. Otherwise not only
> would there be duplication of code, it'd be impossible to do safely
> without races or the userspace code screwing something up; only the
> kernel knows and controls the state of everything.

Makes sense but there is a difference between the metadata that
specifies the configuration and the metadata that tracks the state of
the cache.  If that distinction is made then userspace can tell the
kernel to run a block cache of blockdevA and blockdevB and the kernel
only needs to handle the cache state metadata.
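To make that distinction concrete, a hypothetical split might look like this (names invented for illustration, nothing from either codebase):

/* Configuration metadata: which devices form the cache.  This is the part
 * userspace assembly tooling (udev/dracut-style) could own. */
struct cache_config {
	const char	*cache_dev;	/* e.g. "/dev/sda" (the SSD) */
	const char	*backing_dev;	/* e.g. "/dev/sdb" */
	unsigned int	writeback;	/* caching mode */
};

/* Cache-state metadata: what the kernel must own exclusively, because it
 * changes on every I/O and is integral to correctness. */
struct cache_state {
	unsigned long		*dirty_map;	/* which cached extents are dirty */
	unsigned long long	journal_seq;	/* last journal entry written */
};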

> Or do you expect the ext4 superblock to be managed in normal operation
> by userspace tools?

No.

>> We even have nascent GUI support in
>> gnome-disk-utility; it would be nice to harness some of that enabling
>> momentum for this.
>
> I've got nothing against standardizing the userspace interfaces to make
> life easier for things like gnome-disk-utility. Tell me what you want
> and if it's sane I'll see about implementing it.

That's the point, userspace has some knowledge of how to interrogate
and manage md devices.  A bcache device is brand new... maybe for good
reason but that's what I'm trying to understand.

>> 2/ md supports multiple superblock formats and if you Google "ssd
>> caching" you'll see that there may be other superblock formats that
>> the Linux block-caching driver could be asked to support down the
>> road.  And wouldn't it be nice if bcache had at least the option to
>> support the on-disk format of whatever dm-cache is doing?
>
> That's pure fantasy. That's like expecting the ext4 code to mount a ntfs
> filesystem!

No, there's portions of what bcache does that are similar to what md
does.  Do we need to invent new multiple-device handling
infrastructure for a block device driver?  But we are quickly
approaching the "show me the code" portion of this discussion, so I
need to go do more reading of bcache.

> There's a lot more to bcache's metadata than a superblock, there's a
> journal and a full b-tree. A cache is going to need an index of some
> kind.

Yes, but that can be independent of the configuration metadata.

--
Dan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
       [not found]                         ` <CAA9_cmfOdv4ozkz7bd2QsbL5_VtAraMZMXoo0AAV0eCgNQr62Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-09-30  7:14                           ` Kent Overstreet
  2011-09-30 19:47                             ` Williams, Dan J
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-30  7:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: NeilBrown, Andreas Dilger, linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 29, 2011 at 04:38:52PM -0700, Dan Williams wrote:
> On Tue, Sep 20, 2011 at 7:54 PM, Kent Overstreet
> > There is (for now) a 1:1 mapping of backing devices to block devices.
> 
> Is that "(for now)" where you see md not being able to model this in the future?

No, the for now was about bcache. I'm planning on adding volume
managament/thin provisioning to bcache, but that may end up being only a
stepping stone to a full fs (i.e. never 


> > Cache devices have a basically identical superblock as backing devices
> > though, and some of the registration code is shared, but cache devices
> > don't correspond to any block devices.
> 
> Just like a raid0 is a virtual creation from two block devices?  Or
> some other meaning of "don't correspond"?

No.

Remember, you can hang multiple backing devices off a cache.

Each backing device shows up as a new block device - i.e. if you're
caching /dev/sdb, you now use it as /dev/bcache0.

But the SSD doesn't belong to any of those /dev/bcacheN devices.
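Roughly, the relationships being described are these (type and field names invented purely for illustration, not the actual bcache structures):

struct cache;		/* one cache device, i.e. one SSD */
struct cached_dev;	/* one backing device being cached */

struct cache_set {
	struct cache		*ssds[8];	/* the shared pool of SSDs */
	struct cached_dev	*backing[16];	/* many backing devices per set */
};

struct cached_dev {
	const char		*backing_path;	/* e.g. "/dev/sdb" */
	const char		*exposed_path;	/* e.g. "/dev/bcache0" */
	struct cache_set	*set;		/* the SSDs serve all of them; none
						 * belongs to any single bcacheN */
};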

> > A cache set is a set of cache devices - i.e. SSDs. The primary
> > motivation for cache sets (as distinct from just caches) is to have
> > the ability to mirror only dirty data, and not clean data.
> >
> > i.e. if you're doing writeback caching of a raid6, your ssd is now a
> > single point of failure. You could use raid1 SSDs, but most of the data
> > in the cache is clean, so you don't need to mirror that... just the
> > dirty data.
> 
> ...but you only incur that "mirror clean data" penalty once, and then
> it's just a normal raid1 mirroring writes, right?

No idea what you mean...

> See, if these things were just md devices, multiple cache device support would
> already be "done", or at least on its way by just stacking md devices.
>  Where "done" is probably an oversimplification.

No, it really wouldn't save us anything. If all we wanted to do was
mirror everything, there'd be no point in implementing multiple cache
device support, and you'd just use bcache on top of md. We're
implementing something completely new!

You read what I said about only mirroring dirty data... right?

> >> In any case it certainly could be modelled in md - and if the modelling were
> >> not elegant (e.g. even device numbers for backing devices, odd device numbers
> >> for cache devices) we could "fix" md to make it more elegant.
> >
> > But we've no reason to create block devices for caches or have a 1:1
> > mapping - that'd be a serious step backwards in functionality.
> 
> I don't follow that...  there's nothing that prevents having multiple
> superblocks per cache array.

Multiple... superblocks? Do you mean partitioning up the cache, or do
you mean creating multiple block devices for a cache? Either way it's a
silly hack.

> A couple reasons I'm probing the md angle.
> 
> 1/ Since the backing devices are md devices it would be nice if all
> the user space assembly logic that has seeped into udev and dracut
> could be re-used for assembling bcache devices.  As it stands it seems
> bcache relies on in-kernel auto-assembly, which md has discouraged
> with the v1 superblock. 

md was doing in kernel probing, which bcache does not do. What bcache is
doing is centralizing all the code that touches the on disk
superblock/metadata. You want to change something in the superblock -
you just have to tell the kernel to do it for you. Otherwise not only
would there be duplication of code, it'd be impossible to do safely
without races or the userspace code screwing something up; only the
kernel knows and controls the state of everything.

Or do you expect the ext4 superblock to be managed in normal operation
by userspace tools?

> We even have nascent GUI support in
> gnome-disk-utility; it would be nice to harness some of that enabling
> momentum for this.

I've got nothing against standardizing the userspace interfaces to make
life easier for things like gnome-disk-utility. Tell me what you want
and if it's sane I'll see about implementing it.

> 2/ md supports multiple superblock formats and if you Google "ssd
> caching" you'll see that there may be other superblock formats that
> the Linux block-caching driver could be asked to support down the
> road.  And wouldn't it be nice if bcache had at least the option to
> support the on-disk format of whatever dm-cache is doing?

That's pure fantasy. That's like expecting the ext4 code to mount a ntfs
filesystem!

There's a lot more to bcache's metadata than a superblock, there's a
journal and a full b-tree. A cache is going to need an index of some
kind.

> > The way I see it md is more or less conflating two different things -
> > things that consume block devices
> 
> ...did the interwebs chomp the last part of that thought?

Yeah, was supposed to be "things that consume block devices and things
that provide them".

> Side question, what are the "Change Id:" lines referring to in the git
> commit messages?

Gerrit wants them, and I don't see the point of stripping them out for
the public tree.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-21  2:54                     ` Kent Overstreet
@ 2011-09-29 23:38                       ` Dan Williams
       [not found]                         ` <CAA9_cmfOdv4ozkz7bd2QsbL5_VtAraMZMXoo0AAV0eCgNQr62Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2011-09-29 23:38 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: NeilBrown, Andreas Dilger, linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Sep 20, 2011 at 7:54 PM, Kent Overstreet
<kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Does not each block device have a unique superblock (created by make-bcache)
>> on it?  That should define a clear 1-to-1 mapping....
>
> There is (for now) a 1:1 mapping of backing devices to block devices.

Is that "(for now)" where you see md not being able to model this in the future?

> Cache devices have a basically identical superblock as backing devices
> though, and some of the registration code is shared, but cache devices
> don't correspond to any block devices.

Just like a raid0 is a virtual creation from two block devices?  Or
some other meaning of "don't correspond"?

>> It isn't clear from the documentation what a 'cache set' is.  I think it is a
>> set of related cache devices.  But how do they relate to backing devices?
>> Is it one backing device per cache set?  Or can it be several backing devices
>> are all cached by one cache-set??
>
> Many backing devices per cache set, yes.
>
> A cache set is a set of cache devices - i.e. SSDs. The primary
> motivation for cache sets (as distinct from just caches) is to have
> the ability to mirror only dirty data, and not clean data.
>
> i.e. if you're doing writeback caching of a raid6, your ssd is now a
> single point of failure. You could use raid1 SSDs, but most of the data
> in the cache is clean, so you don't need to mirror that... just the
> dirty data.

...but you only incur that "mirror clean data" penalty once, and then
it's just a normal raid1 mirroring writes, right?

> Multiple cache device support isn't quite finished yet (there's not a
> lot of work to do, just lots of higher priorities). It looks like it's
> also going to be a useful abstraction for bcache FTL, too - we can treat
> multiple channels of an SSD as different devices for allocation
> purposes, we just won't expose it to the user in that case.

See, if these things were just md devices, multiple cache device support would
already be "done", or at least on its way by just stacking md devices.
 Where "done" is probably an oversimplification.

>> In any case it certainly could be modelled in md - and if the modelling were
>> not elegant (e.g. even device numbers for backing devices, odd device numbers
>> for cache devices) we could "fix" md to make it more elegant.
>
> But we've no reason to create block devices for caches or have a 1:1
> mapping - that'd be a serious step backwards in functionality.

I don't follow that...  there's nothing that prevents having multiple
superblocks per cache array.

A couple reasons I'm probing the md angle.

1/ Since the backing devices are md devices it would be nice if all
the user space assembly logic that has seeped into udev and dracut
could be re-used for assembling bcache devices.  As it stands it seems
bcache relies on in-kernel auto-assembly, which md has discouraged
with the v1 superblock.  We even have nascent GUI support in
gnome-disk-utility; it would be nice to harness some of that enabling
momentum for this.

2/ md supports multiple superblock formats and if you Google "ssd
caching" you'll see that there may be other superblock formats that
the Linux block-caching driver could be asked to support down the
road.  And wouldn't it be nice if bcache had at least the option to
support the on-disk format of whatever dm-cache is doing?

>> (Not that I'm necessarily advocating an md interface, but if I can understand
>> why you don't think md can work, then I might understand bcache better ....
>> or you might get to understand md better).
>
> And I still would like to have some generic infrastructure, if only I
> had the time to work on such things :)
>
> The way I see it md is more or less conflating two different things -
> things that consume block devices

...did the interwebs chomp the last part of that thought?

[..]
>> Also I don't think the code belongs in /block.  The CRC64 code should go
>> in /lib and the rest should either be in /drivers/block or
>> possible /drivers/md (as it makes a single device out of 'multiple devices'.
>> Obviously that isn't urgent, but should be fixed before it can be considered
>> to be ready.
>
> Yeah, moving it into drivers/block/bcache/ and splitting it up into
> different files is on the todo list (for some reason, one of the other
> guys working on bcache thinks a 9k line .c file is excessive :)

Not unheard of
$ cat drivers/scsi/ipr.c | wc -l
9237

> Pulling code out of bcache_util.[ch] and sending them as separate
> patches is also on the todo list - certainly the crc code and the rb
> tree code.
>
>> Is there some documentation on the format of the cache and the cache
>> replacement policy?  I couldn't easily find anything on your wiki.
>> Having that would make it much easier to review the code and to understand
>> pessimal workloads.
>
> Format of the cache - not sure what you mean, on disk format?
>
> Cache replacement policy is currently straight LRU. Someone else is
> supposed to start looking at more intelligent cache replacement policy
> soon, though I tend to think with most workloads and skipping sequential
> IO LRU is actually going to do pretty well.
>
>> Thanks,
>> NeilBrown
>
> Thanks for your time! I'll have new code and benchmarks up just as soon
> as I can, it really has been busy lately. Are there any benchmarks you'd
> be interested in in particular?
>

Side question, what are the "Change Id:" lines referring to in the git
commit messages?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-21  9:19     ` Arnd Bergmann
@ 2011-09-22  4:07       ` Kent Overstreet
  0 siblings, 0 replies; 25+ messages in thread
From: Kent Overstreet @ 2011-09-22  4:07 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-bcache, linux-kernel, linux-fsdevel, rdunlap, axboe, akpm, neilb

On Wed, Sep 21, 2011 at 11:19:04AM +0200, Arnd Bergmann wrote:
> On Tuesday 20 September 2011 20:44:16 Kent Overstreet wrote:
> > On Tue, Sep 20, 2011 at 05:37:05PM +0200, Arnd Bergmann wrote:
> > > On Saturday 10 September 2011, Kent Overstreet wrote:
> > > > Short overview:
> > > > Bcache does both writethrough and writeback caching. It presents itself
> > > > as a new block device, a bit like say md. You can cache an arbitrary
> > > > number of block devices with a single cache device, and attach and
> > > > detach things at runtime - it's quite flexible.
> > > > 
> > > > It's very fast. It uses a b+ tree for the index, along with a journal to
> > > > coalesce index updates, and a bunch of other cool tricks like auxiliary
> > > > binary search trees with software floating point keys to avoid a bunch
> > > > of random memory accesses when doing binary searches in the btree. It
> > > > does over 50k iops doing 4k random writes without breaking a sweat,
> > > > and would do many times that if I had faster hardware.
> > > > 
> > > > It (configurably) tracks and skips sequential IO, so as to efficiently
> > > > cache random IO. It's got more cool features than I can remember at this
> > > > point. It's resilient, handling IO errors from the SSD when possible up
> > > > to a configurable threshold, then detaches the cache from the backing
> > > > device even while you're still using it.
> > > 
> > > Hi Kent,
> > > 
> > > What kind of SSD hardware do you target here? I roughly categorize them
> > > into two classes, the low-end (USB, SDHC, CF, cheap ATA SSD) and the
> > > high-end (SAS, PCIe, NAS, expensive ATA SSD), which have extremely
> > > different characteristics. 
> > 
> > All of the above.
> > 
> > > I'm mainly interested in the first category, and a brief look at your
> > > code suggests that this is what you are indeed targetting. If that is
> > > true, can you name the specific hardware characteristics you require
> > > as a minimum? I.e. what erase block (bucket) sizes do you support
> > > (maximum size, non-power-of-two), how many buckets do you have
> > > open at the same time, and do you guarantee that each bucket is written
> > > in consecutive order?
> > 
> > Bucket size is set when you format your cache device. It is restricted
> > to powers of two (though the only reason for that restriction is to
> > avoid dividing by bucket size all over the place; if there was a
> > legitimate need we could easily see what the performance hit would be).
> 
> Note that odd erase block sizes are getting very common now, since TLC
> flash is being used for many consumer grade devices and these tend to
> have erase blocks of three times the equivalent SLC flash. That means you
> have to support bucket sizes of 1.5/3/6/12 MB eventually. I've seen
> a few devices that use very odd sizes like 4128KiB or 992KiB, or that
> misalign the erase blocks to the drive's sector number (i.e. the first
> erase block is smaller than the others). I would not recommend trying to
> support those.

Eesh. I hadn't heard that before, that's rather annoying. If 3x a power
of two is the norm though, I suppose I can just have sector_to_bucket()
do two shifts instead of one..
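(For context, the whole reason for the power-of-two restriction fits in a two-function sketch - illustrative only, not the real sector_to_bucket():)

#include <stdint.h>

/* Power-of-two bucket size: the hot-path mapping is a single shift. */
static inline uint64_t sector_to_bucket_pow2(uint64_t sector, unsigned int bucket_bits)
{
	return sector >> bucket_bits;
}

/* Arbitrary bucket size: every lookup pays for a 64-bit division, which
 * is exactly what the restriction avoids all over the place. */
static inline uint64_t sector_to_bucket_any(uint64_t sector, unsigned int bucket_sectors)
{
	return sector / bucket_sectors;
}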

> 2MB is rather small for devices made in 2011, the most common you'll 
> see now are 4MB and 8MB, and it's rising every year. Devices that use
> more channels in parallel like the Sandisk pSSD-P2 already use 16 MB
> erase blocks and performance drops sharply if you get it wrong there.

Yeah, I'm aware of the trend, it's annoying though. Bcache really wants
to know more about the internal topology of the SSD, if the SSD could
present a couple channels and not stripe them together bcache could
retain the benefits of striping by doing it itself (within reason; if
stripe size has to be too small that inflates the btree size) and get
the benefits of smaller buckets/erase blocks.

If you're ok with the internal fragmentation on disk in the btree nodes,
that should be the only serious drawback of 4-8 mb erase blocks. I'd
really hate to have to rework things to be able to store multiple btree
nodes in a bucket though, that would be painful.

We'd want the moving garbage collector for that too so we can get good
cache utilization; only trouble with that on a real SSD is you really
don't want to be moving data around at the same time the FTL is.

> I'd say that 16 (+1) open buckets is pushing it, *very* few devices can
> actually sustain that. Maybe you don't normally use all of them but instead
> have some buckets that see most of the incoming writes? I can't see how
> you'd avoid constant thrashing on cheap drives otherwise.

That's good to know. It's certainly true that in practice we don't
normally use them all, but it sounds like it'd be worth tweaking that.

> Ok, sounds great! I'll probably come back to this point once you
> have made it upstream. Right now I would not add more features in
> order to keep the code reasonably simple for review.

Right now, the only high priority feature is full data checksumming,
there's real demand for that.

Don't suppose you'd care to help review it so we can get it merged? ;)

> Have you thought about combining bcache with exofs? Your description sounds
> like what you have is basically an object based storage, so if you provide
> an interface that exofs can use, you don't need to worry about all the
> complicated VFS interactions.

I hadn't thought of exofs, that's a great idea. We'd have to fork it but
it looks like a great starting point, simple and roughly what we want.

> My impression is that you are on the right track for the cache, and that
> it would be good to combine this with a file system, but that it would
> be counterproductive to also want to support rotating disks or merging
> the high-level FS code into what you have now. The amount of research
> that has gone into these things is something you won't be able to
> match without having to sacrifice the stuff that you already do well.

A filesystem is certainly a ways down the road. I do think there's a lot
of potential though; I'm really happy with how the design of bcache has
evolved and there's a lot of elegance to the filesystem ideas.

Got to ship what we've got first, though :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-21  3:44   ` Kent Overstreet
@ 2011-09-21  9:19     ` Arnd Bergmann
  2011-09-22  4:07       ` Kent Overstreet
  0 siblings, 1 reply; 25+ messages in thread
From: Arnd Bergmann @ 2011-09-21  9:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, neilb-l3A5Bk7waGM

On Tuesday 20 September 2011 20:44:16 Kent Overstreet wrote:
> On Tue, Sep 20, 2011 at 05:37:05PM +0200, Arnd Bergmann wrote:
> > On Saturday 10 September 2011, Kent Overstreet wrote:
> > > Short overview:
> > > Bcache does both writethrough and writeback caching. It presents itself
> > > as a new block device, a bit like say md. You can cache an arbitrary
> > > number of block devices with a single cache device, and attach and
> > > detach things at runtime - it's quite flexible.
> > > 
> > > It's very fast. It uses a b+ tree for the index, along with a journal to
> > > coalesce index updates, and a bunch of other cool tricks like auxiliary
> > > binary search trees with software floating point keys to avoid a bunch
> > > of random memory accesses when doing binary searches in the btree. It
> > > does over 50k iops doing 4k random writes without breaking a sweat,
> > > and would do many times that if I had faster hardware.
> > > 
> > > It (configurably) tracks and skips sequential IO, so as to efficiently
> > > cache random IO. It's got more cool features than I can remember at this
> > > point. It's resilient, handling IO errors from the SSD when possible up
> > > to a configurable threshold, then detaches the cache from the backing
> > > device even while you're still using it.
> > 
> > Hi Kent,
> > 
> > What kind of SSD hardware do you target here? I roughly categorize them
> > into two classes, the low-end (USB, SDHC, CF, cheap ATA SSD) and the
> > high-end (SAS, PCIe, NAS, expensive ATA SSD), which have extremely
> > different characteristics. 
> 
> All of the above.
> 
> > I'm mainly interested in the first category, and a brief look at your
> > code suggests that this is what you are indeed targetting. If that is
> > true, can you name the specific hardware characteristics you require
> > as a minimum? I.e. what erase block (bucket) sizes do you support
> > (maximum size, non-power-of-two), how many buckets do you have
> > open at the same time, and do you guarantee that each bucket is written
> > in consecutive order?
> 
> Bucket size is set when you format your cache device. It is restricted
> to powers of two (though the only reason for that restriction is to
> avoid dividing by bucket size all over the place; if there was a
> legitimate need we could easily see what the performance hit would be).

Note that odd erase block sizes are getting very common now, since TLC
flash is being used for many consumer grade devices and these tend to
have erase blocks of three times the equivalent SLC flash. That means you
have to support bucket sizes of 1.5/3/6/12 MB eventually. I've seen
a few devices that use very odd sizes like 4128KiB or 992KiB, or that
misalign the erase blocks to the drive's sector number (i.e. the first
erase block is smaller than the others). I would not recommend trying to
support those.

> And it has to be >= PAGE_SIZE; come to think of it I don't think there's
> a hard upper bound. Performance should be reasonable for bucket sizes
> anywhere between 64k and around 2 mb; somewhere around 64k your btree
> will have a depth of 2 and that and the increased operations on non leaf
> nodes are going to hurt performance. Above around 2 mb and performance
> will start to drop as btree nodes get bigger, but the hit won't be
> enormous.

2MB is rather small for devices made in 2011, the most common you'll 
see now are 4MB and 8MB, and it's rising every year. Devices that use
more channels in parallel like the Sandisk pSSD-P2 already use 16 MB
erase blocks and performance drops sharply if you get it wrong there.

> For data buckets, we currently keep 16 open, 8 for clean data and 8 for
> dirty data. That's hard coded, but there's no reason it has to be. Btree
> nodes are in normal operation mostly not full and thus could be
> considered open buckets - it's always one btree node per bucket. IO to
> the btree is typically < 1% of total IO, though.
>
> Most metadata IO is to the journal; the journal uses a list of buckets
> and writes to them all sequentially, so one open bucket for the journal.
> 
> The one exception is the superblock, but that doesn't get written to in
> normal operation. I am eventually going to switch to using another
> journal for the superblock, as part of bcache FTL.
> 
> We do guarantee that buckets are always written to sequentially (save
> the superblock). If discards are on, bcache will always issue a discard
> before it starts writing to a bucket again (except for the journal, that
> part's unfinished).

I'd say that 16 (+1) open buckets is pushing it, *very* few devices can
actually sustain that. Maybe you don't normally use all of them but instead
have some buckets that see most of the incoming writes? I can't see how
you'd avoid constant thrashing on cheap drives otherwise.

In the data I've collected at
https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey you
can see that the median is 5 open erase blocks in current devices, so
that's probably what you should aim for in the low end. It's not hard
to measure during a format operation that you might already be doing,
so if you can make the number configurable, you should be able to get
more out of the low end without regressing on the real SSDs.

Some SD cards support only one open erase block (plus journal/FAT),
and it's probably not worth spending much thought about supporting
those. This affects all Kingston branded SDHC cards and their OEM
partners (some patriot, extrememory, danelec, emtec, integral --
however as these don't produce the cards themselves, the next shipment
could be better).

Fortunately, the number of open erase blocks in a device is increasing
over time, as manufacturers have to work around growing erase block sizes,
so this problem may become less significant in a few years.

> > On a different note, we had discussed at the last storage/fs summit about
> > using an SSD cache either without a backing store or having the backing
> > store on the same drive as the cache in order to optimize traditional
> > file system on low-end flash media. Have you considered these scenarios?
> > How hard would it be to support this in a meaningful way? My hope is that
> > by sacrificing some 10% of the drive size, you would get significantly
> > improved performance because you can avoid many internal GC cycles within
> > the drive.
> 
> Yeah, so... what you really want there is to move the FTL into the
> kernel, so you can have an FTL that doesn't suck. Bcache is already
> about 90% of the way to being a full blown high performance FTL...
> 
> Besides the metadata stuff that I sort of covered above, the other thing
> that'd have to be done to use bcache as an FTL and not a cache is we'd
> just need a moving garbage collector - so when a bucket is mostly
> empty but has data we need to keep we can move it somewhere else. But
> this is pretty similar to what background writeback does now, so it'll
> be easy and straightforward.
> 
> So yeah, it can be done :)

Ok, sounds great! I'll probably come back to this point once you
have made it upstream. Right now I would not add more features in
order to keep the code reasonably simple for review.

> Further off, what I really want to do is extend bcache somewhat to turn
> it into the bottom half of a filesystem...
> 
> It sounds kind of crazy at first, but - well, bcache already has an
> index, allocation and garbage collection. Right now the index is
> device:sector -> cache device:phys sector. If we just switch to
> inode:sector -> ... we can map files in the filesystem with the exact
> same index we're using for the cache. Not just the same code, the same
> index.
> 
> Then the rough plan is that layer above bcache - the filesystem proper -
> will store all the inodes in a file (say file 0); then when bcache is
> doing garbage collection it has to be able to ask the fs "How big is
> file n supposed to be?". It gives us a very nice separation between
> layers.

Have you thought about combining bcache with exofs? Your description sounds
like what you have is basically an object based storage, so if you provide
an interface that exofs can use, you don't need to worry about all the
complicated VFS interactions.

> There's a ton of other details - bcache then needs to handle allocation
> for rotating disks and not just SSDs, and you want to do that somewhat
> differently as fragmentation matters. But the idea seems to have legs.
> 
> Also, bcache is gaining some nifty functionality that'd be nice to have
> available in the filesystem proper - we're working on full data
> checksumming right now, in particular. We might be able to pull off all
> the features of ZFS and then some, and beat ZFS on performance (maybe
> even with a smaller codebase!).
> 
> If you don't think I'm completely insane and want to hear more, let me
> know :)

My impression is that you are on the right track for the cache, and that
it would be good to combine this with a file system, but that it would
be counterproductive to also want to support rotating disks or merging
the high-level FS code into what you have now. The amount of research
that has gone into these things is something you won't be able to
match without having to sacrifice the stuff that you already do well.

	Arnd

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-21  5:42         ` Pekka Enberg
@ 2011-09-21  5:57           ` Kent Overstreet
  2011-10-06 17:58             ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-21  5:57 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-bcache, linux-kernel, linux-fsdevel, rdunlap, axboe, akpm, neilb

On Wed, Sep 21, 2011 at 08:42:01AM +0300, Pekka Enberg wrote:
> On Wed, Sep 21, 2011 at 5:55 AM, Kent Overstreet
> > <kent.overstreet@gmail.com> wrote:
> >> Short version: bcache is for making IO faster.
> 
> On Wed, Sep 21, 2011 at 8:33 AM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > That's helpful...
> 
> Your documentation isn't helpful either:
> 
> +Say you've got a big slow raid 6, and an X-25E or three. Wouldn't it be
> +nice if you could use them as cache... Hence bcache.
> 
> So it's a cool hack but you fail to explain why someone wants to use
> it. You also fail to explain why you decided to implement it the way
> you did instead of making it more like fscache, for example.
> 
> Really, why do I need to go digging for this sort of information? It
> feels almost as if you don't want people to review your code...

The documentation should be better and better organized to be sure, but
I'm honestly not sure what's so strange about the concept of a cache for
block devices..

My changelog messages are certainly lousy but they aren't really the
place for a design doc, if that's what you're looking for.

As for bcache's design vs. fscache's design... well, they're so unlike
each other that I'm not sure it even makes much sense to go into it much.

Bcache caches block devices, fscache caches at the filesystem layer.
They each have uses where the other can't be used.

If you want more than that - IMO bcache's design is simpler, higher
performing, and more flexible.

bcache doesn't have to have a notion of files; it caches extents.

It can cache filesystem metadata - it can cache anything.

Because bcache has its own superblock (much like md), it can guarantee
that bcache devices are consistent; this is particularly important if
you want to do writeback caching. You really don't want to accidentally
mount a filesystem that you were doing writeback caching on without the
cache - bcache makes it impossible to do so accidentally.

Is any of that useful?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-21  5:33       ` Pekka Enberg
@ 2011-09-21  5:42         ` Pekka Enberg
  2011-09-21  5:57           ` Kent Overstreet
  0 siblings, 1 reply; 25+ messages in thread
From: Pekka Enberg @ 2011-09-21  5:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, linux-fsdevel, rdunlap, axboe, akpm, neilb

On Wed, Sep 21, 2011 at 5:55 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
>> Short version: bcache is for making IO faster.

On Wed, Sep 21, 2011 at 8:33 AM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> That's helpful...

Your documentation isn't helpful either:

+Say you've got a big slow raid 6, and an X-25E or three. Wouldn't it be
+nice if you could use them as cache... Hence bcache.

So it's a cool hack but you fail to explain why someone wants to use
it. You also fail to explain why you decided to implement it the way
you did instead of making it more like fscache, for example.

Really, why do I need to go digging for this sort of information? It
feels almost as if you don't want people to review your code...

                        Pekka

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-21  2:55     ` Kent Overstreet
@ 2011-09-21  5:33       ` Pekka Enberg
  2011-09-21  5:42         ` Pekka Enberg
  0 siblings, 1 reply; 25+ messages in thread
From: Pekka Enberg @ 2011-09-21  5:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, neilb-l3A5Bk7waGM

Hi Kent,

On Wed, Sep 21, 2011 at 5:55 AM, Kent Overstreet
<kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Well, I think you'll want the documentation for that -
> Documentation/bcache.txt is somewhat up to date and should answer that
> decently.

I hope you realize I never got that far because your patch description
is so useless. If you want people to review your code, you should make
it easy for reviewers - not yourself. But whatever.

On Wed, Sep 21, 2011 at 5:55 AM, Kent Overstreet
<kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Short version: bcache is for making IO faster.

That's helpful...

                                Pekka

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-20 15:37 ` Arnd Bergmann
@ 2011-09-21  3:44   ` Kent Overstreet
  2011-09-21  9:19     ` Arnd Bergmann
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-21  3:44 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-bcache, linux-kernel, linux-fsdevel, rdunlap, axboe, akpm, neilb

On Tue, Sep 20, 2011 at 05:37:05PM +0200, Arnd Bergmann wrote:
> On Saturday 10 September 2011, Kent Overstreet wrote:
> > Short overview:
> > Bcache does both writethrough and writeback caching. It presents itself
> > as a new block device, a bit like say md. You can cache an arbitrary
> > number of block devices with a single cache device, and attach and
> > detach things at runtime - it's quite flexible.
> > 
> > It's very fast. It uses a b+ tree for the index, along with a journal to
> > coalesce index updates, and a bunch of other cool tricks like auxiliary
> > binary search trees with software floating point keys to avoid a bunch
> > of random memory accesses when doing binary searches in the btree. It
> > does over 50k iops doing 4k random writes without breaking a sweat,
> > and would do many times that if I had faster hardware.
> > 
> > It (configurably) tracks and skips sequential IO, so as to efficiently
> > cache random IO. It's got more cool features than I can remember at this
> > point. It's resilient, handling IO errors from the SSD when possible up
> > to a configurable threshold, then detaches the cache from the backing
> > device even while you're still using it.
> 
> Hi Kent,
> 
> What kind of SSD hardware do you target here? I roughly categorize them
> into two classes, the low-end (USB, SDHC, CF, cheap ATA SSD) and the
> high-end (SAS, PCIe, NAS, expensive ATA SSD), which have extremely
> different characteristics. 

All of the above.

> I'm mainly interested in the first category, and a brief look at your
> code suggests that this is what you are indeed targetting. If that is
> true, can you name the specific hardware characteristics you require
> as a minimum? I.e. what erase block (bucket) sizes do you support
> (maximum size, non-power-of-two), how many buckets do you have
> open at the same time, and do you guarantee that each bucket is written
> in consecutive order?

Bucket size is set when you format your cache device. It is restricted
to powers of two (though the only reason for that restriction is to
avoid dividing by bucket size all over the place; if there was a
legitimate need we could easily see what the performance hit would be).

And it has to be >= PAGE_SIZE; come to think of it I don't think there's
a hard upper bound. Performance should be reasonable for bucket sizes
anywhere between 64k and around 2 mb; somewhere around 64k your btree
will have a depth of 2 and that and the increased operations on non leaf
nodes are going to hurt performance. Above around 2 mb and performance
will start to drop as btree nodes get bigger, but the hit won't be
enormous.

For data buckets, we currently keep 16 open, 8 for clean data and 8 for
dirty data. That's hard coded, but there's no reason it has to be. Btree
nodes are in normal operation mostly not full and thus could be
considered open buckets - it's always one btree node per bucket. IO to
the btree is typically < 1% of total IO, though.

Most metadata IO is to the journal; the journal uses a list of buckets
and writes to them all sequentially, so one open bucket for the journal.

The one exception is the superblock, but that doesn't get written to in
normal operation. I am eventually going to switch to using another
journal for the superblock, as part of bcache FTL.

We do guarantee that buckets are always written to sequentially (save
the superblock). If discards are on, bcache will always issue a discard
before it starts writing to a bucket again (except for the journal, that
part's unfinished).
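As a rough sketch of that write discipline (invented names and stubbed helpers, not the real allocator): an open bucket is just an append cursor, and a bucket gets a discard before it is reused:

#include <stdint.h>

static uint64_t pick_free_bucket(void) { return 0; }		/* stub for illustration */
static void issue_discard(uint64_t bucket) { (void)bucket; }	/* stub for illustration */

struct open_bucket {
	uint64_t	bucket;		/* bucket index on the cache device */
	unsigned int	sectors_used;	/* append cursor within the bucket */
};

/* Space for a write always comes from the tail of an open bucket, never a
 * random offset, so each bucket is filled strictly sequentially.  When a
 * bucket fills up, a fresh one is discarded and becomes the append target. */
static uint64_t bucket_alloc(struct open_bucket *b, unsigned int sectors,
			     unsigned int bucket_sectors)
{
	uint64_t sector;

	if (b->sectors_used + sectors > bucket_sectors) {
		b->bucket = pick_free_bucket();
		issue_discard(b->bucket);
		b->sectors_used = 0;
	}

	sector = b->bucket * bucket_sectors + b->sectors_used;
	b->sectors_used += sectors;
	return sector;
}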

> On a different note, we had discussed at the last storage/fs summit about
> using an SSD cache either without a backing store or having the backing
> store on the same drive as the cache in order to optimize traditional
> file system on low-end flash media. Have you considered these scenarios?
> How hard would it be to support this in a meaningful way? My hope is that
> by sacrificing some 10% of the drive size, you would get significantly
> improved performance because you can avoid many internal GC cycles within
> the drive.

Yeah, so... what you really want there is to move the FTL into the
kernel, so you can have an FTL that doesn't suck. Bcache is already
about 90% of the way to being a full blown high performance FTL...

Besides the metadata stuff that I sort of covered above, the other thing
that'd have to be done to use bcache as an FTL and not a cache is a
moving garbage collector - so when a bucket is mostly empty but has data
we need to keep, we can move it somewhere else. But
this is pretty similar to what background writeback does now, so it'll
be easy and straightforward.
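
A rough sketch of what such a moving garbage collector could look like -
the helpers and the one-quarter threshold are made up, not bcache's:

#include <stddef.h>
#include <stdint.h>

struct bucket_info {
        uint64_t bucket;
        uint32_t live_sectors;  /* sectors still referenced by the index */
};

/* placeholder: rewrite the still-live data into a fresh bucket */
static void copy_live_data_elsewhere(struct bucket_info *b) { (void)b; }
/* placeholder: hand the now-empty bucket back to the allocator */
static void mark_bucket_reusable(struct bucket_info *b) { (void)b; }

static void moving_gc_pass(struct bucket_info *buckets, size_t nr,
                           uint32_t bucket_sectors)
{
        for (size_t i = 0; i < nr; i++) {
                if (buckets[i].live_sectors >= bucket_sectors / 4)
                        continue;

                /* mostly empty: move what's live, then the whole bucket
                 * can be erased and rewritten sequentially */
                copy_live_data_elsewhere(&buckets[i]);
                mark_bucket_reusable(&buckets[i]);
        }
}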

So yeah, it can be done :)

Further off, what I really want to do is extend bcache somewhat to turn
it into the bottom half of a filesystem...

It sounds kind of crazy at first, but - well, bcache already has an
index, allocation and garbage collection. Right now the index is
device:sector -> cache device:phys sector. If we just switch to
inode:sector -> ... we can map files in the filesystem with the exact
same index we're using for the cache. Not just the same code, the same
index.
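
Roughly, the key change looks like this - field names are illustrative,
not bcache's on-disk format:

#include <stdint.h>

/* Today: the index maps (cached device, sector)... */
struct cache_key {
        uint32_t device;        /* which backing device */
        uint64_t sector;        /* offset on that device */
};

/* ...to a physical location on the cache device. */
struct cache_ptr {
        uint32_t cache_dev;
        uint64_t phys_sector;
};

/* As the bottom half of a filesystem, the same btree would be keyed by
 * inode instead - e.g. with inode 0 holding the inode file itself. */
struct fs_key {
        uint64_t inode;
        uint64_t sector;        /* offset within the file */
};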

Then the rough plan is that the layer above bcache - the filesystem proper -
will store all the inodes in a file (say file 0); then when bcache is
doing garbage collection it has to be able to ask the fs "How big is
file n supposed to be?". It gives us a very nice separation between
layers.

There's a ton of other details - bcache then needs to handle allocation
for rotating disks and not just SSDs, and you want to do that somewhat
differently as fragmentation matters. But the idea seems to have legs.

Also, bcache is gaining some nifty functionality that'd be nice to have
available in the filesystem proper - we're working on full data
checksumming right now, in particular. We might be able to pull off all
the features of ZFS and then some, and beat ZFS on performance (maybe
even with a smaller codebase!).

If you don't think I'm completely insane and want to hear more, let me
know :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
       [not found]   ` <CAOJsxLFPODubVEB3Tjg54C7jDKM8H-RCM_u5kvO1D0kKyjUYXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-09-21  2:55     ` Kent Overstreet
  2011-09-21  5:33       ` Pekka Enberg
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-21  2:55 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, neilb-l3A5Bk7waGM

On Mon, Sep 19, 2011 at 10:28:03AM +0300, Pekka Enberg wrote:
> Hi Kent,
> 
> On Sat, Sep 10, 2011 at 9:45 AM, Kent Overstreet
> <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > bcache, n.: a cache for arbitrary block devices that uses an SSD
> >
> > It's probably past time I started poking people to see about getting
> > this stuff in. It's synced up with mainline, the documentation is for
> > once relatively up to date, and it looks just about production ready.
> >
> > Suggestions are more than welcome on how to make it easier to review -
> > it's entirely too much code, I know (near 10k lines now). I'll be
> > emailing the patches that touch other parts of the kernel separately.
> >
> > Short overview:
> > Bcache does both writethrough and writeback caching. It presents itself
> > as a new block device, a bit like say md. You can cache an arbitrary
> > number of block devices with a single cache device, and attach and
> > detach things at runtime - it's quite flexible.
> 
> The changelog is pretty useless for people like myself who have never
> heard of bcache before because it lacks any explanation why bcache is
> useful (i.e. a real-world use case).
> 
>                         Pekka

Well, I think you'll want the documentation for that -
Documentation/bcache.txt is somewhat up to date and should answer that
decently.

Short version: bcache is for making IO faster.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-19  7:16                   ` NeilBrown
@ 2011-09-21  2:54                     ` Kent Overstreet
  2011-09-29 23:38                       ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-21  2:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Williams, Andreas Dilger, linux-bcache, linux-kernel,
	linux-fsdevel, rdunlap, axboe, akpm

Sorry for the delayed response, 

On Mon, Sep 19, 2011 at 05:16:06PM +1000, NeilBrown wrote:
> On Thu, 15 Sep 2011 14:33:36 -0700 Kent Overstreet
> > Damn, nope. I still think a module parameter is even uglier than a
> > sysfs file, though.
> 
> Beauty is in the eye of the beholder I guess.

I certainly don't find either beautiful for this purpose :p

> 
> > 
> > As far as I can tell, the linux kernel is really lacking any sort of
> > coherent vision for how to make arbitrary interfaces available from
> > the filesystem.
> 
> Cannot disagree with that.  Coherent vision isn't something that the kernel
> community really values.
> 
> I think the best approach is always to find out how someone else already
> achieved a similar goal.  Then either:
>  1/ copy that
>  2/ make a convincing argument why it is bad, and produce a better
>     implementation which meets your needs and theirs.
> 
> i.e. perfect is not an option, better is good when convincing, but not-worse
> is always acceptable.

Yeah, if I knew of anything else that I felt was even acceptable, we
could at least get consistency.

> 
> 
> > 
> > We all seem to agree that it's a worthwhile thing to do - nobody likes
> > ioctls, /proc/sys has been around for ages; something visible and
> > discoverable beats an ioctl or a weird special purpose system call any
> > day.
> > 
> > But until people can agree on - hell, even come up with a decent plan
> > - for the right way to put interfaces in the filesystem, I'm not going
> > to lose much sleep over it.
> > 
> > >> I looked into that many months ago, spent quite a bit of time fighting
> > >> with the dm code trying to get it to do what I wanted and... no. Never
> > >> again
> > >
> > > Did you do a similar analysis of md?  I had a pet caching project that
> > > had it's own sysfs interface registration system, and came to the
> > > conclusion that it would have been better to have started with an MD
> > > personality.  Especially when one of the legs of the cache is a
> > > md-raid array it helps to keep all that assembly logic using the same
> > > interface.
> > 
> > I did spend some time looking at md, I don't really remember if I gave
> > it a fair chance or if I found a critical flaw.
> > 
> > I agree that an md personality ought to be a good fit but I don't
> > think the current md code is ideal for what bcache wants to do. Much
> > saner than dm, but I think it still suffers from the assumption that
> > there's some easy mapping from superblocks to block devices, with
> > bcache they really can't be tied together.
> 
> I don't understand what you mean there, even after reading bcache.txt.
> 
> Does not each block device have a unique superblock (created by make-bcache)
> on it?  That should define a clear 1-to-1 mapping....

There is (for now) a 1:1 mapping of backing devices to block devices.
Cache devices have basically the same superblock as backing devices
though, and some of the registration code is shared, but cache devices
don't correspond to any block devices.

> It isn't clear from the documentation what a 'cache set' is.  I think it is a
> set of related cache devices.  But how do they relate to backing devices?
> Is it one backing device per cache set?  Or can it be several backing devices
> are all cached by one cache-set??

Many backing devices per cache set, yes.

A cache set is a set of cache devices - i.e. SSDs. The primary
motivation for cache sets (as distinct from just caches) is to have
the ability to mirror only dirty data, and not clean data.

i.e. if you're doing writeback caching of a raid6, your ssd is now a
single point of failure. You could use raid1 SSDs, but most of the data
in the cache is clean, so you don't need to mirror that... just the
dirty data.

Multiple cache device support isn't quite finished yet (there's not a
lot of work to do, just lots of higher priorities). It looks like it's
also going to be a useful abstraction for bcache FTL - we can treat
multiple channels of an SSD as different devices for allocation
purposes, we just won't expose it to the user in that case.

> In any case it certainly could be modelled in md - and if the modelling were
> not elegant (e.g. even device numbers for backing devices, odd device numbers
> for cache devices) we could "fix" md to make it more elegant.

But we've no reason to create block devices for caches or have a 1:1
mapping - that'd be a serious step backwards in functionality.

> (Not that I'm necessarily advocating an md interface, but if I can understand
> why you don't think md can work, then I might understand bcache better ....
> or you might get to understand md better).

And I still would like to have some generic infrastructure, if only I
had the time to work on such things :)

The way I see it md is more or less conflating two different things -
things that consume block devices 

> 
> 
> Do you have any benchmark numbers showing how wonderful this feature is in
> practice?  Preferably some artificial workloads that show fantastic
> improvement, some that show the worst result you can, and something that is
> actually realistic (best case, worst case, real case).  Graphs are nice.

Well, I went to rerun my favorite benchmark the other day - 4k O_DIRECT
random writes with fio - and discovered a new performance bug
(something weird is going on in allocation leading to huge CPU
utilization). Maybe by next week I'll be able to post some real numbers...

Prior to that bug turning up though - on that benchmark with an SSD
using a Sandforce controller (consumer grade MLC), I was consistently
getting 35k iops. It definitely can go a lot faster on faster hardware,
but those are just the numbers I'm familiar with. Latency is also good
though I couldn't tell you how good offhand; throughput was topping out
with 32 IOs in flight or a bit less.

Basically, if 4k random writes are fast across the board performance is
at least going to be pretty good, because writes don't get completed
until the cache's index is updated and the index update is written to
disk - if the index performance is weak it'll be the bottleneck. But on
that benchmark we're bottlenecked by the SSD (the numbers are similar to
running the same benchmark on the raw SSD).
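
Spelled out, the ordering that implies for a cached write looks something
like the sketch below - the struct and helper names are hypothetical, not
bcache's functions:

struct cached_write;    /* stand-in for the per-request state */

/* placeholder: put the data into an open bucket on the SSD */
static void write_data_to_open_bucket(struct cached_write *w) { (void)w; }
/* placeholder: insert the new keys into the btree */
static void insert_keys(struct cached_write *w) { (void)w; }
/* placeholder: get the index update onto disk via the journal */
static void journal_keys(struct cached_write *w) { (void)w; }
/* placeholder: complete the original bio back to the submitter */
static void complete_original_bio(struct cached_write *w) { (void)w; }

static void cached_write_path(struct cached_write *w)
{
        write_data_to_open_bucket(w);
        insert_keys(w);
        journal_keys(w);
        complete_original_bio(w);  /* only after the index update is durable */
}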

So the basic story is - bcache is pretty close in performance to either
the raw SSD or raw disk, depending on where the data is for reads and
writethrough vs. writeback caching for writes.

> ... I just checked http://bcache.evilpiepirate.org/ and there is one graph
> there which does seem nice, but it doesn't tell me much (I don't know what a
> Corsair Nova is).  And while bonnie certainly has some value, it mainly shows
> you how fast bonnie can run.  Reporting the file size used and splitting out
> the sequential and random, read and write speeds would help a lot.

Heh, those numbers are over a year old anyways. I really, really need to
update the wiki. When I do post new numbers it'll be well documented
fio benchmarks.

> Also I don't think the code belongs in /block.  The CRC64 code should go
> in /lib and the rest should either be in /drivers/block or
> possibly /drivers/md (as it makes a single device out of 'multiple devices').
> Obviously that isn't urgent, but should be fixed before it can be considered
> to be ready.

Yeah, moving it into drivers/block/bcache/ and splitting it up into
different files is on the todo list (for some reason, one of the other
guys working on bcache thinks a 9k line .c file is excessive :)

Pulling code out of bcache_util.[ch] and sending them as separate
patches is also on the todo list - certainly the crc code and the rb
tree code.

> Is there some documentation on the format of the cache and the cache
> replacement policy?  I couldn't easily find anything on your wiki.
> Having that would make it much easier to review the code and to understand
> pessimal workloads.

Format of the cache - not sure what you mean, on disk format?

Cache replacement policy is currently straight LRU. Someone else is
supposed to start looking at more intelligent cache replacement policy
soon, though I tend to think with most workloads and skipping sequential
IO LRU is actually going to do pretty well.

> Thanks,
> NeilBrown

Thanks for your time! I'll have new code and benchmarks up just as soon
as I can - it really has been busy lately. Are there any benchmarks in
particular you'd be interested in?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-10  6:45 Kent Overstreet
                   ` (2 preceding siblings ...)
  2011-09-19  7:28 ` Pekka Enberg
@ 2011-09-20 15:37 ` Arnd Bergmann
  2011-09-21  3:44   ` Kent Overstreet
  3 siblings, 1 reply; 25+ messages in thread
From: Arnd Bergmann @ 2011-09-20 15:37 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, linux-fsdevel, rdunlap, axboe, akpm, neilb

On Saturday 10 September 2011, Kent Overstreet wrote:
> Short overview:
> Bcache does both writethrough and writeback caching. It presents itself
> as a new block device, a bit like say md. You can cache an arbitrary
> number of block devices with a single cache device, and attach and
> detach things at runtime - it's quite flexible.
> 
> It's very fast. It uses a b+ tree for the index, along with a journal to
> coalesce index updates, and a bunch of other cool tricks like auxiliary
> binary search trees with software floating point keys to avoid a bunch
> of random memory accesses when doing binary searches in the btree. It
> does over 50k iops doing 4k random writes without breaking a sweat,
> and would do many times that if I had faster hardware.
> 
> It (configurably) tracks and skips sequential IO, so as to efficiently
> cache random IO. It's got more cool features than I can remember at this
> point. It's resilient, handling IO errors from the SSD when possible up
> to a configurable threshold, then detaching the cache from the backing
> device even while you're still using it.

Hi Kent,

What kind of SSD hardware do you target here? I roughly categorize them
into two classes, the low-end (USB, SDHC, CF, cheap ATA SSD) and the
high-end (SAS, PCIe, NAS, expensive ATA SSD), which have extremely
different characteristics. 

I'm mainly interested in the first category, and a brief look at your
code suggests that this is what you are indeed targeting. If that is
true, can you name the specific hardware characteristics you require
as a minimum? I.e. what erase block (bucket) sizes do you support
(maximum size, non-power-of-two), how many buckets do you have
open at the same time, and do you guarantee that each bucket is written
in consecutive order?

On a different note, we had discussed at the last storage/fs summit about
using an SSD cache either without a backing store or having the backing
store on the same drive as the cache in order to optimize traditional
file system on low-end flash media. Have you considered these scenarios?
How hard would it be to support this in a meaningful way? My hope is that
by sacrificing some 10% of the drive size, you would get significantly
improved performance because you can avoid many internal GC cycles within
the drive.

	Arnd

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-10  6:45 Kent Overstreet
  2011-09-11  6:18 ` NeilBrown
  2011-09-15 22:03 ` Dan Williams
@ 2011-09-19  7:28 ` Pekka Enberg
       [not found]   ` <CAOJsxLFPODubVEB3Tjg54C7jDKM8H-RCM_u5kvO1D0kKyjUYXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-09-20 15:37 ` Arnd Bergmann
  3 siblings, 1 reply; 25+ messages in thread
From: Pekka Enberg @ 2011-09-19  7:28 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, neilb-l3A5Bk7waGM

Hi Kent,

On Sat, Sep 10, 2011 at 9:45 AM, Kent Overstreet
<kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> bcache, n.: a cache for arbitrary block devices that uses an SSD
>
> It's probably past time I started poking people to see about getting
> this stuff in. It's synced up with mainline, the documentation is for
> once relatively up to date, and it looks just about production ready.
>
> Suggestions are more than welcome on how to make it easier to review -
> it's entirely too much code, I know (near 10k lines now). I'll be
> emailing the patches that touch other parts of the kernel separately.
>
> Short overview:
> Bcache does both writethrough and writeback caching. It presents itself
> as a new block device, a bit like say md. You can cache an arbitrary
> number of block devices with a single cache device, and attach and
> detach things at runtime - it's quite flexible.

The changelog is pretty useless for people like myself who have never
heard of bcache before because it lacks any explanation why bcache is
useful (i.e. a real-world use case).

                        Pekka

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
       [not found]                 ` <CAC7rs0t_J+foaLZSuuw5BhpUAYfr-KY1iegFOxEBPCpbrkk1Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-09-19  7:16                   ` NeilBrown
  2011-09-21  2:54                     ` Kent Overstreet
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2011-09-19  7:16 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dan Williams, Andreas Dilger,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

[-- Attachment #1: Type: text/plain, Size: 5803 bytes --]

On Thu, 15 Sep 2011 14:33:36 -0700 Kent Overstreet
<kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Thu, Sep 15, 2011 at 2:15 PM, Dan Williams <dan.j.williams-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Sun, Sep 11, 2011 at 6:44 PM, Kent Overstreet
> > <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> On Sun, Sep 11, 2011 at 07:35:56PM -0600, Andreas Dilger wrote:
> >>> On 2011-09-11, at 1:23 PM, Kent Overstreet <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> > I don't think that makes any more sense, as module parameters AFAIK are
> >>> > even more explicitly just a value you can stick in and pull out.
> >>> > /sys/fs/bcache/register is really more analogous to mount().
> >
> > ... and you looked at module_param_call()?
> 
> Damn, nope. I still think a module parameter is even uglier than a
> sysfs file, though.

Beauty is in the eye of the beholder I guess.

> 
> As far as I can tell, the linux kernel is really lacking any sort of
> coherent vision for how to make arbitrary interfaces available from
> the filesystem.

Cannot disagree with that.  Coherent vision isn't something that the kernel
community really values.

I think the best approach is always to find out how someone else already
achieved a similar goal.  Then either:
 1/ copy that
 2/ make a convincing argument why it is bad, and produce a better
    implementation which meets your needs and theirs.

i.e. perfect is not an option, better is good when convincing, but not-worse
is always acceptable.


> 
> We all seem to agree that it's a worthwhile thing to do - nobody likes
> ioctls, /proc/sys has been around for ages; something visible and
> discoverable beats an ioctl or a weird special purpose system call any
> day.
> 
> But until people can agree on - hell, even come up with a decent plan
> - for the right way to put interfaces in the filesystem, I'm not going
> to lose much sleep over it.
> 
> >> I looked into that many months ago, spent quite a bit of time fighting
> >> with the dm code trying to get it to do what I wanted and... no. Never
> >> again
> >
> > Did you do a similar analysis of md?  I had a pet caching project that
> > had its own sysfs interface registration system, and came to the
> > conclusion that it would have been better to have started with an MD
> > personality.  Especially when one of the legs of the cache is a
> > md-raid array it helps to keep all that assembly logic using the same
> > interface.
> 
> I did spend some time looking at md, I don't really remember if I gave
> it a fair chance or if I found a critical flaw.
> 
> I agree that an md personality ought to be a good fit but I don't
> think the current md code is ideal for what bcache wants to do. Much
> saner than dm, but I think it still suffers from the assumption that
> there's some easy mapping from superblocks to block devices, with
> bcache they really can't be tied together.

I don't understand what you mean there, even after reading bcache.txt.

Does not each block device have a unique superblock (created by make-bcache)
on it?  That should define a clear 1-to-1 mapping....

It isn't clear from the documentation what a 'cache set' is.  I think it is a
set of related cache devices.  But how do they relate to backing devices?
Is it one backing device per cache set?  Or can it be several backing devices
are all cached by one cache-set??
In any case it certainly could be modelled in md - and if the modelling were
not elegant (e.g. even device numbers for backing devices, odd device numbers
for cache devices) we could "fix" md to make it more elegant.

(Not that I'm necessarily advocating an md interface, but if I can understand
why you don't think md can work, then I might understand bcache better ....
or you might get to understand md better).


Do you have any benchmark numbers showing how wonderful this feature is in
practice?  Preferably some artificial workloads that show fantastic
improvement, some that show the worst result you can, and something that is
actually realistic (best case, worst case, real case).  Graphs are nice.

... I just checked http://bcache.evilpiepirate.org/ and there is one graph
there which does seem nice, but it doesn't tell me much (I don't know what a
Corsair Nova is).  And while bonnie certainly has some value, it mainly shows
you how fast bonnie can run.  Reporting the file size used and splitting out
the sequential and random, read and write speeds would help a lot.

Also I don't think the code belongs in /block.  The CRC64 code should go
in /lib and the rest should either be in /drivers/block or
possibly /drivers/md (as it makes a single device out of 'multiple devices').
Obviously that isn't urgent, but should be fixed before it can be considered
to be ready.

Is there some documentation on the format of the cache and the cache
replacement policy?  I couldn't easily find anything on your wiki.
Having that would make it much easier to review the code and to understand
pessimal workloads.


Thanks,
NeilBrown



> 
> > And md supports assembling devices via sysfs without
> > requiring mdadm which is a nice feature.
> 
> Didn't know that, I'll have to look at that. If nothing else
> consistency is good...
> 
> > Also has the benefit of reusing the distro installation / boot
> > enabling for md devices which turned out to be a bit of work when
> > enabling external-metadata in md.
> 
> Dunno what you mean about external metadata, but it would be nice to
> not have to do anything to userspace to boot from a bcache device. As
> is though it's only a couple lines of bash you have to drop in your
> initramfs.


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-15 22:03 ` Dan Williams
@ 2011-09-15 22:07   ` Kent Overstreet
  0 siblings, 0 replies; 25+ messages in thread
From: Kent Overstreet @ 2011-09-15 22:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-bcache, linux-kernel, linux-fsdevel, rdunlap, axboe, akpm, neilb

On Thu, Sep 15, 2011 at 3:03 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> Does it consider the raid5/6 write hole in what it caches?  Guess I
> need to take a look at the code, but just wondering if it considers
> the need to maintain a consistent strip when writing back to raid5/6
> array, or would there still be a need for a separate driver/region of
> the SSD for caching that data.

Do you mean - if you're caching a raid5 (not the individual devices,
the entire array) the parity blocks?

In that case no, bcache will never see them. However, if you're doing
writeback caching that won't be a huge problem since you'll end up
with more full stripe writes.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-10  6:45 Kent Overstreet
  2011-09-11  6:18 ` NeilBrown
@ 2011-09-15 22:03 ` Dan Williams
  2011-09-15 22:07   ` Kent Overstreet
  2011-09-19  7:28 ` Pekka Enberg
  2011-09-20 15:37 ` Arnd Bergmann
  3 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2011-09-15 22:03 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, neilb-l3A5Bk7waGM

On Fri, Sep 9, 2011 at 11:45 PM, Kent Overstreet
<kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> bcache, n.: a cache for arbitrary block devices that uses an SSD
>
> It's probably past time I started poking people to see about getting
> this stuff in. It's synced up with mainline, the documentation is for
> once relatively up to date, and it looks just about production ready.
>
> Suggestions are more than welcome on how to make it easier to review -
> it's entirely too much code, I know (near 10k lines now). I'll be
> emailing the patches that touch other parts of the kernel separately.
>
> Short overview:
> Bcache does both writethrough and writeback caching. It presents itself
> as a new block device, a bit like say md. You can cache an arbitrary
> number of block devices with a single cache device, and attach and
> detach things at runtime - it's quite flexible.
>
> It's very fast. It uses a b+ tree for the index, along with a journal to
> coalesce index updates, and a bunch of other cool tricks like auxiliary
> binary search trees with software floating point keys to avoid a bunch
> of random memory accesses when doing binary searches in the btree. It
> does over 50k iops doing 4k random /writes/ without breaking a sweat,
> and would do many times that if I had faster hardware.

Does it consider the raid5/6 write hole in what it caches?  Guess I
need to take a look at the code, but just wondering if it considers
the need to maintain a consistent strip when writing back to raid5/6
array, or would there still be a need for a separate driver/region of
the SSD for caching that data.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
       [not found]             ` <CAA9_cmeqevWoK=9WMD9c+csc8SbaYq0aK9j1qWr_0FEa6jWZEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-09-15 21:33               ` Kent Overstreet
       [not found]                 ` <CAC7rs0t_J+foaLZSuuw5BhpUAYfr-KY1iegFOxEBPCpbrkk1Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-15 21:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andreas Dilger, NeilBrown, linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 15, 2011 at 2:15 PM, Dan Williams <dan.j.williams-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Sun, Sep 11, 2011 at 6:44 PM, Kent Overstreet
> <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Sun, Sep 11, 2011 at 07:35:56PM -0600, Andreas Dilger wrote:
>>> On 2011-09-11, at 1:23 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
>>> > I don't think that makes any more sense, as module parameters AFAIK are
>>> > even more explicitly just a value you can stick in and pull out.
>>> > /sys/fs/bcache/register is really more analogous to mount().
>
> ... and you looked at module_param_call()?

Damn, nope. I still think a module parameter is even uglier than a
sysfs file, though.

As far as I can tell, the linux kernel is really lacking any sort of
coherent vision for how to make arbitrary interfaces available from
the filesystem.

We all seem to agree that it's a worthwhile thing to do - nobody likes
ioctls, /proc/sys has been around for ages; something visible and
discoverable beats an ioctl or a weird special purpose system call any
day.

But until people can agree on - hell, even come up with a decent plan
- for the right way to put interfaces in the filesystem, I'm not going
to lose much sleep over it.

>> I looked into that many months ago, spent quite a bit of time fighting
>> with the dm code trying to get it to do what I wanted and... no. Never
>> again
>
> Did you do a similar analysis of md?  I had a pet caching project that
> had its own sysfs interface registration system, and came to the
> conclusion that it would have been better to have started with an MD
> personality.  Especially when one of the legs of the cache is a
> md-raid array it helps to keep all that assembly logic using the same
> interface.

I did spend some time looking at md, I don't really remember if I gave
it a fair chance or if I found a critical flaw.

I agree that an md personality ought to be a good fit but I don't
think the current md code is ideal for what bcache wants to do. Much
saner than dm, but I think it still suffers from the assumption that
there's some easy mapping from superblocks to block devices, with
bcache they really can't be tied together.

> And md supports assembling devices via sysfs without
> requiring mdadm which is a nice feature.

Didn't know that, I'll have to look at that. If nothing else
consistency is good...

> Also has the benefit of reusing the distro installation / boot
> enabling for md devices which turned out to be a bit of work when
> enabling external-metadata in md.

Dunno what you mean about external metadata, but it would be nice to
not have to do anything to userspace to boot from a bcache device. As
is though it's only a couple lines of bash you have to drop in your
initramfs.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-12  1:44         ` Kent Overstreet
@ 2011-09-15 21:15           ` Dan Williams
       [not found]             ` <CAA9_cmeqevWoK=9WMD9c+csc8SbaYq0aK9j1qWr_0FEa6jWZEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2011-09-15 21:15 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andreas Dilger, NeilBrown, linux-bcache, linux-kernel,
	linux-fsdevel, rdunlap, axboe, akpm

On Sun, Sep 11, 2011 at 6:44 PM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Sun, Sep 11, 2011 at 07:35:56PM -0600, Andreas Dilger wrote:
>> On 2011-09-11, at 1:23 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
>> > I don't think that makes any more sense, as module parameters AFAIK are
>> > even more explicitly just a value you can stick in and pull out.
>> > /sys/fs/bcache/register is really more analogous to mount().

... and you looked at module_param_call()?

>> > You're not the first person to complain about that, I moved it to
>> > configfs for awhile at Greg K-H's behest... but when I added cache sets
>> > I had to move it back to sysfs.
>> >
> >> >> Alternately you could devise a new 'bus' type for bcache and do some sort of
>> >> device-model magic to attach something as a new device of that type.
>> >
>> > I like that, I think that could make a lot of sense.
>> >
>> > I'm not sure what to do about register though, I do prefer to have it a
>> > file you can echo to but it doesn't really fit anywhere.
>>
>> Rather than using /proc or /sys to configure bcache, why not integrate it with device mapper, and use dmsetup to configure it?  That avoids adding yet another block device abstraction into the kernel, and yet one more obscure way of configuring things.
>>
>> A bcache device could be considered almost like an LV snapshot, where writes go to the SSD device instead of a disk, and they can have writeback or writethrough cache.
>
> I looked into that many months ago, spent quite a bit of time fighting
> with the dm code trying to get it to do what I wanted and... no. Never
> again

Did you do a similar analysis of md?  I had a pet caching project that
had its own sysfs interface registration system, and came to the
conclusion that it would have been better to have started with an MD
personality.  Especially when one of the legs of the cache is a
md-raid array it helps to keep all that assembly logic using the same
interface.  And md supports assembling devices via sysfs without
requiring mdadm which is a nice feature.

Also has the benefit of reusing the distro installation / boot
enabling for md devices which turned out to be a bit of work when
enabling external-metadata in md.

--
Dan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
       [not found]       ` <FD294A0B-7127-4ED1-89B8-3E3ADA796360-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org>
@ 2011-09-12  1:44         ` Kent Overstreet
  2011-09-15 21:15           ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-12  1:44 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: NeilBrown, linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, Sep 11, 2011 at 07:35:56PM -0600, Andreas Dilger wrote:
> On 2011-09-11, at 1:23 PM, Kent Overstreet <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Sun, Sep 11, 2011 at 08:18:54AM +0200, NeilBrown wrote:
> >> 
> >> Looking at bcache.txt....
> >> 
> >> To make bcache devices known to the kernel, echo them to /sys/fs/bcache/register
> >>  echo /dev/sdb > /sys/fs/bcache/register
> >>  echo /dev/sdc > /sys/fs/bcache/register
> >> 
> >> ???
> >> I know that /sys is heading the way of /proc and becoming a disorganised ad
> >> hoc mess, but we don't need to actively encourage that.
> >> So when you are creating a new block device type, putting controls
> >> under /sys/fs (where I believe 'fs' stands for "file system") seems ill
> >> advised.
> >> 
> >> My personal preference would be to see this as configuring the module and use
> >>  /sys/modules/bcache/parameters/register
> > 
> > I don't think that makes any more sense, as module parameters AFAIK are
> > even more explicitly just a value you can stick in and pull out.
> > /sys/fs/bcache/register is really more analogous to mount().
> > 
> > You're not the first person to complain about that, I moved it to
> > configfs for awhile at Greg K-H's behest... but when I added cache sets
> > I had to move it back to sysfs.
> > 
> >> Alternately you could devise a new 'bus' type for bcache and do some sort of
> >> device-model magic to attach something as a new device of that type.
> > 
> > I like that, I think that could make a lot of sense.
> > 
> > I'm not sure what to do about register though, I do prefer to have it a
> > file you can echo to but it doesn't really fit anywhere.
> 
> Rather than using /proc or /sys to configure bcache, why not integrate it with device mapper, and use dmsetup to configure it?  That avoids adding yet another block device abstraction into the kernel, and yet one more obscure way of configuring things. 
> 
> A bcache device could be considered almost like an LV snapshot, where writes go to the SSD device instead of a disk, and they can have writeback or writethrough cache. 

I looked into that many months ago, spent quite a bit of time fighting
with the dm code trying to get it to do what I wanted and... no. Never
again. It's worse than the cgroups code, and that's saying something.

It'd be great to have some uniformity, but you can't pay me enough to
touch that code again; IMO it's horribly misdesigned and probably a lost
cause.

Anyways, the code to create a new block device in bcache is trivial -
using dm certainly wouldn't make bcache any simpler (quite the
opposite). Supporting a standard interface would also be easy provided
it was a sane one.
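
For what it's worth, "creating a new block device" in that era's block
layer amounts to roughly the boilerplate below - the names are
illustrative and error handling is omitted, so treat it as the generic
pattern rather than bcache's code:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <linux/module.h>

static int example_make_request(struct request_queue *q, struct bio *bio)
{
        /* placeholder: the real driver would remap or service the bio */
        bio_endio(bio, -EIO);
        return 0;
}

static const struct block_device_operations example_fops = {
        .owner  = THIS_MODULE,
};

static int create_cached_device(int major, int minor, sector_t sectors)
{
        struct request_queue *q = blk_alloc_queue(GFP_KERNEL);
        struct gendisk *disk = alloc_disk(1);

        blk_queue_make_request(q, example_make_request);

        disk->major       = major;
        disk->first_minor = minor;
        disk->fops        = &example_fops;
        disk->queue       = q;
        set_capacity(disk, sectors);
        snprintf(disk->disk_name, sizeof(disk->disk_name), "bcache%d", minor);

        add_disk(disk);
        return 0;
}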

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-11  6:18 ` NeilBrown
@ 2011-09-11 19:23   ` Kent Overstreet
       [not found]     ` <FD294A0B-7127-4ED1-89B8-3E3ADA796360@dilger.ca>
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2011-09-11 19:23 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-bcache, linux-kernel, linux-fsdevel, rdunlap, axboe, akpm

On Sun, Sep 11, 2011 at 08:18:54AM +0200, NeilBrown wrote:
> On Fri, 9 Sep 2011 23:45:31 -0700 Kent Overstreet <kent.overstreet@gmail.com>
> > The code is up at
> > git://evilpiepirate.org/~kent/linux-bcache.git
> 
> In particular it is in the bcache-3.1 branch I assume.
> The  HEAD branch is old 2.6.34 code.

Yeah. I've still got to work off of 2.6.34, alas.

> > git://evilpiepirate.org/~kent/bcache-tools.git
> > 
> > The wiki is woefully out of date, but that might change one day:
> > http://bcache.evilpiepirate.org
> > 
> > The most up to date documentation is in the kernel tree -
> > Documentation/bcache.txt
> > 
> >  Documentation/ABI/testing/sysfs-block-bcache |  156 +
> >  Documentation/bcache.txt                     |  265 +
> >  block/Kconfig                                |   36 +
> >  block/Makefile                               |    4 +
> >  block/bcache.c                               | 8479 ++++++++++++++++++++++++++
> >  block/bcache_util.c                          |  661 ++
> >  block/bcache_util.h                          |  555 ++
> >  fs/bio.c                                     |    9 +-
> 
> Any change that a new driver needs to existing code must raise a big question
> mark.
> This change in bio.c looks like a bit of a hack.

It certainly is, but IMO it's justifiable as it improves the rest of the
code.

> Could you just provide a
> 'front_pad' to bioset_create to give you space in each bio to store the
> bio pool that the bio was allocated from.  See use of
> mddev_bio_destructor in drivers/md/md.c for an example.

I could, but it gets ugly - the inner details of bio allocation pretty
much have to spill out into the users of the code.

The reason this is more of an issue for me is that I can't always
allocate from biosets - if the allocation happens out of
generic_make_request() that could deadlock - so it's got to use
bio_kmalloc() and punt to a workqueue if that allocation fails.

So it's not the prettiest thing in the world but it does provide some
useful generic functionality that could simplify code in other parts of
the kernel too.
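
Sketched with hypothetical helper names (not the actual bcache code), the
allocate-or-punt pattern described above looks something like:

#include <linux/bio.h>
#include <linux/workqueue.h>

struct pending_io {
        struct work_struct work;
        /* ...enough state to rebuild and resubmit the request later... */
};

static void retry_io(struct work_struct *w)
{
        /* placeholder: runs in process context, where it's safe to block
         * in the allocator and then resubmit the IO */
        (void)w;
}

static struct bio *alloc_bio_or_punt(int nr_vecs, struct pending_io *io)
{
        struct bio *bio = bio_kmalloc(GFP_NOWAIT, nr_vecs);

        if (!bio) {
                INIT_WORK(&io->work, retry_io);
                schedule_work(&io->work);
        }
        return bio;     /* NULL means "punted, will be retried later" */
}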

> >  include/linux/blk_types.h                    |    2 +
> >  include/linux/sched.h                        |    4 +
> 
> Could we have a few words justifying the new fields in task_struct?

Yeah, they're for maintaining a rolling average of the sequential IO
sizes each task has been doing.

So if you want sequential IOs greater than 4 mb to skip the cache and
you start copying a bunch of large files, after the first couple of
files bcache can just start skipping every new file instead of caching
the first 4 mb of each (since the individual bios will never be that
big).

Could be that they belong in struct io_context or somewhere else, I was
pointed towards struct io_context fairly recently but still haven't
gotten around to looking at it in detail.
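
As a rough userspace illustration of the heuristic described above - the
names, the cutoff handling and the exact averaging are all made up, not
what the patch actually does:

#include <stdbool.h>
#include <stdint.h>

struct task_io_state {
        uint64_t last_sector;   /* where this task's previous IO ended */
        uint64_t sequential_io; /* rolling estimate of sequential bytes */
};

static bool should_bypass_cache(struct task_io_state *t,
                                uint64_t sector, uint64_t nr_sectors,
                                uint64_t cutoff_bytes)
{
        if (sector == t->last_sector)   /* continues the previous IO */
                t->sequential_io += nr_sectors * 512;
        else
                t->sequential_io /= 2;  /* decay when the pattern breaks */

        t->last_sector = sector + nr_sectors;
        return t->sequential_io >= cutoff_bytes;
}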

> In general your commit logs are much, much too brief (virtually non-existent).
> It is much easier to review code if you also tell us what the purpose is :-)

Yeah, comments have never been my strong point. I'll work on filling
those out :)
> 
> 
> >  include/trace/events/bcache.h                |   53 +
> >  kernel/fork.c                                |    3 +
> 
> Does this code even compile?
> fork.c now has
> +#ifdef CONFIG_BLK_CACHE
> +       p->sequential_io = p->nr_ios = 0;
> +#endif
> 
> but you have removed nr_ios from task_struct ??

Hah. It wouldn't compile if it ever tried. I renamed BLK_CACHE to BCACHE
at one point and it seems I missed that one.

> 
> 
> 
> >  12 files changed, 10225 insertions(+), 2 deletions(-)
> 
> 
> Looking at bcache.txt....
> 
> To make bcache devices known to the kernel, echo them to /sys/fs/bcache/register
>   echo /dev/sdb > /sys/fs/bcache/register
>   echo /dev/sdc > /sys/fs/bcache/register
> 
> ???
> I know that /sys is heading the way of /proc and becoming a disorganised ad
> hoc mess, but we don't need to actively encourage that.
> So when you are creating a new block device type, putting controls
> under /sys/fs (where I believe 'fs' stands for "file system") seems ill
> advised.
> 
> My personal preference would be to see this as configuring the module and use
>   /sys/modules/bcache/parameters/register

I don't think that makes any more sense, as module parameters AFAIK are
even more explicitly just a value you can stick in and pull out.
/sys/fs/bcache/register is really more analogous to mount().

You're not the first person to complain about that, I moved it to
configfs for awhile at Greg K-H's behest... but when I added cache sets
I had to move it back to sysfs.

> 
> Alternately you could devise a new 'bus' type for bcache and do some sort of
> device-model magic to attach something as a new device of that type.

I like that, I think that could make a lot of sense.

I'm not sure what to do about register though, I do prefer to have it a
file you can echo to but it doesn't really fit anywhere.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [GIT] Bcache version 12
  2011-09-10  6:45 Kent Overstreet
@ 2011-09-11  6:18 ` NeilBrown
  2011-09-11 19:23   ` Kent Overstreet
  2011-09-15 22:03 ` Dan Williams
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2011-09-11  6:18 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, 9 Sep 2011 23:45:31 -0700 Kent Overstreet <kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> bcache, n.: a cache for arbitrary block devices that uses an SSD
> 
> It's probably past time I started poking people to see about getting
> this stuff in. It's synced up with mainline, the documentation is for
> once relatively up to date, and it looks just about production ready.
> 
> Suggestions are more than welcome on how to make it easier to review -
> it's entirely too much code, I know (near 10k lines now). I'll be
> emailing the patches that touch other parts of the kernel separately.
> 
> Short overview:
> Bcache does both writethrough and writeback caching. It presents itself
> as a new block device, a bit like say md. You can cache an arbitrary
> number of block devices with a single cache device, and attach and
> detach things at runtime - it's quite flexible.
> 
> It's very fast. It uses a b+ tree for the index, along with a journal to
> coalesce index updates, and a bunch of other cool tricks like auxiliary
> binary search trees with software floating point keys to avoid a bunch
> of random memory accesses when doing binary searches in the btree. It
> does over 50k iops doing 4k random /writes/ without breaking a sweat,
> and would do many times that if I had faster hardware.
> 
> It (configurably) tracks and skips sequential IO, so as to efficiently
> cache random IO. It's got more cool features than I can remember at this
> point. It's resilient, handling IO errors from the SSD when possible up
> to a configurable threshold, then detaching the cache from the backing
> device even while you're still using it.
> 
> The code is up at
> git://evilpiepirate.org/~kent/linux-bcache.git

In particular it is in the bcache-3.1 branch I assume.
The  HEAD branch is old 2.6.34 code.

> git://evilpiepirate.org/~kent/bcache-tools.git
> 
> The wiki is woefully out of date, but that might change one day:
> http://bcache.evilpiepirate.org
> 
> The most up to date documentation is in the kernel tree -
> Documentation/bcache.txt
> 
>  Documentation/ABI/testing/sysfs-block-bcache |  156 +
>  Documentation/bcache.txt                     |  265 +
>  block/Kconfig                                |   36 +
>  block/Makefile                               |    4 +
>  block/bcache.c                               | 8479 ++++++++++++++++++++++++++
>  block/bcache_util.c                          |  661 ++
>  block/bcache_util.h                          |  555 ++
>  fs/bio.c                                     |    9 +-

Any change that a new driver needs to existing code must raise a big question
mark.
This change in bio.c looks like a bit of a hack.  Could you just provide a
'front_pad' to bioset_create to give you space in each bio to store the
bio pool that the bio was allocated from.  See use of
mddev_bio_destructor in drivers/md/md.c for an example.
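
For illustration, the front_pad idiom looks roughly like this - the names
are invented and it's only the generic pattern, not a drop-in for bcache
(and, as the reply elsewhere in the thread notes, it doesn't cover the
bio_kmalloc() fallback case):

#include <linux/bio.h>
#include <linux/errno.h>
#include <linux/kernel.h>

struct padded_bio {
        struct bio_set  *source;        /* lives in the front pad */
        struct bio      bio;            /* the bio proper follows it */
};

static struct bio_set *example_bs;

static int example_setup(void)
{
        example_bs = bioset_create(64, offsetof(struct padded_bio, bio));
        return example_bs ? 0 : -ENOMEM;
}

static struct bio *example_bio_alloc(gfp_t gfp, int nr_vecs)
{
        struct bio *bio = bio_alloc_bioset(gfp, nr_vecs, example_bs);
        struct padded_bio *pb;

        if (!bio)
                return NULL;

        pb = container_of(bio, struct padded_bio, bio);
        pb->source = example_bs;  /* recoverable later, e.g. in the destructor */
        return bio;
}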


>  include/linux/blk_types.h                    |    2 +
>  include/linux/sched.h                        |    4 +

Could we have a few words justifying the new fields in task_struct?

In general your commit logs are much, much too brief (virtually non-existent).
It is much easier to review code if you also tell us what the purpose is :-)


>  include/trace/events/bcache.h                |   53 +
>  kernel/fork.c                                |    3 +

Does this code even compile?
fork.c now has
+#ifdef CONFIG_BLK_CACHE
+       p->sequential_io = p->nr_ios = 0;
+#endif

but you have removed nr_ios from task_struct ??



>  12 files changed, 10225 insertions(+), 2 deletions(-)


Looking at bcache.txt....

To make bcache devices known to the kernel, echo them to /sys/fs/bcache/register
  echo /dev/sdb > /sys/fs/bcache/register
  echo /dev/sdc > /sys/fs/bcache/register

???
I know that /sys is heading the way of /proc and becoming a disorganised ad
hoc mess, but we don't need to actively encourage that.
So when you are creating a new block device type, putting controls
under /sys/fs (where I believe 'fs' stands for "file system") seems ill
advised.

My personal preference would be to see this as configuring the module and use
  /sys/modules/bcache/parameters/register

Alternately you could devise a new 'bus' type for bcache and do some sort of
device-model magic to attach something as a new device of that type.

NeilBrown

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [GIT] Bcache version 12
@ 2011-09-10  6:45 Kent Overstreet
  2011-09-11  6:18 ` NeilBrown
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Kent Overstreet @ 2011-09-10  6:45 UTC (permalink / raw)
  To: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: rdunlap-/UHa2rfvQTnk1uMJSBkQmQ, axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, neilb-l3A5Bk7waGM

bcache, n.: a cache for arbitrary block devices that uses an SSD

It's probably past time I started poking people to see about getting
this stuff in. It's synced up with mainline, the documentation is for
once relatively up to date, and it looks just about production ready.

Suggestions are more than welcome on how to make it easier to review -
it's entirely too much code, I know (near 10k lines now). I'll be
emailing the patches that touch other parts of the kernel separately.

Short overview:
Bcache does both writethrough and writeback caching. It presents itself
as a new block device, a bit like say md. You can cache an arbitrary
number of block devices with a single cache device, and attach and
detach things at runtime - it's quite flexible.

It's very fast. It uses a b+ tree for the index, along with a journal to
coalesce index updates, and a bunch of other cool tricks like auxiliary
binary search trees with software floating point keys to avoid a bunch
of random memory accesses when doing binary searches in the btree. It
does over 50k iops doing 4k random /writes/ without breaking a sweat,
and would do many times that if I had faster hardware.

It (configurably) tracks and skips sequential IO, so as to efficiently
cache random IO. It's got more cool features than I can remember at this
point. It's resilient, handling IO errors from the SSD when possible up
to a configurable threshold, then detaching the cache from the backing
device even while you're still using it.

The code is up at
git://evilpiepirate.org/~kent/linux-bcache.git
git://evilpiepirate.org/~kent/bcache-tools.git

The wiki is woefully out of date, but that might change one day:
http://bcache.evilpiepirate.org

The most up to date documentation is in the kernel tree -
Documentation/bcache.txt

 Documentation/ABI/testing/sysfs-block-bcache |  156 +
 Documentation/bcache.txt                     |  265 +
 block/Kconfig                                |   36 +
 block/Makefile                               |    4 +
 block/bcache.c                               | 8479 ++++++++++++++++++++++++++
 block/bcache_util.c                          |  661 ++
 block/bcache_util.h                          |  555 ++
 fs/bio.c                                     |    9 +-
 include/linux/blk_types.h                    |    2 +
 include/linux/sched.h                        |    4 +
 include/trace/events/bcache.h                |   53 +
 kernel/fork.c                                |    3 +
 12 files changed, 10225 insertions(+), 2 deletions(-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-10-10 12:35 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1280519620.12031317482084581.JavaMail.root@shiva>
2011-10-01 15:19 ` [GIT] Bcache version 12 LuVar
2011-09-10  6:45 Kent Overstreet
2011-09-11  6:18 ` NeilBrown
2011-09-11 19:23   ` Kent Overstreet
     [not found]     ` <FD294A0B-7127-4ED1-89B8-3E3ADA796360@dilger.ca>
     [not found]       ` <FD294A0B-7127-4ED1-89B8-3E3ADA796360-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org>
2011-09-12  1:44         ` Kent Overstreet
2011-09-15 21:15           ` Dan Williams
     [not found]             ` <CAA9_cmeqevWoK=9WMD9c+csc8SbaYq0aK9j1qWr_0FEa6jWZEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-09-15 21:33               ` Kent Overstreet
     [not found]                 ` <CAC7rs0t_J+foaLZSuuw5BhpUAYfr-KY1iegFOxEBPCpbrkk1Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-09-19  7:16                   ` NeilBrown
2011-09-21  2:54                     ` Kent Overstreet
2011-09-29 23:38                       ` Dan Williams
     [not found]                         ` <CAA9_cmfOdv4ozkz7bd2QsbL5_VtAraMZMXoo0AAV0eCgNQr62Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-09-30  7:14                           ` Kent Overstreet
2011-09-30 19:47                             ` Williams, Dan J
2011-09-15 22:03 ` Dan Williams
2011-09-15 22:07   ` Kent Overstreet
2011-09-19  7:28 ` Pekka Enberg
     [not found]   ` <CAOJsxLFPODubVEB3Tjg54C7jDKM8H-RCM_u5kvO1D0kKyjUYXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-09-21  2:55     ` Kent Overstreet
2011-09-21  5:33       ` Pekka Enberg
2011-09-21  5:42         ` Pekka Enberg
2011-09-21  5:57           ` Kent Overstreet
2011-10-06 17:58             ` Pavel Machek
2011-10-10 12:35               ` LuVar
2011-09-20 15:37 ` Arnd Bergmann
2011-09-21  3:44   ` Kent Overstreet
2011-09-21  9:19     ` Arnd Bergmann
2011-09-22  4:07       ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).