* ceph-disk improvements
@ 2016-04-01 15:36 Sage Weil
  2016-04-02  5:54 ` Wido den Hollander
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Sage Weil @ 2016-04-01 15:36 UTC (permalink / raw)
  To: ceph-devel

Hi all,

There are a couple of looming features for ceph-disk:

1- Support for additional devices when using BlueStore.  There can be up 
to three: the main device, a WAL/journal device (small, ~128MB, ideally 
NVRAM), and a fast metadata device (as big as you have available; will be 
used for internal metadata).

2- Support for setting up dm-cache, bcache, and/or FlashCache underneath 
filestore or bluestore.

The current syntax of

 ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]

isn't terribly expressive.  For example, the journal device size is set 
via a config option, not on the command line.  For bluestore, the metadata 
device will probably want/need explicit user input so they can ensure it's 
1/Nth of their SSD (if they have N HDDs to each SSD).
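
To make that concrete, a minimal sketch of how sizing works today (the config
option name is real; the 5120 MB value and device names are only illustrative):

$ cat >> /etc/ceph/ceph.conf <<EOF
[osd]
# journal partition size in MB; 5120 is just an example
osd journal size = 5120
EOF
$ ceph-disk prepare /dev/sdd /dev/sdb   # journal partition on sdb sized from the option above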

And if we put dmcache in there, that partition will need to be sized too.

Another consideration is that right now we don't play nice with LVM at 
all.  Should we?  dm-cache is usually used in conjunction with LVM 
(although it doesn't have to be).  Does LVM provide value?  Like, the 
ability for users to add a second SSD to a box and migrate cache, wal, or 
journal partitions around?

I'm interested in hearing feedback on requirements, approaches, and 
interfaces before we go too far down the road...

Thanks!
sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-01 15:36 ceph-disk improvements Sage Weil
@ 2016-04-02  5:54 ` Wido den Hollander
  2016-04-02  8:52   ` Loic Dachary
                     ` (2 more replies)
  2016-04-04  0:41 ` Adrian Saul
  2016-04-07 12:10 ` Alfredo Deza
  2 siblings, 3 replies; 19+ messages in thread
From: Wido den Hollander @ 2016-04-02  5:54 UTC (permalink / raw)
  To: Sage Weil, ceph-devel


> On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
> 
> 
> Hi all,
> 
> There are a couple of looming features for ceph-disk:
> 
> 1- Support for additional devices when using BlueStore.  There can be up 
> to three: the main device, a WAL/journal device (small, ~128MB, ideally 
> NVRAM), and a fast metadata device (as big as you have available; will be 
> used for internal metadata).
> 
> 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath 
> filestore or bluestore.
> 

Keep in mind that you can't create a partition on a bcache device. So when using
bcache, the journal has to be file-based and not a partition.

If we add the flag --file-based-journal or --no-partitions we can create OSDs on
both bcache and dm-cache.

With BlueStore this becomes a problem, since it requires the small (XFS)
filesystem for its metadata.

Wido

> The current syntax of
> 
>  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
> 
> isn't terribly expressive.  For example, the journal device size is set 
> via a config option, not on the command line.  For bluestore, the metadata 
> device will probably want/need explicit user input so they can ensure it's 
> 1/Nth of their SSD (if they have N HDDs to each SSD).
> 
> And if we put dmcache in there, that partition will need to be sized too.
> 
> Another consideration is that right now we don't play nice with LVM at 
> all.  Should we?  dm-cache is usually used in conjunction with LVM 
> (although it doesn't have to be).  Does LVM provide value?  Like, the 
> ability for users to add a second SSD to a box and migrate cache, wal, or 
> journal partitions around?
> 
> I'm interested in hearing feedback on requirements, approaches, and 
> interfaces before we go too far down the road...
> 
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-02  5:54 ` Wido den Hollander
@ 2016-04-02  8:52   ` Loic Dachary
  2016-04-04 12:58     ` Wido den Hollander
  2016-04-03 12:59   ` Sage Weil
  2016-04-07  9:26   ` Lars Marowsky-Bree
  2 siblings, 1 reply; 19+ messages in thread
From: Loic Dachary @ 2016-04-02  8:52 UTC (permalink / raw)
  To: Wido den Hollander, ceph-devel

Hi Wido,

On 02/04/2016 07:54, Wido den Hollander wrote:
> 
>> On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
>>
>>
>> Hi all,
>>
>> There are a couple of looming features for ceph-disk:
>>
>> 1- Support for additional devices when using BlueStore.  There can be up 
>> to three: the main device, a WAL/journal device (small, ~128MB, ideally 
>> NVRAM), and a fast metadata device (as big as you have available; will be 
>> used for internal metadata).
>>
>> 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath 
>> filestore or bluestore.
>>
> 
> Keep in mind that you can't create a partition on a bcache device. So when using
> bcache, the journal has to be filebased and not a partition.

Is this true of all bcache versions (https://bcache.evilpiepirate.org/)? Or is it a planned feature? Or is it never going to happen?

Cheers

> 
> If we add the flag --file-based-journal or --no-partitions we can create OSDs on
> both bcache and dm-cache.
> 
> With BlueStore this becomes a problem since it requires the small (XFS)
> filesystem for it's metadata.
> 
> Wido
> 
>> The current syntax of
>>
>>  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
>>
>> isn't terribly expressive.  For example, the journal device size is set 
>> via a config option, not on the command line.  For bluestore, the metadata 
>> device will probably want/need explicit user input so they can ensure it's 
>> 1/Nth of their SSD (if they have N HDDs to each SSD).
>>
>> And if we put dmcache in there, that partition will need to be sized too.
>>
>> Another consideration is that right now we don't play nice with LVM at 
>> all.  Should we?  dm-cache is usually used in conjunction with LVM 
>> (although it doesn't have to be).  Does LVM provide value?  Like, the 
>> ability for users to add a second SSD to a box and migrate cache, wal, or 
>> journal partitions around?
>>
>> I'm interested in hearing feedback on requirements, approaches, and 
>> interfaces before we go too far down the road...
>>
>> Thanks!
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-02  5:54 ` Wido den Hollander
  2016-04-02  8:52   ` Loic Dachary
@ 2016-04-03 12:59   ` Sage Weil
  2016-04-04 13:04     ` Wido den Hollander
  2016-04-07  9:26   ` Lars Marowsky-Bree
  2 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2016-04-03 12:59 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Sat, 2 Apr 2016, Wido den Hollander wrote:
> > On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
> > Hi all,
> > 
> > There are a couple of looming features for ceph-disk:
> > 
> > 1- Support for additional devices when using BlueStore.  There can be up 
> > to three: the main device, a WAL/journal device (small, ~128MB, ideally 
> > NVRAM), and a fast metadata device (as big as you have available; will be 
> > used for internal metadata).
> > 
> > 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath 
> > filestore or bluestore.
> > 
> 
> Keep in mind that you can't create a partition on a bcache device. So when using
> bcache, the journal has to be filebased and not a partition.

Can you create a bcache device out of a partition, though?

sage

> If we add the flag --file-based-journal or --no-partitions we can create OSDs on
> both bcache and dm-cache.
> 
> With BlueStore this becomes a problem since it requires the small (XFS)
> filesystem for it's metadata.
> 
> Wido
> 
> > The current syntax of
> > 
> >  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
> > 
> > isn't terribly expressive.  For example, the journal device size is set 
> > via a config option, not on the command line.  For bluestore, the metadata 
> > device will probably want/need explicit user input so they can ensure it's 
> > 1/Nth of their SSD (if they have N HDDs to each SSD).
> > 
> > And if we put dmcache in there, that partition will need to be sized too.
> > 
> > Another consideration is that right now we don't play nice with LVM at 
> > all.  Should we?  dm-cache is usually used in conjunction with LVM 
> > (although it doesn't have to be).  Does LVM provide value?  Like, the 
> > ability for users to add a second SSD to a box and migrate cache, wal, or 
> > journal partitions around?
> > 
> > I'm interested in hearing feedback on requirements, approaches, and 
> > interfaces before we go too far down the road...
> > 
> > Thanks!
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: ceph-disk improvements
  2016-04-01 15:36 ceph-disk improvements Sage Weil
  2016-04-02  5:54 ` Wido den Hollander
@ 2016-04-04  0:41 ` Adrian Saul
  2016-04-07 12:10 ` Alfredo Deza
  2 siblings, 0 replies; 19+ messages in thread
From: Adrian Saul @ 2016-04-04  0:41 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

> Another consideration is that right now we don't play nice with LVM at all.
> Should we?  dm-cache is usually used in conjunction with LVM (although it
> doesn't have to be).  Does LVM provide value?  Like, the ability for users to
> add a second SSD to a box and migrate cache, wal, or journal partitions
> around?

Perhaps ceph-disk could be pointed at a VG and create LVs of a configured or
specified size for these devices, rather than requiring users to manually
partition a disk or add a filesystem layer just to simplify journal creation.
If ceph-disk knows the OSD ID at creation time, it can use it in the LV name,
which makes it much clearer what each LV is for. You can sort of do this now by
wrapping ceph-disk in a script, but the OSD ID isn't known in advance, so in my
case I use the physical disk name instead, which is not always consistent (disk
swaps change device names across reboots, etc.). Having ceph-disk integrate
this would tie things together more cleanly.

Benefits would include online migration to a new disk by adding it to the VG
and using pvmove, or online resizing of these volumes with lvextend (ceph-disk
grow-journal?) if the sizing leaves room to be adjusted later.
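
As a rough sketch of what that could look like with stock LVM today (the LVM
commands are standard; the VG/LV names and the OSD id are assumptions, and the
ceph-disk line is the part that currently needs the wrapper script mentioned
above):

$ pvcreate /dev/sdb                                 # SSD that will hold journals
$ vgcreate ceph-journals /dev/sdb
$ lvcreate -L 10G -n journal-osd.12 ceph-journals   # name the LV after the OSD id
$ ceph-disk prepare /dev/sdd /dev/ceph-journals/journal-osd.12
# later: migrate journals to a new SSD online
$ vgextend ceph-journals /dev/sde
$ pvmove /dev/sdb /dev/sde
# or grow a journal if the VG has free space left
$ lvextend -L +5G /dev/ceph-journals/journal-osd.12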

>
> I'm interested in hearing feedback on requirements, approaches, and
> interfaces before we go too far down the road...
>
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-02  8:52   ` Loic Dachary
@ 2016-04-04 12:58     ` Wido den Hollander
  0 siblings, 0 replies; 19+ messages in thread
From: Wido den Hollander @ 2016-04-04 12:58 UTC (permalink / raw)
  To: Loic Dachary, ceph-devel


> On 2 April 2016 at 10:52, Loic Dachary <loic@dachary.org> wrote:
> 
> 
> Hi Wido,
> 
> On 02/04/2016 07:54, Wido den Hollander wrote:
> > 
> >> On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
> >>
> >>
> >> Hi all,
> >>
> >> There are a couple of looming features for ceph-disk:
> >>
> >> 1- Support for additional devices when using BlueStore.  There can be up 
> >> to three: the main device, a WAL/journal device (small, ~128MB, ideally 
> >> NVRAM), and a fast metadata device (as big as you have available; will be 
> >> used for internal metadata).
> >>
> >> 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath 
> >> filestore or bluestore.
> >>
> > 
> > Keep in mind that you can't create a partition on a bcache device. So when
> > using
> > bcache, the journal has to be filebased and not a partition.
> 
> Is this true of all bcache versions ( https://bcache.evilpiepirate.org/ ) ? Or
> is it a planned feature ? Or is it never going to happen ?

I am not sure, but in my experience it is not supported on any kernel. I have
tried up to version 4.2.

> 
> Cheers
> 
> > 
> > If we add the flag --file-based-journal or --no-partitions we can create
> > OSDs on
> > both bcache and dm-cache.
> > 
> > With BlueStore this becomes a problem since it requires the small (XFS)
> > filesystem for it's metadata.
> > 
> > Wido
> > 
> >> The current syntax of
> >>
> >>  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
> >>
> >> isn't terribly expressive.  For example, the journal device size is set 
> >> via a config option, not on the command line.  For bluestore, the metadata 
> >> device will probably want/need explicit user input so they can ensure it's 
> >> 1/Nth of their SSD (if they have N HDDs to each SSD).
> >>
> >> And if we put dmcache in there, that partition will need to be sized too.
> >>
> >> Another consideration is that right now we don't play nice with LVM at 
> >> all.  Should we?  dm-cache is usually used in conjunction with LVM 
> >> (although it doesn't have to be).  Does LVM provide value?  Like, the 
> >> ability for users to add a second SSD to a box and migrate cache, wal, or 
> >> journal partitions around?
> >>
> >> I'm interested in hearing feedback on requirements, approaches, and 
> >> interfaces before we go too far down the road...
> >>
> >> Thanks!
> >> sage
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-03 12:59   ` Sage Weil
@ 2016-04-04 13:04     ` Wido den Hollander
  2016-04-05  8:30       ` Sebastien Han
  0 siblings, 1 reply; 19+ messages in thread
From: Wido den Hollander @ 2016-04-04 13:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


> On 3 April 2016 at 14:59, Sage Weil <sweil@redhat.com> wrote:
> 
> 
> On Sat, 2 Apr 2016, Wido den Hollander wrote:
> > > On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
> > > Hi all,
> > > 
> > > There are a couple of looming features for ceph-disk:
> > > 
> > > 1- Support for additional devices when using BlueStore.  There can be up 
> > > to three: the main device, a WAL/journal device (small, ~128MB, ideally 
> > > NVRAM), and a fast metadata device (as big as you have available; will be 
> > > used for internal metadata).
> > > 
> > > 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath 
> > > filestore or bluestore.
> > > 
> > 
> > Keep in mind that you can't create a partition on a bcache device. So when
> > using
> > bcache, the journal has to be filebased and not a partition.
> 
> Can you create a bcache device out of a partition, though?
> 

Yes. If you have /dev/sdb, which is an SSD, and /dev/sdc, which is an HDD, you
can do the following.

/dev/sdb (the SSD) can be used as the caching device:

$ make-bcache -C /dev/sdb

Now, you can partition /dev/sdc (the HDD):

$ parted /dev/sdc mklabel gpt
$ parted /dev/sdc mkpart primary 2048s 10G
$ parted /dev/sdc mkpart primary 10G 100%
$ make-bcache -B /dev/sdc2

Now you still have to attach /dev/bcache0 (which is /dev/sdc2) to /dev/sdb by
echoing the cache set UUID to /sys/block/bcache0/bcache/attach.
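
Concretely, the attach step is roughly (bcache-super-show ships with
bcache-tools; <cset-uuid> is a placeholder):

$ bcache-super-show /dev/sdb | grep cset.uuid
$ echo <cset-uuid> > /sys/block/bcache0/bcache/attach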

This is explained in a quick howto here:
https://wiki.archlinux.org/index.php/Bcache

So for BlueStore this would work: a small, non-bcache partition for XFS and
/dev/bcache0 for BlueStore directly.

The question is whether you want ceph-disk to set up bcache completely, or ask
the user to provide an already configured device.

$ ceph-disk prepare --bluestore --bcache /dev/sdc1:/dev/bcache0

The first device will be the XFS partition with the metadata and the second
will be the data device.

Wido

> sage
> 
> > If we add the flag --file-based-journal or --no-partitions we can create
> > OSDs on
> > both bcache and dm-cache.
> > 
> > With BlueStore this becomes a problem since it requires the small (XFS)
> > filesystem for it's metadata.
> > 
> > Wido
> > 
> > > The current syntax of
> > > 
> > >  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
> > > 
> > > isn't terribly expressive.  For example, the journal device size is set 
> > > via a config option, not on the command line.  For bluestore, the metadata
> > > 
> > > device will probably want/need explicit user input so they can ensure it's
> > > 
> > > 1/Nth of their SSD (if they have N HDDs to each SSD).
> > > 
> > > And if we put dmcache in there, that partition will need to be sized too.
> > > 
> > > Another consideration is that right now we don't play nice with LVM at 
> > > all.  Should we?  dm-cache is usually used in conjunction with LVM 
> > > (although it doesn't have to be).  Does LVM provide value?  Like, the 
> > > ability for users to add a second SSD to a box and migrate cache, wal, or 
> > > journal partitions around?
> > > 
> > > I'm interested in hearing feedback on requirements, approaches, and 
> > > interfaces before we go too far down the road...
> > > 
> > > Thanks!
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-04 13:04     ` Wido den Hollander
@ 2016-04-05  8:30       ` Sebastien Han
  2016-04-05  9:26         ` Ilya Dryomov
  0 siblings, 1 reply; 19+ messages in thread
From: Sebastien Han @ 2016-04-05  8:30 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Sage Weil, ceph-devel, Samuel Yaple

Wido, I just discussed that with Sam last week and it seems that
bcache allocates only a single minor when creating the device.
Sam ended up writing this:
https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
The fix is not complex; not sure why it is not part of bcache yet...

Not sure if it's ceph-disk's job to do all of this with bcache though...
We might need to check with the bcache guys what their plans are about this.
If this will go through at some point we might just wait; if not, we
could implement the partition trick in ceph-disk.

In addition to that, ceph-disk could have a more general cache option
where we could add "plugins" like bcache, dm-cache etc.
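
Purely hypothetical syntax, just to illustrate the idea (none of these flags
exist today):

$ ceph-disk prepare --cache bcache --cache-dev /dev/sdb /dev/sdc
$ ceph-disk prepare --cache dm-cache --cache-dev /dev/sdb /dev/sdc
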
Thanks!


On Mon, Apr 4, 2016 at 3:04 PM, Wido den Hollander <wido@42on.com> wrote:
>
>> On 3 April 2016 at 14:59, Sage Weil <sweil@redhat.com> wrote:
>>
>>
>> On Sat, 2 Apr 2016, Wido den Hollander wrote:
>> > > On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
>> > > Hi all,
>> > >
>> > > There are a couple of looming features for ceph-disk:
>> > >
>> > > 1- Support for additional devices when using BlueStore.  There can be up
>> > > to three: the main device, a WAL/journal device (small, ~128MB, ideally
>> > > NVRAM), and a fast metadata device (as big as you have available; will be
>> > > used for internal metadata).
>> > >
>> > > 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath
>> > > filestore or bluestore.
>> > >
>> >
>> > Keep in mind that you can't create a partition on a bcache device. So when
>> > using
>> > bcache, the journal has to be filebased and not a partition.
>>
>> Can you create a bcache device out of a partition, though?
>>
>
> Yes. If you have /dev/sdb which is a SSD and /dev/sdc which is a disk, you can
> do:
>
> /dev/sdc can be used as a caching device:
>
> $ make-bcache -C /dev/sdb
>
> Now, you can partition /dev/sdc (the HDD):
>
> $ parted /dev/sdc mklabel gpt
> $ parted /dev/sdc mkpart primary 2048s 10G
> $ parted /dev/sdc mkpart primary 10G 100%
> $ make-bcache -B /dev/sdc2
>
> Now you still have to attach /dev/bcache0 (which is /dev/sdc2) to /dev/sdb by
> echoing the UUID to /sys/block/bcache0/bcache/attach
>
> This is explained in a quick howto here:
> https://wiki.archlinux.org/index.php/Bcache
>
> So for BlueStore this would work. A small, non-bcache, parition for XFS and
> /dev/bcache0 for BlueStore directly.
>
> The question will be if you want ceph-disk to prepare bcache completely, or ask
> the user to provide a already configured device.
>
> $ ceph-disk prepare --bluestore --bcache /dev/sdc1:/dev/bcache0
>
> The first device will be the XFS partition with metadata and the second will be
> data device.
>
> Wido
>
>> sage
>>
>> > If we add the flag --file-based-journal or --no-partitions we can create
>> > OSDs on
>> > both bcache and dm-cache.
>> >
>> > With BlueStore this becomes a problem since it requires the small (XFS)
>> > filesystem for it's metadata.
>> >
>> > Wido
>> >
>> > > The current syntax of
>> > >
>> > >  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
>> > >
>> > > isn't terribly expressive.  For example, the journal device size is set
>> > > via a config option, not on the command line.  For bluestore, the metadata
>> > >
>> > > device will probably want/need explicit user input so they can ensure it's
>> > >
>> > > 1/Nth of their SSD (if they have N HDDs to each SSD).
>> > >
>> > > And if we put dmcache in there, that partition will need to be sized too.
>> > >
>> > > Another consideration is that right now we don't play nice with LVM at
>> > > all.  Should we?  dm-cache is usually used in conjunction with LVM
>> > > (although it doesn't have to be).  Does LVM provide value?  Like, the
>> > > ability for users to add a second SSD to a box and migrate cache, wal, or
>> > > journal partitions around?
>> > >
>> > > I'm interested in hearing feedback on requirements, approaches, and
>> > > interfaces before we go too far down the road...
>> > >
>> > > Thanks!
>> > > sage
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > > the body of a message to majordomo@vger.kernel.org
>> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers

––––––
Sébastien Han
Senior Cloud Architect

"Always give 100%. Unless you're giving blood."

Mail: seb@redhat.com
Address: 11 bis, rue Roquépine - 75008 Paris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-05  8:30       ` Sebastien Han
@ 2016-04-05  9:26         ` Ilya Dryomov
  2016-04-05  9:41           ` Wido den Hollander
  2016-04-05 10:21           ` Loic Dachary
  0 siblings, 2 replies; 19+ messages in thread
From: Ilya Dryomov @ 2016-04-05  9:26 UTC (permalink / raw)
  To: Sebastien Han
  Cc: Wido den Hollander, Sage Weil, Ceph Development, Samuel Yaple

On Tue, Apr 5, 2016 at 10:30 AM, Sebastien Han <shan@redhat.com> wrote:
> Wido, I just discussed that with Sam last week and it seems that
> bcache allocates a minor of 1 when creating the device.
> Sam ended up writing this:
> https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
> The fix is not complex not sure why it is not part of bcache yet...

I think it's just that no one complained loud enough.

>
> Not sure if it's ceph-disk's job to do all of this with bcache though...
> We might need to check with the bache guys what are their plans about this.
> If this will go through at some point we might just wait, if not we
> could implement the partition trick on ceph-disk.

Making something like this go through shouldn't be a problem.  Sam's
patch is a bit of a quick hack though - it messes up bcache device IDs
and also limits the number of partitions to 16.  Better to avoid
another hard-coded constant, if possible.

    # ls -lh /dev/bcache*
    brw-rw---- 1 root disk 254,  0 Mar 31 20:17 /dev/bcache0
    brw-rw---- 1 root disk 254,  1 Mar 31 20:17 /dev/bcache0p1
    brw-rw---- 1 root disk 254, 16 Mar 31 20:17 /dev/bcache16
    brw-rw---- 1 root disk 254, 17 Mar 31 20:17 /dev/bcache16p1
    brw-rw---- 1 root disk 254, 32 Mar 31 20:17 /dev/bcache32
    brw-rw---- 1 root disk 254, 33 Mar 31 20:17 /dev/bcache32p1

We had to solve almost exactly this problem in rbd.  I can submit
a patch for bcache if it helps ceph-disk in the long run.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-05  9:26         ` Ilya Dryomov
@ 2016-04-05  9:41           ` Wido den Hollander
  2016-04-05 10:21           ` Loic Dachary
  1 sibling, 0 replies; 19+ messages in thread
From: Wido den Hollander @ 2016-04-05  9:41 UTC (permalink / raw)
  To: Sebastien Han, Ilya Dryomov; +Cc: Samuel Yaple, Sage Weil, Ceph Development


> On 5 April 2016 at 11:26, Ilya Dryomov <idryomov@gmail.com> wrote:
> 
> 
> On Tue, Apr 5, 2016 at 10:30 AM, Sebastien Han <shan@redhat.com> wrote:
> > Wido, I just discussed that with Sam last week and it seems that
> > bcache allocates a minor of 1 when creating the device.
> > Sam ended up writing this:
> > https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
> > The fix is not complex not sure why it is not part of bcache yet...
> 
> I think it's just that no one complained loud enough.
> 

Yes, I think so. We should probably make some noise. That way Ceph could use
bcache very easily.

Just provide /dev/bcache0 to ceph-disk and you are done.
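
That is, something as simple as this (using the prepare syntax quoted earlier
in the thread; the device name is illustrative):

$ ceph-disk prepare --bluestore /dev/bcache0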

> >
> > Not sure if it's ceph-disk's job to do all of this with bcache though...
> > We might need to check with the bache guys what are their plans about this.
> > If this will go through at some point we might just wait, if not we
> > could implement the partition trick on ceph-disk.
> 
> Making something like this go through shouldn't be a problem.  Sam's
> patch is a bit of quick hack though - it messes up bcache device IDs
> and also limits the number of partitions to 16.  Better to avoid
> another hard-coded constant, if possible.
> 
>     # ls -lh /dev/bcache*
>     brw-rw---- 1 root disk 254,  0 Mar 31 20:17 /dev/bcache0
>     brw-rw---- 1 root disk 254,  1 Mar 31 20:17 /dev/bcache0p1
>     brw-rw---- 1 root disk 254, 16 Mar 31 20:17 /dev/bcache16
>     brw-rw---- 1 root disk 254, 17 Mar 31 20:17 /dev/bcache16p1
>     brw-rw---- 1 root disk 254, 32 Mar 31 20:17 /dev/bcache32
>     brw-rw---- 1 root disk 254, 33 Mar 31 20:17 /dev/bcache32p1
> 
> We had to solve almost exactly this problem in rbd.  I can submit
> a patch for bcache if it helps ceph-disk in the long run.
> 
> Thanks,
> 
>                 Ilya
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-05  9:26         ` Ilya Dryomov
  2016-04-05  9:41           ` Wido den Hollander
@ 2016-04-05 10:21           ` Loic Dachary
       [not found]             ` <CAJ3CzQWJnC7O6pcAF57MqYXMyZEpZVUSWhfynb6yue2iLKXfLA@mail.gmail.com>
  1 sibling, 1 reply; 19+ messages in thread
From: Loic Dachary @ 2016-04-05 10:21 UTC (permalink / raw)
  To: Ilya Dryomov, Sebastien Han
  Cc: Wido den Hollander, Sage Weil, Ceph Development, Samuel Yaple

Hi Ilya,

On 05/04/2016 11:26, Ilya Dryomov wrote:
> On Tue, Apr 5, 2016 at 10:30 AM, Sebastien Han <shan@redhat.com> wrote:
>> Wido, I just discussed that with Sam last week and it seems that
>> bcache allocates a minor of 1 when creating the device.
>> Sam ended up writing this:
>> https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
>> The fix is not complex not sure why it is not part of bcache yet...
> 
> I think it's just that no one complained loud enough.
> 
>>
>> Not sure if it's ceph-disk's job to do all of this with bcache though...
>> We might need to check with the bache guys what are their plans about this.
>> If this will go through at some point we might just wait, if not we
>> could implement the partition trick on ceph-disk.
> 
> Making something like this go through shouldn't be a problem.  Sam's
> patch is a bit of quick hack though - it messes up bcache device IDs
> and also limits the number of partitions to 16.  Better to avoid
> another hard-coded constant, if possible.
> 
>     # ls -lh /dev/bcache*
>     brw-rw---- 1 root disk 254,  0 Mar 31 20:17 /dev/bcache0
>     brw-rw---- 1 root disk 254,  1 Mar 31 20:17 /dev/bcache0p1
>     brw-rw---- 1 root disk 254, 16 Mar 31 20:17 /dev/bcache16
>     brw-rw---- 1 root disk 254, 17 Mar 31 20:17 /dev/bcache16p1
>     brw-rw---- 1 root disk 254, 32 Mar 31 20:17 /dev/bcache32
>     brw-rw---- 1 root disk 254, 33 Mar 31 20:17 /dev/bcache32p1
> 
> We had to solve almost exactly this problem in rbd.  I can submit
> a patch for bcache if it helps ceph-disk in the long run.

It would help. Implementing a workaround in ceph-disk to compensate for the fact that bcache does not support partitioning feels much better when there is hope it will eventually be removed :-)

Cheers

> 
> Thanks,
> 
>                 Ilya
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
       [not found]             ` <CAJ3CzQWJnC7O6pcAF57MqYXMyZEpZVUSWhfynb6yue2iLKXfLA@mail.gmail.com>
@ 2016-04-05 13:26               ` Ilya Dryomov
       [not found]                 ` <CAJ3CzQUWyF-B6o5wBmJ0rcfK_aBkmSX7LcRDkLj=W5COSbPwnQ@mail.gmail.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Ilya Dryomov @ 2016-04-05 13:26 UTC (permalink / raw)
  To: Samuel Yaple
  Cc: Loic Dachary, Sebastien Han, Wido den Hollander, Sage Weil,
	Ceph Development, Kent Overstreet

On Tue, Apr 5, 2016 at 1:48 PM, Sam Yaple <samuel@yaple.net> wrote:
> On Tue, Apr 5, 2016 at 10:21 AM, Loic Dachary <loic@dachary.org> wrote:
>>
>> Hi Ilya,
>>
>> On 05/04/2016 11:26, Ilya Dryomov wrote:
>> > On Tue, Apr 5, 2016 at 10:30 AM, Sebastien Han <shan@redhat.com> wrote:
>> >> Wido, I just discussed that with Sam last week and it seems that
>> >> bcache allocates a minor of 1 when creating the device.
>> >> Sam ended up writing this:
>> >> https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
>> >> The fix is not complex not sure why it is not part of bcache yet...
>> >
>> > I think it's just that no one complained loud enough.
>> >
>> >>
>> >> Not sure if it's ceph-disk's job to do all of this with bcache
>> >> though...
>> >> We might need to check with the bache guys what are their plans about
>> >> this.
>> >> If this will go through at some point we might just wait, if not we
>> >> could implement the partition trick on ceph-disk.
>> >
>> > Making something like this go through shouldn't be a problem.  Sam's
>> > patch is a bit of quick hack though - it messes up bcache device IDs
>> > and also limits the number of partitions to 16.  Better to avoid
>> > another hard-coded constant, if possible.
>
>
> This has already been discussed with Kent Overstreet in the IRCs. I am
> looking into patching it properly (this was very much a quick-and-dirty hack)
> but I will admit it is not a top priority for me. As far as 'messing up
> bcache device IDs' goes, I would entirely disagree. For starters, this is how
> zfs exposes its volumes (/dev/zd0, /dev/zd16, etc.). But the more important
> point, I think, is that up until now bcache has been using the device minor
> number _as_ the bcache device number. Changing that behavior is less than
> ideal to me and surely more prone to bugs. Since you can't be assured that
> bcache0 will be the same device after a reboot anyway, I don't see why it
> matters. Use PartUUIDs and other labels and be done with it.

This is just common sense: if I create three bcache devices on my
system, I'd expect them to be named /dev/bcache{0,1,2} (or {1,2,3}, or
{a,b,c}) just like other block devices are.  An out-of-tree zfs is
hardly the best example to follow here.

Of course if userspace tooling expects or relies on minor numbers being
equal to device IDs, that's a good enough reason to keep it as is.  The
same goes for limiting the number of partitions to 16: if tools expect
the major to be the same for all bcache device partitions, it'd have to
be hard-coded.

Both of my points are just suggestions though.

>>
>> >
>> >     # ls -lh /dev/bcache*
>> >     brw-rw---- 1 root disk 254,  0 Mar 31 20:17 /dev/bcache0
>> >     brw-rw---- 1 root disk 254,  1 Mar 31 20:17 /dev/bcache0p1
>> >     brw-rw---- 1 root disk 254, 16 Mar 31 20:17 /dev/bcache16
>> >     brw-rw---- 1 root disk 254, 17 Mar 31 20:17 /dev/bcache16p1
>> >     brw-rw---- 1 root disk 254, 32 Mar 31 20:17 /dev/bcache32
>> >     brw-rw---- 1 root disk 254, 33 Mar 31 20:17 /dev/bcache32p1
>> >
>> > We had to solve almost exactly this problem in rbd.  I can submit
>> > a patch for bcache if it helps ceph-disk in the long run.
>
>
> I was working on this, but I have since found myself busy and don't have any
> idea of a time frame.
>
>>
>> It would help. Implementing a workaround in ceph-disk to compensate for
>> the fact that bcache does not support partitioning feels much better when
>> there is hope it will eventually be removed :-)
>
>
> There is no push back from Kent on this matter. I feel confident any
> implemented workaround in ceph-disk will be able to be removed.
>
> #bcache.2016-04-01.log:00:54 < py1hon> well, if you found it useful maybe
> other people will too, but to send a patch upstream I'd want to figure out
> what the most standard way is, if there is one :)

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
       [not found]                 ` <CAJ3CzQUWyF-B6o5wBmJ0rcfK_aBkmSX7LcRDkLj=W5COSbPwnQ@mail.gmail.com>
@ 2016-04-05 14:42                   ` Ilya Dryomov
  0 siblings, 0 replies; 19+ messages in thread
From: Ilya Dryomov @ 2016-04-05 14:42 UTC (permalink / raw)
  To: Samuel Yaple
  Cc: Loic Dachary, Sebastien Han, Wido den Hollander, Sage Weil,
	Ceph Development, Kent Overstreet

On Tue, Apr 5, 2016 at 4:10 PM, Sam Yaple <samuel@yaple.net> wrote:
> On Tue, Apr 5, 2016 at 1:26 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>
>> On Tue, Apr 5, 2016 at 1:48 PM, Sam Yaple <samuel@yaple.net> wrote:
>> > On Tue, Apr 5, 2016 at 10:21 AM, Loic Dachary <loic@dachary.org> wrote:
>> >>
>> >> Hi Ilya,
>> >>
>> >> On 05/04/2016 11:26, Ilya Dryomov wrote:
>> >> > On Tue, Apr 5, 2016 at 10:30 AM, Sebastien Han <shan@redhat.com>
>> >> > wrote:
>> >> >> Wido, I just discussed that with Sam last week and it seems that
>> >> >> bcache allocates a minor of 1 when creating the device.
>> >> >> Sam ended up writing this:
>> >> >> https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
>> >> >> The fix is not complex not sure why it is not part of bcache yet...
>> >> >
>> >> > I think it's just that no one complained loud enough.
>> >> >
>> >> >>
>> >> >> Not sure if it's ceph-disk's job to do all of this with bcache
>> >> >> though...
>> >> >> We might need to check with the bache guys what are their plans
>> >> >> about
>> >> >> this.
>> >> >> If this will go through at some point we might just wait, if not we
>> >> >> could implement the partition trick on ceph-disk.
>> >> >
>> >> > Making something like this go through shouldn't be a problem.  Sam's
>> >> > patch is a bit of quick hack though - it messes up bcache device IDs
>> >> > and also limits the number of partitions to 16.  Better to avoid
>> >> > another hard-coded constant, if possible.
>> >
>> >
>> > This has already been discussed with Kent Overstreet in the IRCs. I am
>> > looking into patching properly (this was very much a quick-and-dirty
>> > hack)
>> > but I will admit it is not a top priority for me. As far as it 'messing
>> > up
>> > bcache devices IDs' I would entirely disagree. For starters, this is how
>> > zfs
>> > spits out its volumes (/dev/zdb0, /dev/zdb16, etc). But more importantly
>> > I
>> > think is that up until this point bcache has been using the device minor
>> > number _as_ the bcache device number. Changing that behavior is less
>> > than
>> > ideal to me and surely more prone to bugs. Since you can't be assured
>> > that
>> > bcache0 will be the same device after a reboot anyway, I dont see why it
>> > matters. Use PartUUIDs and other labels and be done with it.
>>
>> This is just common sense: if I create three bcache devices on my
>> system, I'd expect them to be named /dev/bcache{0,1,2} (or {1,2,3}, or
>> {a,b,c}) just like other block devices are.  An out-of-tree zfs is
>> hardly the best example to follow here.
>
>
> They actually _won't_ be named bcache{0,1,2} in a long-running system. The
> scenario is: you create and delete a bcache device; the module keeps track of
> that counter, so it might be /dev/bcache{0,4,5}. Maybe call it a bug? Either
> way, that's how it currently works. Fair enough about zfs, and frankly I don't
> have strong opinions about the naming. I'm worried about regressions though.

Does it?  It's not a simple counter - those IDs are managed by an ida
tree.  You grab an ID; when you are done with it (i.e. delete a device),
you put it back and it becomes available for reuse.

>>
>>
>> Of course if userspace tooling expects or relies on minor numbers being
>> equal to device IDs, that's a good enough reason to keep it as is.  The
>> same goes for limiting the number of partitions to 16: if tools expect
>> the major to be the same for all bcache device partitions, it'd have to
>> be hard-coded.
>
> The hard-coded-ness was part of the quick-and-dirty fix. I haven't looked at
> how, say, the sd* drivers work but they, by default, increment by 16. So at
> the very least it could be a sane default. The loop driver has a max_part
> option. Not a big fan of that myself, but it is an option. And again I
> really don't care about the way we solve this as long as it is solved. Like
> Kent said, we just need to find the "standard" way to do this, or the least
> bad.

I think the most "standard" way is:

- Reuse device IDs.  It looks like bcache is already using an ida tree
  for its minors (device IDs, currently), so if you are seeing {0,1,4},
  it's probably a bug.

- Preallocate 16 minors.  Don't introduce max_part or anything like
  that.

- Use extended devt to support devices with more than 16 partitions.
  The catch is those (p17+) partitions are going to have a different
  (fixed) major, so one would have to make sure that bcache tools are
  fine with that.

That's what rbd, virtblk and a bunch of other block device drivers are
doing and that's the patch that I had in mind.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-02  5:54 ` Wido den Hollander
  2016-04-02  8:52   ` Loic Dachary
  2016-04-03 12:59   ` Sage Weil
@ 2016-04-07  9:26   ` Lars Marowsky-Bree
  2016-04-07 10:31     ` Sebastien Han
  2 siblings, 1 reply; 19+ messages in thread
From: Lars Marowsky-Bree @ 2016-04-07  9:26 UTC (permalink / raw)
  To: ceph-devel

On 2016-04-02T07:54:05, Wido den Hollander <wido@42on.com> wrote:

> 
> > On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
> > 
> > 
> > Hi all,
> > 
> > There are a couple of looming features for ceph-disk:
> > 
> > 1- Support for additional devices when using BlueStore.  There can be up 
> > to three: the main device, a WAL/journal device (small, ~128MB, ideally 
> > NVRAM), and a fast metadata device (as big as you have available; will be 
> > used for internal metadata).
> > 
> > 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath 
> > filestore or bluestore.
> > 
> 
> Keep in mind that you can't create a partition on a bcache device. So when using
> bcache, the journal has to be filebased and not a partition.

Is there a point to running the journal on bcache? I see the point of
bcache for the actual data store (potentially on an HDD) - but wouldn't
the WAL/journal go directly on NVMe/PM/SSD (at the very least), without
an intervening cache?


Regards,
    Lars

-- 
Architect SDS, Distinguished Engineer
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-07  9:26   ` Lars Marowsky-Bree
@ 2016-04-07 10:31     ` Sebastien Han
  2016-04-07 10:51       ` Lars Marowsky-Bree
  0 siblings, 1 reply; 19+ messages in thread
From: Sebastien Han @ 2016-04-07 10:31 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: ceph-devel

One thing I'd like to see - and I know we discussed this with Loïc a
while ago - is ceph-disk being idempotent when it comes to device
preparation.
Running "ceph-disk prepare" against a device should result in an exit
code of 0 if the disk already has an OSD prepared on it (unless we do
something like --force, which would then zap the disk).
I had to implement this logic in ceph-ansible (and the guys from
chef/puppet probably did the same), so it's done now, but I'll be happy
to leave it to ceph-disk :).
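
Roughly the guard every deployment tool ends up scripting today (a sketch only;
"ceph-disk list" and "ceph-disk prepare" are real subcommands, the grep pattern
is an assumption about the list output):

# succeed without touching devices that already hold a prepared OSD
if ceph-disk list 2>/dev/null | grep /dev/sdb | grep -q "ceph data"; then
    exit 0                        # already prepared: idempotent success
else
    ceph-disk prepare /dev/sdb    # first run: actually prepare the OSD
fi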

Thoughts?

On Thu, Apr 7, 2016 at 11:26 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> On 2016-04-02T07:54:05, Wido den Hollander <wido@42on.com> wrote:
>
>>
>> > On 1 April 2016 at 17:36, Sage Weil <sweil@redhat.com> wrote:
>> >
>> >
>> > Hi all,
>> >
>> > There are a couple of looming features for ceph-disk:
>> >
>> > 1- Support for additional devices when using BlueStore.  There can be up
>> > to three: the main device, a WAL/journal device (small, ~128MB, ideally
>> > NVRAM), and a fast metadata device (as big as you have available; will be
>> > used for internal metadata).
>> >
>> > 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath
>> > filestore or bluestore.
>> >
>>
>> Keep in mind that you can't create a partition on a bcache device. So when using
>> bcache, the journal has to be filebased and not a partition.
>
> Is there a point to running the journal on bcache? I see a point for
> bcache for the (potentially on HDD) actual data store - but the
> WAL/journal, would that not be directly on NVMe/PM/SSD (at the very
> least)? Without an intervening cache?
>
>
> Regards,
>     Lars
>
> --
> Architect SDS, Distinguished Engineer
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers

––––––
Sébastien Han
Senior Cloud Architect

"Always give 100%. Unless you're giving blood."

Mail: seb@redhat.com
Address: 11 bis, rue Roquépine - 75008 Paris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-07 10:31     ` Sebastien Han
@ 2016-04-07 10:51       ` Lars Marowsky-Bree
  2016-04-07 11:45         ` Loic Dachary
  0 siblings, 1 reply; 19+ messages in thread
From: Lars Marowsky-Bree @ 2016-04-07 10:51 UTC (permalink / raw)
  To: ceph-devel

On 2016-04-07T12:31:21, Sebastien Han <shan@redhat.com> wrote:

> One thing I'd like to see and I know we discussed that with Loïc a
> while ago is ceph-disk being idempotent when it comes to the device
> preparation.
> Running "ceph-disk prepare" against a device should result in an exit
> 0 if the disk already has an OSD prepared on it (unless we do
> something like --force, which will then zap the disk).
> I had to implement this logic in ceph-ansible (and the guys from
> chef/puppet probably did the same), so now it's done but i'll be happy
> to leave ceph-disk doing it :).
> 
> Thoughts?

Basically, +1.

Owen (who works on salt-ceph) has a love/love relationship with the word
"idempotent" as well.

Note that it should probably only exit=0 if the OSD matches the current
fsid?


Regards,
    Lars

-- 
Architect SDS, Distinguished Engineer
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-07 10:51       ` Lars Marowsky-Bree
@ 2016-04-07 11:45         ` Loic Dachary
  0 siblings, 0 replies; 19+ messages in thread
From: Loic Dachary @ 2016-04-07 11:45 UTC (permalink / raw)
  To: Lars Marowsky-Bree, ceph-devel

Hi Lars,

There are some preliminary thoughts at http://tracker.ceph.com/issues/7475.

Cheers

On 07/04/2016 12:51, Lars Marowsky-Bree wrote:
> On 2016-04-07T12:31:21, Sebastien Han <shan@redhat.com> wrote:
> 
>> One thing I'd like to see and I know we discussed that with Loïc a
>> while ago is ceph-disk being idempotent when it comes to the device
>> preparation.
>> Running "ceph-disk prepare" against a device should result in an exit
>> 0 if the disk already has an OSD prepared on it (unless we do
>> something like --force, which will then zap the disk).
>> I had to implement this logic in ceph-ansible (and the guys from
>> chef/puppet probably did the same), so now it's done but i'll be happy
>> to leave ceph-disk doing it :).
>>
>> Thoughts?
> 
> Basically, +1.
> 
> Owen (who works on salt-ceph) has a love/love relationship with the word
> "idempotent" as well.
> 
> Note that it should probably only exit=0 if the OSD matches the current
> fsid?
> 
> 
> Regards,
>     Lars
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-01 15:36 ceph-disk improvements Sage Weil
  2016-04-02  5:54 ` Wido den Hollander
  2016-04-04  0:41 ` Adrian Saul
@ 2016-04-07 12:10 ` Alfredo Deza
  2016-06-23 15:41   ` Sage Weil
  2 siblings, 1 reply; 19+ messages in thread
From: Alfredo Deza @ 2016-04-07 12:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Apr 1, 2016 at 11:36 AM, Sage Weil <sweil@redhat.com> wrote:
> Hi all,
>
> There are a couple of looming features for ceph-disk:
>
> 1- Support for additional devices when using BlueStore.  There can be up
> to three: the main device, a WAL/journal device (small, ~128MB, ideally
> NVRAM), and a fast metadata device (as big as you have available; will be
> used for internal metadata).
>
> 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath
> filestore or bluestore.
>
> The current syntax of
>
>  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
>
> isn't terribly expressive.  For example, the journal device size is set
> via a config option, not on the command line.  For bluestore, the metadata
> device will probably want/need explicit user input so they can ensure it's
> 1/Nth of their SSD (if they have N HDDs to each SSD).
>
> And if we put dmcache in there, that partition will need to be sized too.

Sebastien's suggestion of allowing plugins for ceph-disk is ideal
here, because it would allow enabling extra functionality
(possibly at a faster release pace) without interfering with the
current syntax.

Reusing your examples, a "bluestore" plugin could be a sub-command:

    ceph-disk bluestore prepare [...]

Device sizes, extra flags, or overriding options would be clearly
separated because of the subcommand. The same would apply to
dm-cache, bcache, or whatever comes next.

>
> Another consideration is that right now we don't play nice with LVM at
> all.  Should we?  dm-cache is usually used in conjunction with LVM
> (although it doesn't have to be).  Does LVM provide value?  Like, the
> ability for users to add a second SSD to a box and migrate cache, wal, or
> journal partitions around?

One of the problematic corners of ceph-disk is that it tries to be
helpful by accurately predicting sizes and partitions to make things
simpler for the user. I would love to see ceph-disk be less flexible
here and require actual full devices for an OSD and a separate device
for a journal, while starting to deprecate journal collocation and
directory-backed OSDs.

Going back to the plugin idea, LVM support could be enabled by a
separate plugin and ceph-disk could stay lean.

>
> I'm interested in hearing feedback on requirements, approaches, and
> interfaces before we go too far down the road...
>
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ceph-disk improvements
  2016-04-07 12:10 ` Alfredo Deza
@ 2016-06-23 15:41   ` Sage Weil
  0 siblings, 0 replies; 19+ messages in thread
From: Sage Weil @ 2016-06-23 15:41 UTC (permalink / raw)
  To: Alfredo Deza; +Cc: ceph-devel

[Resurrecting an old thread!]

On Thu, 7 Apr 2016, Alfredo Deza wrote:

> On Fri, Apr 1, 2016 at 11:36 AM, Sage Weil <sweil@redhat.com> wrote:
> > Hi all,
> >
> > There are a couple of looming features for ceph-disk:
> >
> > 1- Support for additional devices when using BlueStore.  There can be up
> > to three: the main device, a WAL/journal device (small, ~128MB, ideally
> > NVRAM), and a fast metadata device (as big as you have available; will be
> > used for internal metadata).
> >
> > 2- Support for setting up dm-cache, bcache, and/or FlashCache underneath
> > filestore or bluestore.
> >
> > The current syntax of
> >
> >  ceph-disk prepare [--dmcrypt] [--bluestore] DATADEV [JOURNALDEV]
> >
> > isn't terribly expressive.  For example, the journal device size is set
> > via a config option, not on the command line.  For bluestore, the metadata
> > device will probably want/need explicit user input so they can ensure it's
> > 1/Nth of their SSD (if they have N HDDs to each SSD).
> >
> > And if we put dmcache in there, that partition will need to be sized too.
> 
> Sebastien's suggestion of allowing plugins for ceph-disk is ideal
> here, because it would allow to enable extra functionality
> (and possibly at a faster release pace) without interfering with the
> current syntax.
> 
> Reusing your examples, a "bluestore" plugin could be a sub-command:
> 
>     ceph-disk bluestore prepare [...]
> 
> Device size, extra flags or overriding options would be clearly
> separated because of the subcommand. This would be the same
> case for dm-cache, bcache, or whatever comes next.

I like this in principle, but I'm not sure how to make this coexist 
peacefully with the current usage.  Lots of tooling already does 
'ceph-disk prepare ...' and 'ceph-disk activate ...'.  We definitely can't 
break the activate portion (and in general that part has to be "magic" and 
leverage whatever plugins are needed in order to make the device go).  And 
for prepare, in my ideal world we'd be able to flip the switch on the 
'default' without changing the usage, so that legacy instantiations that 
targeted filestore would "just work" and start creating bluestore OSDs.
Is that too ambitious?

Maybe it is just the prepare path that matters, and the usage would 
go from

	ceph-disk prepare --cluster [cluster-name] --cluster-uuid [uuid] \
		--fs-type [ext4|xfs|btrfs] [data-path] [journal-path]

to

	ceph-disk prepare [plugin] ...

?  Or maybe it's simpler to do

	ceph-disk -p|--plugin <foo> prepare ...

I'm not sure it's quite so simple, though, because dm-cache and 
dm-crypt functionality are both orthogonal to filestore vs bluestore...

> > Another consideration is that right now we don't play nice with LVM at
> > all.  Should we?  dm-cache is usually used in conjunction with LVM
> > (although it doesn't have to be).  Does LVM provide value?  Like, the
> > ability for users to add a second SSD to a box and migrate cache, wal, or
> > journal partitions around?
> 
> One of the problematic corners of ceph-disk is that it tries to be 
> helpful by trying to predict accurately sizes and partitions to make it 
> simpler for a user. I would love to see ceph-disk be less flexible here 
> and require actual full devices for an OSD and a separate device for a 
> Journal, while starting to deprecate journal collocation and 
> directory-osd.

You mean you would have it not create partitions for you?  This might be a 
bit hard, since ceph-disk is pretty centered around creating labeled 
partitions.  We could feed it existing partitions without labels, but I 
think that would just make it much harder to use.

Perhaps we could standardize the syntax so that a partition is either 
size X, Y%, or the full/rest of the device, with sane defaults for 
each--but standard options.  For bluestore, for instance,

	ceph-disk prepare bluestore BASEDEV [--wal WALDEV] [--db DBDEV]

and any DEV looks like

	/dev/foo[=<full|x%|yM|zG>]

> Going back to the plugin idea, LVM support could be enabled by a 
> separate plugin and ceph-disk could stay lean.

We could do something like

	/dev/foo[,<gpt|lvm>][=<full|x%|yM|zG>]

e.g.,

	ceph-disk prepare bluestore /dev/sdb --wal /dev/sdc=128M
or
	ceph-disk prepare bluestore /dev/sdb,lvm --wal /dev/sdc,lvm=128M

?

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread
