All of lore.kernel.org
 help / color / mirror / Atom feed
* status of spdk
@ 2016-11-08 23:31 Yehuda Sadeh-Weinraub
  2016-11-08 23:40 ` Sage Weil
  2016-11-09  4:45 ` Haomai Wang
  0 siblings, 2 replies; 18+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2016-11-08 23:31 UTC (permalink / raw)
  To: Wang, Haomai, Weil, Sage; +Cc: ceph-devel

I just started looking at spdk, and have a few comments and questions.

First, it's not clear to me how we should handle build. At the moment
the spdk code resides as a submodule in the ceph tree, but it depends
on dpdk, which currently needs to be downloaded separately. We can add
it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
said, getting it to build was a bit tricky and I think it might be
broken with cmake. In order to get it working I resorted to building a
system library and use that.

The way to currently configure an osd to use bluestore with spdk is by
creating a symbolic link that replaces the bluestore 'block' device to
point to a file that has a name that is prefixed with 'spdk:'.
Originally I assumed that the suffix would be the nvme device id, but
it seems that it's not really needed, however, the file itself needs
to contain the device id (see
https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
minor fixes).

As I understand it, in order to support multiple osds on the same NVMe
device we have a few options. We can leverage NVMe namespaces, but
that's not supported on all devices. We can configure bluestore to
only use part of the device (device sharding? not sure if it supports
it). I think it's best if we could keep bluestore out of the loop
there and have the NVMe driver abstract multiple partitions of the
NVMe device. The idea is to be able to define multiple partitions on
the device (e.g., each partition will be defined by the offset, size,
and namespace), and have the osd set to use a specific partition.
We'll probably need a special tool to manage it, and potentially keep
the partition table information on the device itself. The tool could
also manage the creation of the block link. We should probably rethink
how the link is structure and what it points at.

Any thoughts?

Yehuda

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-08 23:31 status of spdk Yehuda Sadeh-Weinraub
@ 2016-11-08 23:40 ` Sage Weil
  2016-11-09  0:06   ` Yehuda Sadeh-Weinraub
  2016-11-09  4:49   ` Haomai Wang
  2016-11-09  4:45 ` Haomai Wang
  1 sibling, 2 replies; 18+ messages in thread
From: Sage Weil @ 2016-11-08 23:40 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: Wang, Haomai, ceph-devel

On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> I just started looking at spdk, and have a few comments and questions.
> 
> First, it's not clear to me how we should handle build. At the moment
> the spdk code resides as a submodule in the ceph tree, but it depends
> on dpdk, which currently needs to be downloaded separately. We can add
> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
> said, getting it to build was a bit tricky and I think it might be
> broken with cmake. In order to get it working I resorted to building a
> system library and use that.

Note that this PR is about to merge

	https://github.com/ceph/ceph/pull/10748

which adds the DPDK submodule, so hopefully this issue will go away when 
that merged or with a follow-on cleanup.

> The way to currently configure an osd to use bluestore with spdk is by
> creating a symbolic link that replaces the bluestore 'block' device to
> point to a file that has a name that is prefixed with 'spdk:'.
> Originally I assumed that the suffix would be the nvme device id, but
> it seems that it's not really needed, however, the file itself needs
> to contain the device id (see
> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
> minor fixes).

Open a PR for those?

> As I understand it, in order to support multiple osds on the same NVMe
> device we have a few options. We can leverage NVMe namespaces, but
> that's not supported on all devices. We can configure bluestore to
> only use part of the device (device sharding? not sure if it supports
> it). I think it's best if we could keep bluestore out of the loop
> there and have the NVMe driver abstract multiple partitions of the
> NVMe device. The idea is to be able to define multiple partitions on
> the device (e.g., each partition will be defined by the offset, size,
> and namespace), and have the osd set to use a specific partition.
> We'll probably need a special tool to manage it, and potentially keep
> the partition table information on the device itself. The tool could
> also manage the creation of the block link. We should probably rethink
> how the link is structure and what it points at.

I agree that bluestore shouldn't get involved.

Is the NVMe namespaces meant to support multiple processes sharing the 
same hardware device?

Also, if you do that, is it possible to give one of the namespaces to the 
kernel?  That might solve the bootstrapping problem we currently have 
where we have nowhere to put the $osd_data filesystem with the device 
metadata.  (This is admittedly not necessarily a blocking issue.  Putting 
those dirs on / wouldn't be the end of the world; it just means cards 
can't be easily moved between boxes.)

sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-08 23:40 ` Sage Weil
@ 2016-11-09  0:06   ` Yehuda Sadeh-Weinraub
  2016-11-09  0:21     ` LIU, Fei
  2016-11-09  4:49   ` Haomai Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2016-11-09  0:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Wang, Haomai, ceph-devel

On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
>> I just started looking at spdk, and have a few comments and questions.
>>
>> First, it's not clear to me how we should handle build. At the moment
>> the spdk code resides as a submodule in the ceph tree, but it depends
>> on dpdk, which currently needs to be downloaded separately. We can add
>> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
>> said, getting it to build was a bit tricky and I think it might be
>> broken with cmake. In order to get it working I resorted to building a
>> system library and use that.
>
> Note that this PR is about to merge
>
>         https://github.com/ceph/ceph/pull/10748
>
> which adds the DPDK submodule, so hopefully this issue will go away when
> that merged or with a follow-on cleanup.
>
>> The way to currently configure an osd to use bluestore with spdk is by
>> creating a symbolic link that replaces the bluestore 'block' device to
>> point to a file that has a name that is prefixed with 'spdk:'.
>> Originally I assumed that the suffix would be the nvme device id, but
>> it seems that it's not really needed, however, the file itself needs
>> to contain the device id (see
>> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
>> minor fixes).
>
> Open a PR for those?

Sure

>
>> As I understand it, in order to support multiple osds on the same NVMe
>> device we have a few options. We can leverage NVMe namespaces, but
>> that's not supported on all devices. We can configure bluestore to
>> only use part of the device (device sharding? not sure if it supports
>> it). I think it's best if we could keep bluestore out of the loop
>> there and have the NVMe driver abstract multiple partitions of the
>> NVMe device. The idea is to be able to define multiple partitions on
>> the device (e.g., each partition will be defined by the offset, size,
>> and namespace), and have the osd set to use a specific partition.
>> We'll probably need a special tool to manage it, and potentially keep
>> the partition table information on the device itself. The tool could
>> also manage the creation of the block link. We should probably rethink
>> how the link is structure and what it points at.
>
> I agree that bluestore shouldn't get involved.
>
> Is the NVMe namespaces meant to support multiple processes sharing the
> same hardware device?

More of a partitioning solution, but yes (as far as I undestand).

>
> Also, if you do that, is it possible to give one of the namespaces to the
> kernel?  That might solve the bootstrapping problem we currently have

Theoretically, but not right now (or ever?). See here:

https://lists.01.org/pipermail/spdk/2016-July/000073.html

> where we have nowhere to put the $osd_data filesystem with the device
> metadata.  (This is admittedly not necessarily a blocking issue.  Putting
> those dirs on / wouldn't be the end of the world; it just means cards
> can't be easily moved between boxes.)
>

Maybe we can use bluestore for these too ;) that been said, there
might be some kind of a loopback solution that could work, but not
sure if it won't create major bottlenecks that we'd want to avoid.

Yehuda

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-09  0:06   ` Yehuda Sadeh-Weinraub
@ 2016-11-09  0:21     ` LIU, Fei
  2016-11-09  2:45       ` Dong Wu
  2016-11-09  4:59       ` Haomai Wang
  0 siblings, 2 replies; 18+ messages in thread
From: LIU, Fei @ 2016-11-09  0:21 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub, Sage Weil; +Cc: Wang, Haomai, ceph-devel

Hi Yehuda and Haomai,
   The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?

   Regards,
   James

   

On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:

    On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
    > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
    >> I just started looking at spdk, and have a few comments and questions.
    >>
    >> First, it's not clear to me how we should handle build. At the moment
    >> the spdk code resides as a submodule in the ceph tree, but it depends
    >> on dpdk, which currently needs to be downloaded separately. We can add
    >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
    >> said, getting it to build was a bit tricky and I think it might be
    >> broken with cmake. In order to get it working I resorted to building a
    >> system library and use that.
    >
    > Note that this PR is about to merge
    >
    >         https://github.com/ceph/ceph/pull/10748
    >
    > which adds the DPDK submodule, so hopefully this issue will go away when
    > that merged or with a follow-on cleanup.
    >
    >> The way to currently configure an osd to use bluestore with spdk is by
    >> creating a symbolic link that replaces the bluestore 'block' device to
    >> point to a file that has a name that is prefixed with 'spdk:'.
    >> Originally I assumed that the suffix would be the nvme device id, but
    >> it seems that it's not really needed, however, the file itself needs
    >> to contain the device id (see
    >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
    >> minor fixes).
    >
    > Open a PR for those?
    
    Sure
    
    >
    >> As I understand it, in order to support multiple osds on the same NVMe
    >> device we have a few options. We can leverage NVMe namespaces, but
    >> that's not supported on all devices. We can configure bluestore to
    >> only use part of the device (device sharding? not sure if it supports
    >> it). I think it's best if we could keep bluestore out of the loop
    >> there and have the NVMe driver abstract multiple partitions of the
    >> NVMe device. The idea is to be able to define multiple partitions on
    >> the device (e.g., each partition will be defined by the offset, size,
    >> and namespace), and have the osd set to use a specific partition.
    >> We'll probably need a special tool to manage it, and potentially keep
    >> the partition table information on the device itself. The tool could
    >> also manage the creation of the block link. We should probably rethink
    >> how the link is structure and what it points at.
    >
    > I agree that bluestore shouldn't get involved.
    >
    > Is the NVMe namespaces meant to support multiple processes sharing the
    > same hardware device?
    
    More of a partitioning solution, but yes (as far as I undestand).
    
    >
    > Also, if you do that, is it possible to give one of the namespaces to the
    > kernel?  That might solve the bootstrapping problem we currently have
    
    Theoretically, but not right now (or ever?). See here:
    
    https://lists.01.org/pipermail/spdk/2016-July/000073.html
    
    > where we have nowhere to put the $osd_data filesystem with the device
    > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
    > those dirs on / wouldn't be the end of the world; it just means cards
    > can't be easily moved between boxes.)
    >
    
    Maybe we can use bluestore for these too ;) that been said, there
    might be some kind of a loopback solution that could work, but not
    sure if it won't create major bottlenecks that we'd want to avoid.
    
    Yehuda
    --
    To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-09  0:21     ` LIU, Fei
@ 2016-11-09  2:45       ` Dong Wu
  2016-11-09 20:53         ` Moreno, Orlando
  2016-11-09  4:59       ` Haomai Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Dong Wu @ 2016-11-09  2:45 UTC (permalink / raw)
  To: LIU, Fei; +Cc: Yehuda Sadeh-Weinraub, Sage Weil, Wang, Haomai, ceph-devel

Hi, Yehuda and Haomai,
    DPDK backend may have the same problem. I had tried to use
haomai's  PR: https://github.com/ceph/ceph/pull/10748 to test dpdk
backend, but failed to start multiple OSDs on the host with only one
network card, alse i read about the dpdk multi-process support:
http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did
not find any config  to set multi-process support. Anything wrong or
multi-process support not been implemented?

2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> Hi Yehuda and Haomai,
>    The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?
>
>    Regards,
>    James
>
>
>
> On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
>
>     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
>     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
>     >> I just started looking at spdk, and have a few comments and questions.
>     >>
>     >> First, it's not clear to me how we should handle build. At the moment
>     >> the spdk code resides as a submodule in the ceph tree, but it depends
>     >> on dpdk, which currently needs to be downloaded separately. We can add
>     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
>     >> said, getting it to build was a bit tricky and I think it might be
>     >> broken with cmake. In order to get it working I resorted to building a
>     >> system library and use that.
>     >
>     > Note that this PR is about to merge
>     >
>     >         https://github.com/ceph/ceph/pull/10748
>     >
>     > which adds the DPDK submodule, so hopefully this issue will go away when
>     > that merged or with a follow-on cleanup.
>     >
>     >> The way to currently configure an osd to use bluestore with spdk is by
>     >> creating a symbolic link that replaces the bluestore 'block' device to
>     >> point to a file that has a name that is prefixed with 'spdk:'.
>     >> Originally I assumed that the suffix would be the nvme device id, but
>     >> it seems that it's not really needed, however, the file itself needs
>     >> to contain the device id (see
>     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
>     >> minor fixes).
>     >
>     > Open a PR for those?
>
>     Sure
>
>     >
>     >> As I understand it, in order to support multiple osds on the same NVMe
>     >> device we have a few options. We can leverage NVMe namespaces, but
>     >> that's not supported on all devices. We can configure bluestore to
>     >> only use part of the device (device sharding? not sure if it supports
>     >> it). I think it's best if we could keep bluestore out of the loop
>     >> there and have the NVMe driver abstract multiple partitions of the
>     >> NVMe device. The idea is to be able to define multiple partitions on
>     >> the device (e.g., each partition will be defined by the offset, size,
>     >> and namespace), and have the osd set to use a specific partition.
>     >> We'll probably need a special tool to manage it, and potentially keep
>     >> the partition table information on the device itself. The tool could
>     >> also manage the creation of the block link. We should probably rethink
>     >> how the link is structure and what it points at.
>     >
>     > I agree that bluestore shouldn't get involved.
>     >
>     > Is the NVMe namespaces meant to support multiple processes sharing the
>     > same hardware device?
>
>     More of a partitioning solution, but yes (as far as I undestand).
>
>     >
>     > Also, if you do that, is it possible to give one of the namespaces to the
>     > kernel?  That might solve the bootstrapping problem we currently have
>
>     Theoretically, but not right now (or ever?). See here:
>
>     https://lists.01.org/pipermail/spdk/2016-July/000073.html
>
>     > where we have nowhere to put the $osd_data filesystem with the device
>     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
>     > those dirs on / wouldn't be the end of the world; it just means cards
>     > can't be easily moved between boxes.)
>     >
>
>     Maybe we can use bluestore for these too ;) that been said, there
>     might be some kind of a loopback solution that could work, but not
>     sure if it won't create major bottlenecks that we'd want to avoid.
>
>     Yehuda
>     --
>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     the body of a message to majordomo@vger.kernel.org
>     More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-08 23:31 status of spdk Yehuda Sadeh-Weinraub
  2016-11-08 23:40 ` Sage Weil
@ 2016-11-09  4:45 ` Haomai Wang
  1 sibling, 0 replies; 18+ messages in thread
From: Haomai Wang @ 2016-11-09  4:45 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: Weil, Sage, ceph-devel

On Wed, Nov 9, 2016 at 7:31 AM, Yehuda Sadeh-Weinraub <yehuda@redhat.com> wrote:
> I just started looking at spdk, and have a few comments and questions.
>
> First, it's not clear to me how we should handle build. At the moment
> the spdk code resides as a submodule in the ceph tree, but it depends
> on dpdk, which currently needs to be downloaded separately. We can add
> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
> said, getting it to build was a bit tricky and I think it might be
> broken with cmake. In order to get it working I resorted to building a
> system library and use that.

yes, because we expect dpdk submodule will merge soon. we left this aside..

now the eaisest way is yum install dpdk-devel to complete the build
instead of git clone dpdk repo separated.

>
> The way to currently configure an osd to use bluestore with spdk is by
> creating a symbolic link that replaces the bluestore 'block' device to
> point to a file that has a name that is prefixed with 'spdk:'.
> Originally I assumed that the suffix would be the nvme device id, but
> it seems that it's not really needed, however, the file itself needs
> to contain the device id (see
> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
> minor fixes).

hmm, I commented in config_opt.h.
// If you want to use spdk driver, you need to specify NVMe serial number here
// with "spdk:" prefix.
// Users can use 'lspci -vvv -d 8086:0953 | grep "Device Serial Number"' to
// get the serial number of Intel(R) Fultondale NVMe controllers.
// Example:
// bluestore_block_path = spdk:55cd2e404bd73932

we don't need to create symbolic link by hand, it could be done in
bluestore codes.

>
> As I understand it, in order to support multiple osds on the same NVMe
> device we have a few options. We can leverage NVMe namespaces, but
> that's not supported on all devices. We can configure bluestore to
> only use part of the device (device sharding? not sure if it supports
> it). I think it's best if we could keep bluestore out of the loop
> there and have the NVMe driver abstract multiple partitions of the
> NVMe device. The idea is to be able to define multiple partitions on
> the device (e.g., each partition will be defined by the offset, size,
> and namespace), and have the osd set to use a specific partition.
> We'll probably need a special tool to manage it, and potentially keep
> the partition table information on the device itself. The tool could
> also manage the creation of the block link. We should probably rethink
> how the link is structure and what it points at.

I discussed multi namespace with intel, spdk will embedded multi
namespace management.
But before ceph-osd single process can support multi OSD instance, I
think we need to do offset/length in application side.

Besides these problems, the most important thing is getting ride of
spdk dependence on dpdk. before multi-osd within single process
feature is done, we can't bear the multi polling threads occur 100%
cpu times.

>
> Any thoughts?
>
> Yehuda



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-08 23:40 ` Sage Weil
  2016-11-09  0:06   ` Yehuda Sadeh-Weinraub
@ 2016-11-09  4:49   ` Haomai Wang
  1 sibling, 0 replies; 18+ messages in thread
From: Haomai Wang @ 2016-11-09  4:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yehuda Sadeh-Weinraub, ceph-devel

On Wed, Nov 9, 2016 at 7:40 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
>> I just started looking at spdk, and have a few comments and questions.
>>
>> First, it's not clear to me how we should handle build. At the moment
>> the spdk code resides as a submodule in the ceph tree, but it depends
>> on dpdk, which currently needs to be downloaded separately. We can add
>> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
>> said, getting it to build was a bit tricky and I think it might be
>> broken with cmake. In order to get it working I resorted to building a
>> system library and use that.
>
> Note that this PR is about to merge
>
>         https://github.com/ceph/ceph/pull/10748
>
> which adds the DPDK submodule, so hopefully this issue will go away when
> that merged or with a follow-on cleanup.

I rebased and I think we can merge now.

>
>> The way to currently configure an osd to use bluestore with spdk is by
>> creating a symbolic link that replaces the bluestore 'block' device to
>> point to a file that has a name that is prefixed with 'spdk:'.
>> Originally I assumed that the suffix would be the nvme device id, but
>> it seems that it's not really needed, however, the file itself needs
>> to contain the device id (see
>> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
>> minor fixes).
>
> Open a PR for those?

yep!

>
>> As I understand it, in order to support multiple osds on the same NVMe
>> device we have a few options. We can leverage NVMe namespaces, but
>> that's not supported on all devices. We can configure bluestore to
>> only use part of the device (device sharding? not sure if it supports
>> it). I think it's best if we could keep bluestore out of the loop
>> there and have the NVMe driver abstract multiple partitions of the
>> NVMe device. The idea is to be able to define multiple partitions on
>> the device (e.g., each partition will be defined by the offset, size,
>> and namespace), and have the osd set to use a specific partition.
>> We'll probably need a special tool to manage it, and potentially keep
>> the partition table information on the device itself. The tool could
>> also manage the creation of the block link. We should probably rethink
>> how the link is structure and what it points at.
>
> I agree that bluestore shouldn't get involved.
>
> Is the NVMe namespaces meant to support multiple processes sharing the
> same hardware device?

sure

>
> Also, if you do that, is it possible to give one of the namespaces to the
> kernel?  That might solve the bootstrapping problem we currently have
> where we have nowhere to put the $osd_data filesystem with the device
> metadata.  (This is admittedly not necessarily a blocking issue.  Putting
> those dirs on / wouldn't be the end of the world; it just means cards
> can't be easily moved between boxes.)

the spdk community is make nvme-cli support spdk backend. by default
nvmecli only can operate kernel nvme module, but intel is working on
making spdk can be operated by nvmecli. so it will make users much
convenient.

>
> sage



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-09  0:21     ` LIU, Fei
  2016-11-09  2:45       ` Dong Wu
@ 2016-11-09  4:59       ` Haomai Wang
  2016-11-09  5:02         ` LIU, Fei
  1 sibling, 1 reply; 18+ messages in thread
From: Haomai Wang @ 2016-11-09  4:59 UTC (permalink / raw)
  To: LIU, Fei; +Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel

On Wed, Nov 9, 2016 at 8:21 AM, LIU, Fei <james.liu@alibaba-inc.com> wrote:
> Hi Yehuda and Haomai,
>    The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?

spdk nvme supports multi process is a undergoing spdk feature now, it
will be implemented via shared memory among multi process.

>
>    Regards,
>    James
>
>
>
> On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
>
>     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
>     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
>     >> I just started looking at spdk, and have a few comments and questions.
>     >>
>     >> First, it's not clear to me how we should handle build. At the moment
>     >> the spdk code resides as a submodule in the ceph tree, but it depends
>     >> on dpdk, which currently needs to be downloaded separately. We can add
>     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
>     >> said, getting it to build was a bit tricky and I think it might be
>     >> broken with cmake. In order to get it working I resorted to building a
>     >> system library and use that.
>     >
>     > Note that this PR is about to merge
>     >
>     >         https://github.com/ceph/ceph/pull/10748
>     >
>     > which adds the DPDK submodule, so hopefully this issue will go away when
>     > that merged or with a follow-on cleanup.
>     >
>     >> The way to currently configure an osd to use bluestore with spdk is by
>     >> creating a symbolic link that replaces the bluestore 'block' device to
>     >> point to a file that has a name that is prefixed with 'spdk:'.
>     >> Originally I assumed that the suffix would be the nvme device id, but
>     >> it seems that it's not really needed, however, the file itself needs
>     >> to contain the device id (see
>     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
>     >> minor fixes).
>     >
>     > Open a PR for those?
>
>     Sure
>
>     >
>     >> As I understand it, in order to support multiple osds on the same NVMe
>     >> device we have a few options. We can leverage NVMe namespaces, but
>     >> that's not supported on all devices. We can configure bluestore to
>     >> only use part of the device (device sharding? not sure if it supports
>     >> it). I think it's best if we could keep bluestore out of the loop
>     >> there and have the NVMe driver abstract multiple partitions of the
>     >> NVMe device. The idea is to be able to define multiple partitions on
>     >> the device (e.g., each partition will be defined by the offset, size,
>     >> and namespace), and have the osd set to use a specific partition.
>     >> We'll probably need a special tool to manage it, and potentially keep
>     >> the partition table information on the device itself. The tool could
>     >> also manage the creation of the block link. We should probably rethink
>     >> how the link is structure and what it points at.
>     >
>     > I agree that bluestore shouldn't get involved.
>     >
>     > Is the NVMe namespaces meant to support multiple processes sharing the
>     > same hardware device?
>
>     More of a partitioning solution, but yes (as far as I undestand).
>
>     >
>     > Also, if you do that, is it possible to give one of the namespaces to the
>     > kernel?  That might solve the bootstrapping problem we currently have
>
>     Theoretically, but not right now (or ever?). See here:
>
>     https://lists.01.org/pipermail/spdk/2016-July/000073.html
>
>     > where we have nowhere to put the $osd_data filesystem with the device
>     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
>     > those dirs on / wouldn't be the end of the world; it just means cards
>     > can't be easily moved between boxes.)
>     >
>
>     Maybe we can use bluestore for these too ;) that been said, there
>     might be some kind of a loopback solution that could work, but not
>     sure if it won't create major bottlenecks that we'd want to avoid.
>
>     Yehuda
>     --
>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     the body of a message to majordomo@vger.kernel.org
>     More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-09  4:59       ` Haomai Wang
@ 2016-11-09  5:02         ` LIU, Fei
  2016-11-09  5:09           ` Liu, Changpeng
  0 siblings, 1 reply; 18+ messages in thread
From: LIU, Fei @ 2016-11-09  5:02 UTC (permalink / raw)
  To: Haomai Wang, Liu, Changpeng; +Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel

Haomai,
   Thanks a lot.

   Regards,
   James

Hi Changpeng,
   Would you mind updating us about the status of multi processes support of spdk?

   Regards,
   James 

On 11/8/16, 8:59 PM, "Haomai Wang" <haomaiwang@gmail.com> wrote:

    On Wed, Nov 9, 2016 at 8:21 AM, LIU, Fei <james.liu@alibaba-inc.com> wrote:
    > Hi Yehuda and Haomai,
    >    The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?
    
    spdk nvme supports multi process is a undergoing spdk feature now, it
    will be implemented via shared memory among multi process.
    
    >
    >    Regards,
    >    James
    >
    >
    >
    > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
    >
    >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
    >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
    >     >> I just started looking at spdk, and have a few comments and questions.
    >     >>
    >     >> First, it's not clear to me how we should handle build. At the moment
    >     >> the spdk code resides as a submodule in the ceph tree, but it depends
    >     >> on dpdk, which currently needs to be downloaded separately. We can add
    >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
    >     >> said, getting it to build was a bit tricky and I think it might be
    >     >> broken with cmake. In order to get it working I resorted to building a
    >     >> system library and use that.
    >     >
    >     > Note that this PR is about to merge
    >     >
    >     >         https://github.com/ceph/ceph/pull/10748
    >     >
    >     > which adds the DPDK submodule, so hopefully this issue will go away when
    >     > that merged or with a follow-on cleanup.
    >     >
    >     >> The way to currently configure an osd to use bluestore with spdk is by
    >     >> creating a symbolic link that replaces the bluestore 'block' device to
    >     >> point to a file that has a name that is prefixed with 'spdk:'.
    >     >> Originally I assumed that the suffix would be the nvme device id, but
    >     >> it seems that it's not really needed, however, the file itself needs
    >     >> to contain the device id (see
    >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
    >     >> minor fixes).
    >     >
    >     > Open a PR for those?
    >
    >     Sure
    >
    >     >
    >     >> As I understand it, in order to support multiple osds on the same NVMe
    >     >> device we have a few options. We can leverage NVMe namespaces, but
    >     >> that's not supported on all devices. We can configure bluestore to
    >     >> only use part of the device (device sharding? not sure if it supports
    >     >> it). I think it's best if we could keep bluestore out of the loop
    >     >> there and have the NVMe driver abstract multiple partitions of the
    >     >> NVMe device. The idea is to be able to define multiple partitions on
    >     >> the device (e.g., each partition will be defined by the offset, size,
    >     >> and namespace), and have the osd set to use a specific partition.
    >     >> We'll probably need a special tool to manage it, and potentially keep
    >     >> the partition table information on the device itself. The tool could
    >     >> also manage the creation of the block link. We should probably rethink
    >     >> how the link is structure and what it points at.
    >     >
    >     > I agree that bluestore shouldn't get involved.
    >     >
    >     > Is the NVMe namespaces meant to support multiple processes sharing the
    >     > same hardware device?
    >
    >     More of a partitioning solution, but yes (as far as I undestand).
    >
    >     >
    >     > Also, if you do that, is it possible to give one of the namespaces to the
    >     > kernel?  That might solve the bootstrapping problem we currently have
    >
    >     Theoretically, but not right now (or ever?). See here:
    >
    >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
    >
    >     > where we have nowhere to put the $osd_data filesystem with the device
    >     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
    >     > those dirs on / wouldn't be the end of the world; it just means cards
    >     > can't be easily moved between boxes.)
    >     >
    >
    >     Maybe we can use bluestore for these too ;) that been said, there
    >     might be some kind of a loopback solution that could work, but not
    >     sure if it won't create major bottlenecks that we'd want to avoid.
    >
    >     Yehuda
    >     --
    >     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
    >     the body of a message to majordomo@vger.kernel.org
    >     More majordomo info at  http://vger.kernel.org/majordomo-info.html
    >
    >
    >
    
    
    
    -- 
    Best Regards,
    
    Wheat
    



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: status of spdk
  2016-11-09  5:02         ` LIU, Fei
@ 2016-11-09  5:09           ` Liu, Changpeng
  2016-11-09  5:23             ` LIU, Fei
  0 siblings, 1 reply; 18+ messages in thread
From: Liu, Changpeng @ 2016-11-09  5:09 UTC (permalink / raw)
  To: LIU, Fei, Haomai Wang
  Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel, Cao, Gang, Yang,
	Ziye, Dai, Qihua, Harris, James R

Hi James,

Yes, the multi processes support of SPDK is under development, Gang is the developer for the feature of  SPDK.
We are targeting to release the feature in 16.12 version for SPDK(WW50).


> -----Original Message-----
> From: LIU, Fei [mailto:james.liu@alibaba-inc.com]
> Sent: Wednesday, November 9, 2016 1:03 PM
> To: Haomai Wang <haomaiwang@gmail.com>; Liu, Changpeng
> <changpeng.liu@intel.com>
> Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil
> <sweil@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: status of spdk
> 
> Haomai,
>    Thanks a lot.
> 
>    Regards,
>    James
> 
> Hi Changpeng,
>    Would you mind updating us about the status of multi processes support of
> spdk?
> 
>    Regards,
>    James
> 
> On 11/8/16, 8:59 PM, "Haomai Wang" <haomaiwang@gmail.com> wrote:
> 
>     On Wed, Nov 9, 2016 at 8:21 AM, LIU, Fei <james.liu@alibaba-inc.com> wrote:
>     > Hi Yehuda and Haomai,
>     >    The issue of drives driven by SPDK is not able to be shared by multiple OSDs
> as kernel NVMe drive since SPDK as a process so far can not be shared across
> multiple processes like OSDs, right?
> 
>     spdk nvme supports multi process is a undergoing spdk feature now, it
>     will be implemented via shared memory among multi process.
> 
>     >
>     >    Regards,
>     >    James
>     >
>     >
>     >
>     > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-
> owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
>     >
>     >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
>     >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
>     >     >> I just started looking at spdk, and have a few comments and questions.
>     >     >>
>     >     >> First, it's not clear to me how we should handle build. At the moment
>     >     >> the spdk code resides as a submodule in the ceph tree, but it depends
>     >     >> on dpdk, which currently needs to be downloaded separately. We can
> add
>     >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
>     >     >> said, getting it to build was a bit tricky and I think it might be
>     >     >> broken with cmake. In order to get it working I resorted to building a
>     >     >> system library and use that.
>     >     >
>     >     > Note that this PR is about to merge
>     >     >
>     >     >         https://github.com/ceph/ceph/pull/10748
>     >     >
>     >     > which adds the DPDK submodule, so hopefully this issue will go away
> when
>     >     > that merged or with a follow-on cleanup.
>     >     >
>     >     >> The way to currently configure an osd to use bluestore with spdk is by
>     >     >> creating a symbolic link that replaces the bluestore 'block' device to
>     >     >> point to a file that has a name that is prefixed with 'spdk:'.
>     >     >> Originally I assumed that the suffix would be the nvme device id, but
>     >     >> it seems that it's not really needed, however, the file itself needs
>     >     >> to contain the device id (see
>     >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple
> of
>     >     >> minor fixes).
>     >     >
>     >     > Open a PR for those?
>     >
>     >     Sure
>     >
>     >     >
>     >     >> As I understand it, in order to support multiple osds on the same NVMe
>     >     >> device we have a few options. We can leverage NVMe namespaces, but
>     >     >> that's not supported on all devices. We can configure bluestore to
>     >     >> only use part of the device (device sharding? not sure if it supports
>     >     >> it). I think it's best if we could keep bluestore out of the loop
>     >     >> there and have the NVMe driver abstract multiple partitions of the
>     >     >> NVMe device. The idea is to be able to define multiple partitions on
>     >     >> the device (e.g., each partition will be defined by the offset, size,
>     >     >> and namespace), and have the osd set to use a specific partition.
>     >     >> We'll probably need a special tool to manage it, and potentially keep
>     >     >> the partition table information on the device itself. The tool could
>     >     >> also manage the creation of the block link. We should probably rethink
>     >     >> how the link is structure and what it points at.
>     >     >
>     >     > I agree that bluestore shouldn't get involved.
>     >     >
>     >     > Is the NVMe namespaces meant to support multiple processes sharing
> the
>     >     > same hardware device?
>     >
>     >     More of a partitioning solution, but yes (as far as I undestand).
>     >
>     >     >
>     >     > Also, if you do that, is it possible to give one of the namespaces to the
>     >     > kernel?  That might solve the bootstrapping problem we currently have
>     >
>     >     Theoretically, but not right now (or ever?). See here:
>     >
>     >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
>     >
>     >     > where we have nowhere to put the $osd_data filesystem with the device
>     >     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
>     >     > those dirs on / wouldn't be the end of the world; it just means cards
>     >     > can't be easily moved between boxes.)
>     >     >
>     >
>     >     Maybe we can use bluestore for these too ;) that been said, there
>     >     might be some kind of a loopback solution that could work, but not
>     >     sure if it won't create major bottlenecks that we'd want to avoid.
>     >
>     >     Yehuda
>     >     --
>     >     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     >     the body of a message to majordomo@vger.kernel.org
>     >     More majordomo info at  http://vger.kernel.org/majordomo-info.html
>     >
>     >
>     >
> 
> 
> 
>     --
>     Best Regards,
> 
>     Wheat
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-09  5:09           ` Liu, Changpeng
@ 2016-11-09  5:23             ` LIU, Fei
  0 siblings, 0 replies; 18+ messages in thread
From: LIU, Fei @ 2016-11-09  5:23 UTC (permalink / raw)
  To: Liu, Changpeng, Haomai Wang
  Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel, Cao, Gang, Yang,
	Ziye, Dai, Qihua, Harris, James R

Hi Changpeng,
  Thanks a lot for your update.

  Regards,
  James

On 11/8/16, 9:09 PM, "Liu, Changpeng" <changpeng.liu@intel.com> wrote:

    Hi James,
    
    Yes, the multi processes support of SPDK is under development, Gang is the developer for the feature of  SPDK.
    We are targeting to release the feature in 16.12 version for SPDK(WW50).
    
    
    > -----Original Message-----
    > From: LIU, Fei [mailto:james.liu@alibaba-inc.com]
    > Sent: Wednesday, November 9, 2016 1:03 PM
    > To: Haomai Wang <haomaiwang@gmail.com>; Liu, Changpeng
    > <changpeng.liu@intel.com>
    > Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil
    > <sweil@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
    > Subject: Re: status of spdk
    > 
    > Haomai,
    >    Thanks a lot.
    > 
    >    Regards,
    >    James
    > 
    > Hi Changpeng,
    >    Would you mind updating us about the status of multi processes support of
    > spdk?
    > 
    >    Regards,
    >    James
    > 
    > On 11/8/16, 8:59 PM, "Haomai Wang" <haomaiwang@gmail.com> wrote:
    > 
    >     On Wed, Nov 9, 2016 at 8:21 AM, LIU, Fei <james.liu@alibaba-inc.com> wrote:
    >     > Hi Yehuda and Haomai,
    >     >    The issue of drives driven by SPDK is not able to be shared by multiple OSDs
    > as kernel NVMe drive since SPDK as a process so far can not be shared across
    > multiple processes like OSDs, right?
    > 
    >     spdk nvme supports multi process is a undergoing spdk feature now, it
    >     will be implemented via shared memory among multi process.
    > 
    >     >
    >     >    Regards,
    >     >    James
    >     >
    >     >
    >     >
    >     > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-
    > owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
    >     >
    >     >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
    >     >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
    >     >     >> I just started looking at spdk, and have a few comments and questions.
    >     >     >>
    >     >     >> First, it's not clear to me how we should handle build. At the moment
    >     >     >> the spdk code resides as a submodule in the ceph tree, but it depends
    >     >     >> on dpdk, which currently needs to be downloaded separately. We can
    > add
    >     >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
    >     >     >> said, getting it to build was a bit tricky and I think it might be
    >     >     >> broken with cmake. In order to get it working I resorted to building a
    >     >     >> system library and use that.
    >     >     >
    >     >     > Note that this PR is about to merge
    >     >     >
    >     >     >         https://github.com/ceph/ceph/pull/10748
    >     >     >
    >     >     > which adds the DPDK submodule, so hopefully this issue will go away
    > when
    >     >     > that merged or with a follow-on cleanup.
    >     >     >
    >     >     >> The way to currently configure an osd to use bluestore with spdk is by
    >     >     >> creating a symbolic link that replaces the bluestore 'block' device to
    >     >     >> point to a file that has a name that is prefixed with 'spdk:'.
    >     >     >> Originally I assumed that the suffix would be the nvme device id, but
    >     >     >> it seems that it's not really needed, however, the file itself needs
    >     >     >> to contain the device id (see
    >     >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple
    > of
    >     >     >> minor fixes).
    >     >     >
    >     >     > Open a PR for those?
    >     >
    >     >     Sure
    >     >
    >     >     >
    >     >     >> As I understand it, in order to support multiple osds on the same NVMe
    >     >     >> device we have a few options. We can leverage NVMe namespaces, but
    >     >     >> that's not supported on all devices. We can configure bluestore to
    >     >     >> only use part of the device (device sharding? not sure if it supports
    >     >     >> it). I think it's best if we could keep bluestore out of the loop
    >     >     >> there and have the NVMe driver abstract multiple partitions of the
    >     >     >> NVMe device. The idea is to be able to define multiple partitions on
    >     >     >> the device (e.g., each partition will be defined by the offset, size,
    >     >     >> and namespace), and have the osd set to use a specific partition.
    >     >     >> We'll probably need a special tool to manage it, and potentially keep
    >     >     >> the partition table information on the device itself. The tool could
    >     >     >> also manage the creation of the block link. We should probably rethink
    >     >     >> how the link is structure and what it points at.
    >     >     >
    >     >     > I agree that bluestore shouldn't get involved.
    >     >     >
    >     >     > Is the NVMe namespaces meant to support multiple processes sharing
    > the
    >     >     > same hardware device?
    >     >
    >     >     More of a partitioning solution, but yes (as far as I undestand).
    >     >
    >     >     >
    >     >     > Also, if you do that, is it possible to give one of the namespaces to the
    >     >     > kernel?  That might solve the bootstrapping problem we currently have
    >     >
    >     >     Theoretically, but not right now (or ever?). See here:
    >     >
    >     >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
    >     >
    >     >     > where we have nowhere to put the $osd_data filesystem with the device
    >     >     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
    >     >     > those dirs on / wouldn't be the end of the world; it just means cards
    >     >     > can't be easily moved between boxes.)
    >     >     >
    >     >
    >     >     Maybe we can use bluestore for these too ;) that been said, there
    >     >     might be some kind of a loopback solution that could work, but not
    >     >     sure if it won't create major bottlenecks that we'd want to avoid.
    >     >
    >     >     Yehuda
    >     >     --
    >     >     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
    >     >     the body of a message to majordomo@vger.kernel.org
    >     >     More majordomo info at  http://vger.kernel.org/majordomo-info.html
    >     >
    >     >
    >     >
    > 
    > 
    > 
    >     --
    >     Best Regards,
    > 
    >     Wheat
    > 
    > 
    
    



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: status of spdk
  2016-11-09  2:45       ` Dong Wu
@ 2016-11-09 20:53         ` Moreno, Orlando
  2016-11-09 20:58           ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Moreno, Orlando @ 2016-11-09 20:53 UTC (permalink / raw)
  To: Dong Wu, LIU, Fei
  Cc: Yehuda Sadeh-Weinraub, Sage Weil, Wang, Haomai, ceph-devel,
	'ifedotov@mirantis.com'

Hi all,

Multiple DPDK/SPDK instances on a single host does not work because the current implementation in Ceph does not support it. This issue is tracked here: http://tracker.ceph.com/issues/16966 There is multi-process support in DPDK, but you must configure the EAL correctly for it to work. I have been working on a patch, https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user to configure multiple BlueStore OSDs backed by SPDK. Though this patch works, I think it needs a few additions to actually make it performant.

This is just to get the 1 OSD process per NVMe case working. A multi-OSD per NVMe solution will probably require more work as described in this thread.

Thanks,
Orlando


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
Sent: Tuesday, November 8, 2016 7:45 PM
To: LIU, Fei <james.liu@alibaba-inc.com>
Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: status of spdk

Hi, Yehuda and Haomai,
    DPDK backend may have the same problem. I had tried to use haomai's  PR: https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to start multiple OSDs on the host with only one network card, alse i read about the dpdk multi-process support:
http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not find any config  to set multi-process support. Anything wrong or multi-process support not been implemented?

2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> Hi Yehuda and Haomai,
>    The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?
>
>    Regards,
>    James
>
>
>
> On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
>
>     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
>     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
>     >> I just started looking at spdk, and have a few comments and questions.
>     >>
>     >> First, it's not clear to me how we should handle build. At the moment
>     >> the spdk code resides as a submodule in the ceph tree, but it depends
>     >> on dpdk, which currently needs to be downloaded separately. We can add
>     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
>     >> said, getting it to build was a bit tricky and I think it might be
>     >> broken with cmake. In order to get it working I resorted to building a
>     >> system library and use that.
>     >
>     > Note that this PR is about to merge
>     >
>     >         https://github.com/ceph/ceph/pull/10748
>     >
>     > which adds the DPDK submodule, so hopefully this issue will go away when
>     > that merged or with a follow-on cleanup.
>     >
>     >> The way to currently configure an osd to use bluestore with spdk is by
>     >> creating a symbolic link that replaces the bluestore 'block' device to
>     >> point to a file that has a name that is prefixed with 'spdk:'.
>     >> Originally I assumed that the suffix would be the nvme device id, but
>     >> it seems that it's not really needed, however, the file itself needs
>     >> to contain the device id (see
>     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
>     >> minor fixes).
>     >
>     > Open a PR for those?
>
>     Sure
>
>     >
>     >> As I understand it, in order to support multiple osds on the same NVMe
>     >> device we have a few options. We can leverage NVMe namespaces, but
>     >> that's not supported on all devices. We can configure bluestore to
>     >> only use part of the device (device sharding? not sure if it supports
>     >> it). I think it's best if we could keep bluestore out of the loop
>     >> there and have the NVMe driver abstract multiple partitions of the
>     >> NVMe device. The idea is to be able to define multiple partitions on
>     >> the device (e.g., each partition will be defined by the offset, size,
>     >> and namespace), and have the osd set to use a specific partition.
>     >> We'll probably need a special tool to manage it, and potentially keep
>     >> the partition table information on the device itself. The tool could
>     >> also manage the creation of the block link. We should probably rethink
>     >> how the link is structure and what it points at.
>     >
>     > I agree that bluestore shouldn't get involved.
>     >
>     > Is the NVMe namespaces meant to support multiple processes sharing the
>     > same hardware device?
>
>     More of a partitioning solution, but yes (as far as I undestand).
>
>     >
>     > Also, if you do that, is it possible to give one of the namespaces to the
>     > kernel?  That might solve the bootstrapping problem we currently 
> have
>
>     Theoretically, but not right now (or ever?). See here:
>
>     https://lists.01.org/pipermail/spdk/2016-July/000073.html
>
>     > where we have nowhere to put the $osd_data filesystem with the device
>     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
>     > those dirs on / wouldn't be the end of the world; it just means cards
>     > can't be easily moved between boxes.)
>     >
>
>     Maybe we can use bluestore for these too ;) that been said, there
>     might be some kind of a loopback solution that could work, but not
>     sure if it won't create major bottlenecks that we'd want to avoid.
>
>     Yehuda
>     --
>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     the body of a message to majordomo@vger.kernel.org
>     More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: status of spdk
  2016-11-09 20:53         ` Moreno, Orlando
@ 2016-11-09 20:58           ` Sage Weil
  2016-11-09 21:00             ` Gohad, Tushar
  2016-11-09 21:10             ` Gohad, Tushar
  0 siblings, 2 replies; 18+ messages in thread
From: Sage Weil @ 2016-11-09 20:58 UTC (permalink / raw)
  To: Moreno, Orlando
  Cc: Dong Wu, LIU, Fei, Yehuda Sadeh-Weinraub, Wang, Haomai,
	ceph-devel, 'ifedotov@mirantis.com'

On Wed, 9 Nov 2016, Moreno, Orlando wrote:
> Hi all,
> 
> Multiple DPDK/SPDK instances on a single host does not work because the 
> current implementation in Ceph does not support it. This issue is 
> tracked here: http://tracker.ceph.com/issues/16966 There is 
> multi-process support in DPDK, but you must configure the EAL correctly 
> for it to work. I have been working on a patch, 
> https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user to 
> configure multiple BlueStore OSDs backed by SPDK. Though this patch 
> works, I think it needs a few additions to actually make it performant.
> 
> This is just to get the 1 OSD process per NVMe case working. A multi-OSD 
> per NVMe solution will probably require more work as described in this 
> thread.

TBH I'm not sure how important the multi-osd per NVMe case is.  The only 
reason to do that would be performance bottlenecks within the OSD itself, 
and I'd rather focus our efforts on eliminating those than on enabling a 
bandaid solution.

As I understand it the scenarios that are most interesting are

1- sharing the same network device to multiple osds with DPDK (this will 
presumably be pretty common unless/until we combine many OSDs into a 
single process), and

2- sharing a tiny portion of the NVMe device for the osd_data (usually a 
few MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).  
Not sure this will be feasible or not.

As I think Haomai mentioned, the next barrier is probably the requirements 
around DPDK event loop and dedicated core?

sage



 > 
> Thanks,
> Orlando
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
> Sent: Tuesday, November 8, 2016 7:45 PM
> To: LIU, Fei <james.liu@alibaba-inc.com>
> Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: status of spdk
> 
> Hi, Yehuda and Haomai,
>     DPDK backend may have the same problem. I had tried to use haomai's  PR: https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to start multiple OSDs on the host with only one network card, alse i read about the dpdk multi-process support:
> http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not find any config  to set multi-process support. Anything wrong or multi-process support not been implemented?
> 
> 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> > Hi Yehuda and Haomai,
> >    The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?
> >
> >    Regards,
> >    James
> >
> >
> >
> > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
> >
> >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
> >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> >     >> I just started looking at spdk, and have a few comments and questions.
> >     >>
> >     >> First, it's not clear to me how we should handle build. At the moment
> >     >> the spdk code resides as a submodule in the ceph tree, but it depends
> >     >> on dpdk, which currently needs to be downloaded separately. We can add
> >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
> >     >> said, getting it to build was a bit tricky and I think it might be
> >     >> broken with cmake. In order to get it working I resorted to building a
> >     >> system library and use that.
> >     >
> >     > Note that this PR is about to merge
> >     >
> >     >         https://github.com/ceph/ceph/pull/10748
> >     >
> >     > which adds the DPDK submodule, so hopefully this issue will go away when
> >     > that merged or with a follow-on cleanup.
> >     >
> >     >> The way to currently configure an osd to use bluestore with spdk is by
> >     >> creating a symbolic link that replaces the bluestore 'block' device to
> >     >> point to a file that has a name that is prefixed with 'spdk:'.
> >     >> Originally I assumed that the suffix would be the nvme device id, but
> >     >> it seems that it's not really needed, however, the file itself needs
> >     >> to contain the device id (see
> >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
> >     >> minor fixes).
> >     >
> >     > Open a PR for those?
> >
> >     Sure
> >
> >     >
> >     >> As I understand it, in order to support multiple osds on the same NVMe
> >     >> device we have a few options. We can leverage NVMe namespaces, but
> >     >> that's not supported on all devices. We can configure bluestore to
> >     >> only use part of the device (device sharding? not sure if it supports
> >     >> it). I think it's best if we could keep bluestore out of the loop
> >     >> there and have the NVMe driver abstract multiple partitions of the
> >     >> NVMe device. The idea is to be able to define multiple partitions on
> >     >> the device (e.g., each partition will be defined by the offset, size,
> >     >> and namespace), and have the osd set to use a specific partition.
> >     >> We'll probably need a special tool to manage it, and potentially keep
> >     >> the partition table information on the device itself. The tool could
> >     >> also manage the creation of the block link. We should probably rethink
> >     >> how the link is structure and what it points at.
> >     >
> >     > I agree that bluestore shouldn't get involved.
> >     >
> >     > Is the NVMe namespaces meant to support multiple processes sharing the
> >     > same hardware device?
> >
> >     More of a partitioning solution, but yes (as far as I undestand).
> >
> >     >
> >     > Also, if you do that, is it possible to give one of the namespaces to the
> >     > kernel?  That might solve the bootstrapping problem we currently 
> > have
> >
> >     Theoretically, but not right now (or ever?). See here:
> >
> >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> >
> >     > where we have nowhere to put the $osd_data filesystem with the device
> >     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
> >     > those dirs on / wouldn't be the end of the world; it just means cards
> >     > can't be easily moved between boxes.)
> >     >
> >
> >     Maybe we can use bluestore for these too ;) that been said, there
> >     might be some kind of a loopback solution that could work, but not
> >     sure if it won't create major bottlenecks that we'd want to avoid.
> >
> >     Yehuda
> >     --
> >     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >     the body of a message to majordomo@vger.kernel.org
> >     More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: status of spdk
  2016-11-09 20:58           ` Sage Weil
@ 2016-11-09 21:00             ` Gohad, Tushar
  2016-11-09 21:10             ` Gohad, Tushar
  1 sibling, 0 replies; 18+ messages in thread
From: Gohad, Tushar @ 2016-11-09 21:00 UTC (permalink / raw)
  To: Sage Weil, Moreno, Orlando
  Cc: Dong Wu, LIU, Fei, Yehuda Sadeh-Weinraub, Wang, Haomai,
	ceph-devel, 'ifedotov@mirantis.com'



-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, November 9, 2016 1:59 PM
To: Moreno, Orlando <orlando.moreno@intel.com>
Cc: Dong Wu <archer.wudong@gmail.com>; LIU, Fei <james.liu@alibaba-inc.com>; Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel <ceph-devel@vger.kernel.org>; 'ifedotov@mirantis.com' <ifedotov@mirantis.com>
Subject: RE: status of spdk

On Wed, 9 Nov 2016, Moreno, Orlando wrote:
> Hi all,
> 
> Multiple DPDK/SPDK instances on a single host does not work because 
> the current implementation in Ceph does not support it. This issue is 
> tracked here: http://tracker.ceph.com/issues/16966 There is 
> multi-process support in DPDK, but you must configure the EAL 
> correctly for it to work. I have been working on a patch, 
> https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user 
> to configure multiple BlueStore OSDs backed by SPDK. Though this patch 
> works, I think it needs a few additions to actually make it performant.
> 
> This is just to get the 1 OSD process per NVMe case working. A 
> multi-OSD per NVMe solution will probably require more work as 
> described in this thread.

TBH I'm not sure how important the multi-osd per NVMe case is.  The only reason to do that would be performance bottlenecks within the OSD itself, and I'd rather focus our efforts on eliminating those than on enabling a bandaid solution.

As I understand it the scenarios that are most interesting are

1- sharing the same network device to multiple osds with DPDK (this will presumably be pretty common unless/until we combine many OSDs into a single process), and

2- sharing a tiny portion of the NVMe device for the osd_data (usually a few MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).  
Not sure this will be feasible or not.

As I think Haomai mentioned, the next barrier is probably the requirements around DPDK event loop and dedicated core?

sage



 > 
> Thanks,
> Orlando
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
> Sent: Tuesday, November 8, 2016 7:45 PM
> To: LIU, Fei <james.liu@alibaba-inc.com>
> Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil 
> <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel 
> <ceph-devel@vger.kernel.org>
> Subject: Re: status of spdk
> 
> Hi, Yehuda and Haomai,
>     DPDK backend may have the same problem. I had tried to use haomai's  PR: https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to start multiple OSDs on the host with only one network card, alse i read about the dpdk multi-process support:
> http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not find any config  to set multi-process support. Anything wrong or multi-process support not been implemented?
> 
> 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> > Hi Yehuda and Haomai,
> >    The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?
> >
> >    Regards,
> >    James
> >
> >
> >
> > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
> >
> >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
> >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> >     >> I just started looking at spdk, and have a few comments and questions.
> >     >>
> >     >> First, it's not clear to me how we should handle build. At the moment
> >     >> the spdk code resides as a submodule in the ceph tree, but it depends
> >     >> on dpdk, which currently needs to be downloaded separately. We can add
> >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
> >     >> said, getting it to build was a bit tricky and I think it might be
> >     >> broken with cmake. In order to get it working I resorted to building a
> >     >> system library and use that.
> >     >
> >     > Note that this PR is about to merge
> >     >
> >     >         https://github.com/ceph/ceph/pull/10748
> >     >
> >     > which adds the DPDK submodule, so hopefully this issue will go away when
> >     > that merged or with a follow-on cleanup.
> >     >
> >     >> The way to currently configure an osd to use bluestore with spdk is by
> >     >> creating a symbolic link that replaces the bluestore 'block' device to
> >     >> point to a file that has a name that is prefixed with 'spdk:'.
> >     >> Originally I assumed that the suffix would be the nvme device id, but
> >     >> it seems that it's not really needed, however, the file itself needs
> >     >> to contain the device id (see
> >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
> >     >> minor fixes).
> >     >
> >     > Open a PR for those?
> >
> >     Sure
> >
> >     >
> >     >> As I understand it, in order to support multiple osds on the same NVMe
> >     >> device we have a few options. We can leverage NVMe namespaces, but
> >     >> that's not supported on all devices. We can configure bluestore to
> >     >> only use part of the device (device sharding? not sure if it supports
> >     >> it). I think it's best if we could keep bluestore out of the loop
> >     >> there and have the NVMe driver abstract multiple partitions of the
> >     >> NVMe device. The idea is to be able to define multiple partitions on
> >     >> the device (e.g., each partition will be defined by the offset, size,
> >     >> and namespace), and have the osd set to use a specific partition.
> >     >> We'll probably need a special tool to manage it, and potentially keep
> >     >> the partition table information on the device itself. The tool could
> >     >> also manage the creation of the block link. We should probably rethink
> >     >> how the link is structure and what it points at.
> >     >
> >     > I agree that bluestore shouldn't get involved.
> >     >
> >     > Is the NVMe namespaces meant to support multiple processes sharing the
> >     > same hardware device?
> >
> >     More of a partitioning solution, but yes (as far as I undestand).
> >
> >     >
> >     > Also, if you do that, is it possible to give one of the namespaces to the
> >     > kernel?  That might solve the bootstrapping problem we 
> > currently have
> >
> >     Theoretically, but not right now (or ever?). See here:
> >
> >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> >
> >     > where we have nowhere to put the $osd_data filesystem with the device
> >     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
> >     > those dirs on / wouldn't be the end of the world; it just means cards
> >     > can't be easily moved between boxes.)
> >     >
> >
> >     Maybe we can use bluestore for these too ;) that been said, there
> >     might be some kind of a loopback solution that could work, but not
> >     sure if it won't create major bottlenecks that we'd want to avoid.
> >
> >     Yehuda
> >     --
> >     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >     the body of a message to majordomo@vger.kernel.org
> >     More majordomo info at  
> > http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: status of spdk
  2016-11-09 20:58           ` Sage Weil
  2016-11-09 21:00             ` Gohad, Tushar
@ 2016-11-09 21:10             ` Gohad, Tushar
  2016-11-10 22:39               ` Walker, Benjamin
  1 sibling, 1 reply; 18+ messages in thread
From: Gohad, Tushar @ 2016-11-09 21:10 UTC (permalink / raw)
  To: Sage Weil, Moreno, Orlando
  Cc: Dong Wu, LIU, Fei, Yehuda Sadeh-Weinraub, Wang, Haomai,
	ceph-devel, 'ifedotov@mirantis.com'

>> Multiple DPDK/SPDK instances on a single host does not work because 
>> the current implementation in Ceph does not support it. This issue is 
>> tracked here: http://tracker.ceph.com/issues/16966 There is 
>> multi-process support in DPDK, but you must configure the EAL 
>> correctly for it to work. I have been working on a patch, 
>> https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user 
>> to configure multiple BlueStore OSDs backed by SPDK. Though this patch 
>> works, I think it needs a few additions to actually make it performant.
>> This is just to get the 1 OSD process per NVMe case working. A 
>> multi-OSD per NVMe solution will probably require more work as 
>> described in this thread.

> TBH I'm not sure how important the multi-osd per NVMe case is.  
> The only reason to do that would be performance bottlenecks within the 
> OSD itself, and I'd rather focus our efforts on eliminating those than on 
> enabling a bandaid solution.

Completely agree here.

> As I understand it the scenarios that are most interesting are

> 1- sharing the same network device to multiple osds with DPDK (this will presumably be pretty common unless/until we combine many OSDs into a single process), and

> 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).  
> Not sure this will be feasible or not.

Unfortunately, this is not feasible without some form of partitioning support in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK bdev) - the latter is under development at the moment.

The limitation that Orlando identified is, not being able to launch multiple SPDK-based OSDs on a node today.  Igor and Orlando's PR (16966) is to add a config option to limit the number of hugepages assigned to an OSD via an EAL switch.  The other limitation today is being able to specify the CPU mask assigned to each OSD which would require config addition.

Tushar 


> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
> Sent: Tuesday, November 8, 2016 7:45 PM
> To: LIU, Fei <james.liu@alibaba-inc.com>
> Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil 
> <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel 
> <ceph-devel@vger.kernel.org>
> Subject: Re: status of spdk
> 
> Hi, Yehuda and Haomai,
>     DPDK backend may have the same problem. I had tried to use haomai's  PR: https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to start multiple OSDs on the host with only one network card, alse i read about the dpdk multi-process support:
> http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not find any config  to set multi-process support. Anything wrong or multi-process support not been implemented?
> 
> 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> > Hi Yehuda and Haomai,
> >    The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?
> >
> >    Regards,
> >    James
> >
> >
> >
> > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
> >
> >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
> >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> >     >> I just started looking at spdk, and have a few comments and questions.
> >     >>
> >     >> First, it's not clear to me how we should handle build. At the moment
> >     >> the spdk code resides as a submodule in the ceph tree, but it depends
> >     >> on dpdk, which currently needs to be downloaded separately. We can add
> >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
> >     >> said, getting it to build was a bit tricky and I think it might be
> >     >> broken with cmake. In order to get it working I resorted to building a
> >     >> system library and use that.
> >     >
> >     > Note that this PR is about to merge
> >     >
> >     >         https://github.com/ceph/ceph/pull/10748
> >     >
> >     > which adds the DPDK submodule, so hopefully this issue will go away when
> >     > that merged or with a follow-on cleanup.
> >     >
> >     >> The way to currently configure an osd to use bluestore with spdk is by
> >     >> creating a symbolic link that replaces the bluestore 'block' device to
> >     >> point to a file that has a name that is prefixed with 'spdk:'.
> >     >> Originally I assumed that the suffix would be the nvme device id, but
> >     >> it seems that it's not really needed, however, the file itself needs
> >     >> to contain the device id (see
> >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
> >     >> minor fixes).
> >     >
> >     > Open a PR for those?
> >
> >     Sure
> >
> >     >
> >     >> As I understand it, in order to support multiple osds on the same NVMe
> >     >> device we have a few options. We can leverage NVMe namespaces, but
> >     >> that's not supported on all devices. We can configure bluestore to
> >     >> only use part of the device (device sharding? not sure if it supports
> >     >> it). I think it's best if we could keep bluestore out of the loop
> >     >> there and have the NVMe driver abstract multiple partitions of the
> >     >> NVMe device. The idea is to be able to define multiple partitions on
> >     >> the device (e.g., each partition will be defined by the offset, size,
> >     >> and namespace), and have the osd set to use a specific partition.
> >     >> We'll probably need a special tool to manage it, and potentially keep
> >     >> the partition table information on the device itself. The tool could
> >     >> also manage the creation of the block link. We should probably rethink
> >     >> how the link is structure and what it points at.
> >     >
> >     > I agree that bluestore shouldn't get involved.
> >     >
> >     > Is the NVMe namespaces meant to support multiple processes sharing the
> >     > same hardware device?
> >
> >     More of a partitioning solution, but yes (as far as I undestand).
> >
> >     >
> >     > Also, if you do that, is it possible to give one of the namespaces to the
> >     > kernel?  That might solve the bootstrapping problem we 
> > currently have
> >
> >     Theoretically, but not right now (or ever?). See here:
> >
> >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> >
> >     > where we have nowhere to put the $osd_data filesystem with the device
> >     > metadata.  (This is admittedly not necessarily a blocking issue.  Putting
> >     > those dirs on / wouldn't be the end of the world; it just means cards
> >     > can't be easily moved between boxes.)
> >     >
> >
> >     Maybe we can use bluestore for these too ;) that been said, there
> >     might be some kind of a loopback solution that could work, but not
> >     sure if it won't create major bottlenecks that we'd want to avoid.
> >
> >     Yehuda
> >     --
> >     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >     the body of a message to majordomo@vger.kernel.org
> >     More majordomo info at  
> > http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-09 21:10             ` Gohad, Tushar
@ 2016-11-10 22:39               ` Walker, Benjamin
  2016-11-10 22:59                 ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Walker, Benjamin @ 2016-11-10 22:39 UTC (permalink / raw)
  To: Gohad, Tushar, sage, Moreno, Orlando
  Cc: haomaiwang, archer.wudong, ifedotov, james.liu, ceph-devel, yehuda

On Wed, 2016-11-09 at 21:10 +0000, Gohad, Tushar wrote:
> > 
> > > 
> > > Multiple DPDK/SPDK instances on a single host does not work because 
> > > the current implementation in Ceph does not support it. This issue is 
> > > tracked here: http://tracker.ceph.com/issues/16966 There is 
> > > multi-process support in DPDK, but you must configure the EAL 
> > > correctly for it to work. I have been working on a patch, 
> > > https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user 
> > > to configure multiple BlueStore OSDs backed by SPDK. Though this patch 
> > > works, I think it needs a few additions to actually make it performant.
> > > This is just to get the 1 OSD process per NVMe case working. A 
> > > multi-OSD per NVMe solution will probably require more work as 
> > > described in this thread.
> 
> > 
> > TBH I'm not sure how important the multi-osd per NVMe case is.  
> > The only reason to do that would be performance bottlenecks within the 
> > OSD itself, and I'd rather focus our efforts on eliminating those than on 
> > enabling a bandaid solution.
> 
> Completely agree here.

I'm not a Ceph expert (I'm the technical lead for SPDK), but I echo this
sentiment 1000x. Even the fastest NVMe device can only do <1M 4k I/Ops which is
a very modest number in terms of CPU time, so there is no technical reason that
a single OSD can't saturate that. I understand that the OSDs of today aren't
able to achieve that level of performance, but I'm optimistic a concerted long-
term effort involving the experts could make it happen.

I'd also like to explain a few things about NVMe to clear up some confusion I
saw earlier in this thread. NVMe devices are composed of three major primitives
- a singleton controller, some set of namespaces, and a set of queues. The
namespaces are constructs on the SSD itself, and they're basically contiguous
sets of logical blocks within a single NVMe controller. The vast majority of
SSDs support exactly 1 namespace and I don't expect that to change going
forward. The singular NVMe controller is what the NVMe driver is loaded against,
so you can either have SPDK loaded or the kernel - you can't mix and match or
split namespaces, etc. 

NVMe also exposes a set of queues on which I/O requests can be submitted. These
queues can submit an I/O request to any namespace on the device and there is no
way to enforce particular queues mapping to particular namespaces. Therefore,
namespaces aren't that valuable as a mechanism for sharing the drive - you
basically still have to have a software/driver layer verifying that everyone is
keeping their requests separate (the namespace mechanism is there so that the
media can be formatted in different ways - i.e. different block sizes,
additional metadata, etc.). SPDK exposes these queues to the user so that
applications can submit I/O on each queue entirely locklessly and with no
coordination. Unfortunately, the version of SPDK currently in use by BlueStore
is ancient and the queues are all implicit still. It probably doesn't matter for
performance, since the BlueStore SPDK backend only sends I/O from a single
thread, which means it is using just a single queue. NVMe devices almost
universally can get their full performance using a single queue, so multiple
queues is only useful for the application software to submit I/O from many
threads simultaneously without locking (which BlueStore is not doing).

The SPDK NVMe driver unbinds the nvme driver in the kernel, then maps the NVMe
controller registers into a userspace process, so only that process has access
to the device. We're currently modifying the driver to allocate the critical
structures in shared memory so certain parts can be mapped by secondary
processes. This does allow for some level of multi-process support. We mostly
intended this for use with management tools like nvme-cli - they can attach to
the main process and send some management commands and then detach. I'm not sure
this is a great solution for sharing an NVMe device across multiple primary OSD
processes though. We can definitely do something in this space to create a
daemon process that owns the device and allows other processes to attach and
allocate queues, but like I said above I think the effort is best spent on
making the OSD faster.

Further, given what the NVMe hardware is actually capable of, I think the right
solution for sharing an NVMe device within a process is to write a partition
layer in software based on standard GPT partitioning. That could sit on top of
the NVMe driver and do the enforcement of which parts can write to which logical
blocks on the SSD. This would be the best way forward if the Ceph community
pursues multiple OSDs in a single process (again, I think the time should be
spent making one OSD fast enough to saturate one SSD instead).

> 
> > 
> > As I understand it the scenarios that are most interesting are
> 
> > 
> > 1- sharing the same network device to multiple osds with DPDK (this will
> > presumably be pretty common unless/until we combine many OSDs into a single
> > process), and

I believe the best path forward here is SR-IOV hardware support in the NICs. I
don't know what the state of the hardware on the NIC side is here, but I think
SR-IOV is commonly available already on the network side.

> 
> > 
> > 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few
> > MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).  
> > Not sure this will be feasible or not.
> 
> Unfortunately, this is not feasible without some form of partitioning support
> in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK bdev)
> - the latter is under development at the moment.

We are currently developing both a persistent block allocator and a very
lightweight, minimally featured filesystem (no directories, no permissions, no
times). The original target for these are as the backing store of RocksDB, but
they can be easily expanded to store other data. It isn't necessarily our
primary aim to incorporate this into Ceph, but it clearly fits into the Ceph
internals very well. We haven't provided a timeline on open sourcing this, but
we're actively writing the code now.

> 
> The limitation that Orlando identified is, not being able to launch multiple
> SPDK-based OSDs on a node today.  Igor and Orlando's PR (16966) is to add a
> config option to limit the number of hugepages assigned to an OSD via an EAL
> switch.  The other limitation today is being able to specify the CPU mask
> assigned to each OSD which would require config addition.
> 

To clarify, DPDK requires a process to declare the amount of memory (hugepages)
and which CPU cores the process will use up front. If you want to run multiple
DPDK-based processes on the same system, you just have to make sure there are
enough hugepages and that the cores you specify don't overlap. That PR is just
making it so you can configure these values. I just wanted to clarify that there
isn't any deeper technical problem with running multiple DPDK processes on the
same system - you all probably knew that but it's best to be clear. 

Also, DPDK uses hugepages because it's the only good way to get "pinned" memory
in userspace that userspace drivers can DMA into and out of. That's because the
kernel doesn't page out or move around hugepages (hugepages also happen to be
more efficient TLB-wise given that the data buffers are often large transfers).
There is some work on vfio-pci in the kernel that may provide a better solution
in the long term, but I'm not totally up to speed on that. Because data must
reside in hugepages currently, all buffers sent to the SPDK backend for
BlueStore are copied from wherever they are into a buffer from a pool allocated
out of hugepages. It would be better if all data buffers were originally
allocated from hugepage memory, but that's a bigger change to Ceph of course.
Note that incoming packets from DPDK will also reside in hugepages upon DMA from
the NIC, which would be convenient except that almost all NVMe devices today
don't support fully flexible scatter-gather specification of buffers and you end
up forced to copy simply to satisfy the alignment requirements of the DMA
engine. Some day though!

Sorry to be so long-winded, but I'm happy to help with SPDK.

Thanks,
Ben

> Tushar 
> 
> 
> > 
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
> > Sent: Tuesday, November 8, 2016 7:45 PM
> > To: LIU, Fei <james.liu@alibaba-inc.com>
> > Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil 
> > <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel 
> > <ceph-devel@vger.kernel.org>
> > Subject: Re: status of spdk
> > 
> > Hi, Yehuda and Haomai,
> >     DPDK backend may have the same problem. I had tried to use haomai's  PR:
> > https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to
> > start multiple OSDs on the host with only one network card, alse i read
> > about the dpdk multi-process support:
> > http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not
> > find any config  to set multi-process support. Anything wrong or multi-
> > process support not been implemented?
> > 
> > 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> > > 
> > > Hi Yehuda and Haomai,
> > >    The issue of drives driven by SPDK is not able to be shared by multiple
> > > OSDs as kernel NVMe drive since SPDK as a process so far can not be shared
> > > across multiple processes like OSDs, right?
> > > 
> > >    Regards,
> > >    James
> > > 
> > > 
> > > 
> > > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel
> > > .org on behalf of yehuda@redhat.com> wrote:
> > > 
> > >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
> > >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> > >     >> I just started looking at spdk, and have a few comments and
> > > questions.
> > >     >>
> > >     >> First, it's not clear to me how we should handle build. At the
> > > moment
> > >     >> the spdk code resides as a submodule in the ceph tree, but it
> > > depends
> > >     >> on dpdk, which currently needs to be downloaded separately. We can
> > > add
> > >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That
> > > been
> > >     >> said, getting it to build was a bit tricky and I think it might be
> > >     >> broken with cmake. In order to get it working I resorted to
> > > building a
> > >     >> system library and use that.
> > >     >
> > >     > Note that this PR is about to merge
> > >     >
> > >     >         https://github.com/ceph/ceph/pull/10748
> > >     >
> > >     > which adds the DPDK submodule, so hopefully this issue will go away
> > > when
> > >     > that merged or with a follow-on cleanup.
> > >     >
> > >     >> The way to currently configure an osd to use bluestore with spdk is
> > > by
> > >     >> creating a symbolic link that replaces the bluestore 'block' device
> > > to
> > >     >> point to a file that has a name that is prefixed with 'spdk:'.
> > >     >> Originally I assumed that the suffix would be the nvme device id,
> > > but
> > >     >> it seems that it's not really needed, however, the file itself
> > > needs
> > >     >> to contain the device id (see
> > >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple
> > > of
> > >     >> minor fixes).
> > >     >
> > >     > Open a PR for those?
> > > 
> > >     Sure
> > > 
> > >     >
> > >     >> As I understand it, in order to support multiple osds on the same
> > > NVMe
> > >     >> device we have a few options. We can leverage NVMe namespaces, but
> > >     >> that's not supported on all devices. We can configure bluestore to
> > >     >> only use part of the device (device sharding? not sure if it
> > > supports
> > >     >> it). I think it's best if we could keep bluestore out of the loop
> > >     >> there and have the NVMe driver abstract multiple partitions of the
> > >     >> NVMe device. The idea is to be able to define multiple partitions
> > > on
> > >     >> the device (e.g., each partition will be defined by the offset,
> > > size,
> > >     >> and namespace), and have the osd set to use a specific partition.
> > >     >> We'll probably need a special tool to manage it, and potentially
> > > keep
> > >     >> the partition table information on the device itself. The tool
> > > could
> > >     >> also manage the creation of the block link. We should probably
> > > rethink
> > >     >> how the link is structure and what it points at.
> > >     >
> > >     > I agree that bluestore shouldn't get involved.
> > >     >
> > >     > Is the NVMe namespaces meant to support multiple processes sharing
> > > the
> > >     > same hardware device?
> > > 
> > >     More of a partitioning solution, but yes (as far as I undestand).
> > > 
> > >     >
> > >     > Also, if you do that, is it possible to give one of the namespaces
> > > to the
> > >     > kernel?  That might solve the bootstrapping problem we 
> > > currently have
> > > 
> > >     Theoretically, but not right now (or ever?). See here:
> > > 
> > >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> > > 
> > >     > where we have nowhere to put the $osd_data filesystem with the
> > > device
> > >     > metadata.  (This is admittedly not necessarily a blocking
> > > issue.  Putting
> > >     > those dirs on / wouldn't be the end of the world; it just means
> > > cards
> > >     > can't be easily moved between boxes.)
> > >     >
> > > 
> > >     Maybe we can use bluestore for these too ;) that been said, there
> > >     might be some kind of a loopback solution that could work, but not
> > >     sure if it won't create major bottlenecks that we'd want to avoid.
> > > 
> > >     Yehuda
> > >     --
> > >     To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in
> > >     the body of a message to majordomo@vger.kernel.org
> > >     More majordomo info at  
> > > http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at  http://
> vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-10 22:39               ` Walker, Benjamin
@ 2016-11-10 22:59                 ` Sage Weil
  2016-11-10 23:54                   ` Walker, Benjamin
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-11-10 22:59 UTC (permalink / raw)
  To: Walker, Benjamin
  Cc: Gohad, Tushar, Moreno, Orlando, haomaiwang, archer.wudong,
	ifedotov, james.liu, ceph-devel, yehuda

[-- Attachment #1: Type: TEXT/PLAIN, Size: 13289 bytes --]

Hi-

Thanks, Ben-- this is super helpful!

On Thu, 10 Nov 2016, Walker, Benjamin wrote:
> > > 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few
> > > MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).  
> > > Not sure this will be feasible or not.
> > 
> > Unfortunately, this is not feasible without some form of partitioning support
> > in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK bdev)
> > - the latter is under development at the moment.
> 
> We are currently developing both a persistent block allocator and a very
> lightweight, minimally featured filesystem (no directories, no permissions, no
> times). The original target for these are as the backing store of RocksDB, but
> they can be easily expanded to store other data. It isn't necessarily our
> primary aim to incorporate this into Ceph, but it clearly fits into the Ceph
> internals very well. We haven't provided a timeline on open sourcing this, but
> we're actively writing the code now.

This may not actually be that helpful.  It's basically what BlueFS is 
already doing (it's a rocksdb::Env that implements minimal "file system" 
and shares the device with teh rest of BlueStore).

What this point is really about is more operational than anything.  
Currently disks (HDDs or SSDs) can be easily swapped between machines 
because they have GPT partition labels and udev rules to run 'ceph-disk 
trigger' on them.  That basically mounts the tagged partition to a 
temporary location, figures out which OSD it is, bind mounts it to the 
appropriate /var/lib/ceph/osd/* directory, and then starts up the process.  
With BlueStore there are just a handful of metadata/bootstrap files here 
to get the OSD started:

-rw-r--r-- 1 sage sage           2 Nov  9 11:40 bluefs
-rw-r--r-- 1 sage sage          37 Nov  9 11:40 ceph_fsid
-rw-r--r-- 1 sage sage          37 Nov  9 11:40 fsid
-rw------- 1 sage sage          56 Nov  9 11:40 keyring
-rw-r--r-- 1 sage sage           8 Nov  9 11:40 kv_backend
-rw-r--r-- 1 sage sage          21 Nov  9 11:40 magic
-rw-r--r-- 1 sage sage           4 Nov  9 11:40 mkfs_done
-rw-r--r-- 1 sage sage           6 Nov  9 11:40 ready
-rw-r--r-- 1 sage sage          10 Nov  9 11:40 type
-rw-r--r-- 1 sage sage           2 Nov  9 11:40 whoami

plus a symlink for block, block.db, and block.wal to the other partitions 
or devices with the actual block data.

With SPDK, we can't carve out a partition or label it, a certainly can't 
mount it, so we'll need to rethink the bootstrapping process.  Fortunatley 
that can be wrapped up reasonably neatly in the 'ceph-disk activate' 
function, but eventually we'll need to decide how to storage/manage this 
metadata about the device.

Or just forget about easy hot swapping and stick these files on the 
hosts root partition.

> > The limitation that Orlando identified is, not being able to launch multiple
> > SPDK-based OSDs on a node today.  Igor and Orlando's PR (16966) is to add a
> > config option to limit the number of hugepages assigned to an OSD via an EAL
> > switch.  The other limitation today is being able to specify the CPU mask
> > assigned to each OSD which would require config addition.
> > 
> 
> To clarify, DPDK requires a process to declare the amount of memory (hugepages)
> and which CPU cores the process will use up front. If you want to run multiple
> DPDK-based processes on the same system, you just have to make sure there are
> enough hugepages and that the cores you specify don't overlap. That PR is just
> making it so you can configure these values. I just wanted to clarify that there
> isn't any deeper technical problem with running multiple DPDK processes on the
> same system - you all probably knew that but it's best to be clear. 
> 
> Also, DPDK uses hugepages because it's the only good way to get "pinned" memory
> in userspace that userspace drivers can DMA into and out of. That's because the
> kernel doesn't page out or move around hugepages (hugepages also happen to be
> more efficient TLB-wise given that the data buffers are often large transfers).
> There is some work on vfio-pci in the kernel that may provide a better solution
> in the long term, but I'm not totally up to speed on that. Because data must
> reside in hugepages currently, all buffers sent to the SPDK backend for
> BlueStore are copied from wherever they are into a buffer from a pool allocated
> out of hugepages. It would be better if all data buffers were originally
> allocated from hugepage memory, but that's a bigger change to Ceph of course.
> Note that incoming packets from DPDK will also reside in hugepages upon DMA from
> the NIC, which would be convenient except that almost all NVMe devices today
> don't support fully flexible scatter-gather specification of buffers and you end
> up forced to copy simply to satisfy the alignment requirements of the DMA
> engine. Some day though!

Yeah, we definitely want to get there eventually.  When Ceph sends data 
over the wire it is preceded by a header that includes an alignment 
so that (with TCP currently) we read data off the socket into 
properly aligned memory.  That way we can eventually do O_DIRECT writes 
with it.  If it's possible to direct what memory the DPDK data comes into 
we can hopefully do something similar here...  The rest of Ceph's 
bufferlist library should be flexible enough to enable zero-copy.

> Sorry to be so long-winded, but I'm happy to help with SPDK.

That's great to hear--this was very helpful for me!

Thanks-
sage




> 
> Thanks,
> Ben
> 
> > Tushar 
> > 
> > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org 
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
> > > Sent: Tuesday, November 8, 2016 7:45 PM
> > > To: LIU, Fei <james.liu@alibaba-inc.com>
> > > Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil 
> > > <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel 
> > > <ceph-devel@vger.kernel.org>
> > > Subject: Re: status of spdk
> > > 
> > > Hi, Yehuda and Haomai,
> > >     DPDK backend may have the same problem. I had tried to use haomai's  PR:
> > > https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to
> > > start multiple OSDs on the host with only one network card, alse i read
> > > about the dpdk multi-process support:
> > > http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not
> > > find any config  to set multi-process support. Anything wrong or multi-
> > > process support not been implemented?
> > > 
> > > 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> > > > 
> > > > Hi Yehuda and Haomai,
> > > >    The issue of drives driven by SPDK is not able to be shared by multiple
> > > > OSDs as kernel NVMe drive since SPDK as a process so far can not be shared
> > > > across multiple processes like OSDs, right?
> > > > 
> > > >    Regards,
> > > >    James
> > > > 
> > > > 
> > > > 
> > > > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel
> > > > .org on behalf of yehuda@redhat.com> wrote:
> > > > 
> > > >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
> > > >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> > > >     >> I just started looking at spdk, and have a few comments and
> > > > questions.
> > > >     >>
> > > >     >> First, it's not clear to me how we should handle build. At the
> > > > moment
> > > >     >> the spdk code resides as a submodule in the ceph tree, but it
> > > > depends
> > > >     >> on dpdk, which currently needs to be downloaded separately. We can
> > > > add
> > > >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That
> > > > been
> > > >     >> said, getting it to build was a bit tricky and I think it might be
> > > >     >> broken with cmake. In order to get it working I resorted to
> > > > building a
> > > >     >> system library and use that.
> > > >     >
> > > >     > Note that this PR is about to merge
> > > >     >
> > > >     >         https://github.com/ceph/ceph/pull/10748
> > > >     >
> > > >     > which adds the DPDK submodule, so hopefully this issue will go away
> > > > when
> > > >     > that merged or with a follow-on cleanup.
> > > >     >
> > > >     >> The way to currently configure an osd to use bluestore with spdk is
> > > > by
> > > >     >> creating a symbolic link that replaces the bluestore 'block' device
> > > > to
> > > >     >> point to a file that has a name that is prefixed with 'spdk:'.
> > > >     >> Originally I assumed that the suffix would be the nvme device id,
> > > > but
> > > >     >> it seems that it's not really needed, however, the file itself
> > > > needs
> > > >     >> to contain the device id (see
> > > >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple
> > > > of
> > > >     >> minor fixes).
> > > >     >
> > > >     > Open a PR for those?
> > > > 
> > > >     Sure
> > > > 
> > > >     >
> > > >     >> As I understand it, in order to support multiple osds on the same
> > > > NVMe
> > > >     >> device we have a few options. We can leverage NVMe namespaces, but
> > > >     >> that's not supported on all devices. We can configure bluestore to
> > > >     >> only use part of the device (device sharding? not sure if it
> > > > supports
> > > >     >> it). I think it's best if we could keep bluestore out of the loop
> > > >     >> there and have the NVMe driver abstract multiple partitions of the
> > > >     >> NVMe device. The idea is to be able to define multiple partitions
> > > > on
> > > >     >> the device (e.g., each partition will be defined by the offset,
> > > > size,
> > > >     >> and namespace), and have the osd set to use a specific partition.
> > > >     >> We'll probably need a special tool to manage it, and potentially
> > > > keep
> > > >     >> the partition table information on the device itself. The tool
> > > > could
> > > >     >> also manage the creation of the block link. We should probably
> > > > rethink
> > > >     >> how the link is structure and what it points at.
> > > >     >
> > > >     > I agree that bluestore shouldn't get involved.
> > > >     >
> > > >     > Is the NVMe namespaces meant to support multiple processes sharing
> > > > the
> > > >     > same hardware device?
> > > > 
> > > >     More of a partitioning solution, but yes (as far as I undestand).
> > > > 
> > > >     >
> > > >     > Also, if you do that, is it possible to give one of the namespaces
> > > > to the
> > > >     > kernel?  That might solve the bootstrapping problem we 
> > > > currently have
> > > > 
> > > >     Theoretically, but not right now (or ever?). See here:
> > > > 
> > > >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> > > > 
> > > >     > where we have nowhere to put the $osd_data filesystem with the
> > > > device
> > > >     > metadata.  (This is admittedly not necessarily a blocking
> > > > issue.  Putting
> > > >     > those dirs on / wouldn't be the end of the world; it just means
> > > > cards
> > > >     > can't be easily moved between boxes.)
> > > >     >
> > > > 
> > > >     Maybe we can use bluestore for these too ;) that been said, there
> > > >     might be some kind of a loopback solution that could work, but not
> > > >     sure if it won't create major bottlenecks that we'd want to avoid.
> > > > 
> > > >     Yehuda
> > > >     --
> > > >     To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in
> > > >     the body of a message to majordomo@vger.kernel.org
> > > >     More majordomo info at  
> > > > http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> > body of a message to majordomo@vger.kernel.org More majordomo info at  http://
> > vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> N?????r??y??????X??ǧv???)޺{.n?????z?]z????ay?\x1dʇڙ??j\a??f???h?????\x1e?w???\f???j:+v???w????????\a????zZ+???????j"????i

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: status of spdk
  2016-11-10 22:59                 ` Sage Weil
@ 2016-11-10 23:54                   ` Walker, Benjamin
  0 siblings, 0 replies; 18+ messages in thread
From: Walker, Benjamin @ 2016-11-10 23:54 UTC (permalink / raw)
  To: sage
  Cc: haomaiwang, james.liu, archer.wudong, Moreno, Orlando, yehuda,
	Gohad, Tushar, ifedotov, ceph-devel

On Thu, 2016-11-10 at 22:59 +0000, Sage Weil wrote:
> Hi-
> 
> Thanks, Ben-- this is super helpful!
> 
> On Thu, 10 Nov 2016, Walker, Benjamin wrote:
> > 
> > > 
> > > > 
> > > > 2- sharing a tiny portion of the NVMe device for the osd_data (usually a
> > > > few
> > > > MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).  
> > > > Not sure this will be feasible or not.
> > > 
> > > Unfortunately, this is not feasible without some form of partitioning
> > > support
> > > in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK
> > > bdev)
> > > - the latter is under development at the moment.
> > 
> > We are currently developing both a persistent block allocator and a very
> > lightweight, minimally featured filesystem (no directories, no permissions,
> > no
> > times). The original target for these are as the backing store of RocksDB,
> > but
> > they can be easily expanded to store other data. It isn't necessarily our
> > primary aim to incorporate this into Ceph, but it clearly fits into the Ceph
> > internals very well. We haven't provided a timeline on open sourcing this,
> > but
> > we're actively writing the code now.
> 
> This may not actually be that helpful.  It's basically what BlueFS is 
> already doing (it's a rocksdb::Env that implements minimal "file system" 
> and shares the device with teh rest of BlueStore).

Understood - this is pretty much BlueFS + BlueStore in a standalone format. On
the surface it seems like it's duplicated work, but we are very heavily focused
on solid state media (particularly, next generation media beyond NAND) and that
has led us to diverge quite a bit in design from BlueStore. If our work ends up
benefiting Ceph in some way in the longer term, that's great, but I understand
Ceph already has code doing somewhat similar things.

> 
> What this point is really about is more operational than anything.  
> Currently disks (HDDs or SSDs) can be easily swapped between machines 
> because they have GPT partition labels and udev rules to run 'ceph-disk 
> trigger' on them.  That basically mounts the tagged partition to a 
> temporary location, figures out which OSD it is, bind mounts it to the 
> appropriate /var/lib/ceph/osd/* directory, and then starts up the process.  
> With BlueStore there are just a handful of metadata/bootstrap files here 
> to get the OSD started:
> 
> -rw-r--r-- 1 sage sage           2 Nov  9 11:40 bluefs
> -rw-r--r-- 1 sage sage          37 Nov  9 11:40 ceph_fsid
> -rw-r--r-- 1 sage sage          37 Nov  9 11:40 fsid
> -rw------- 1 sage sage          56 Nov  9 11:40 keyring
> -rw-r--r-- 1 sage sage           8 Nov  9 11:40 kv_backend
> -rw-r--r-- 1 sage sage          21 Nov  9 11:40 magic
> -rw-r--r-- 1 sage sage           4 Nov  9 11:40 mkfs_done
> -rw-r--r-- 1 sage sage           6 Nov  9 11:40 ready
> -rw-r--r-- 1 sage sage          10 Nov  9 11:40 type
> -rw-r--r-- 1 sage sage           2 Nov  9 11:40 whoami
> 
> plus a symlink for block, block.db, and block.wal to the other partitions 
> or devices with the actual block data.
> 
> With SPDK, we can't carve out a partition or label it, a certainly can't 
> mount it, so we'll need to rethink the bootstrapping process.  Fortunatley 
> that can be wrapped up reasonably neatly in the 'ceph-disk activate' 
> function, but eventually we'll need to decide how to storage/manage this 
> metadata about the device.

This sounds like a solvable problem to me. An OSD using BlueStore uses a block
device that has a one GPT partition with a filesystem (XFS?) that contains the
above bootstrapping data, plus some number of other GPT partitions with no
filesystems that are used for everything else, right? I think there are two
changes that could be made here. First, the bootstrap partition needs to contain
a BlueStore/Ceph-specific formatted data layout instead of using a kernel
filesystem. Maybe it could even be simpler and just have a flat binary layout
containing the above files sequentially or something.

Second, the BlueStore SPDK backend needs to comprehend real GPT partition
metadata (this part is not particularly hard - GPT is simple). That way, the
disk format between OSDs using SPDK and those using the kernel are identical and
SPDK respects the partitions and can locate them by partition label. Once
they're identical, Ceph can simply load using the GPT partition label and udev
mechanism as it does today, then dynamically unbind the kernel nvme driver from
the device (you just write to sysfs) and load SPDK in its place. Because the
SPDK backend is expecting the same disk format as the kernel, it will load
without issue.

I think this probably solves a few of the other pain points of using Ceph with
SPDK too around configuration. With this strategy all you have to do is flag the
OSD to use SPDK with no other configuration changes (well, maybe the number of
hugepages and which cores are allowed). This is because most of the
configuration for the disks is around specifying which data is where, and that
seems to be done by GPT partition label which the SPDK backend would now
comprehend.

> 
> Or just forget about easy hot swapping and stick these files on the 
> hosts root partition.
> 
> > 
> > > 
> > > The limitation that Orlando identified is, not being able to launch
> > > multiple
> > > SPDK-based OSDs on a node today.  Igor and Orlando's PR (16966) is to add
> > > a
> > > config option to limit the number of hugepages assigned to an OSD via an
> > > EAL
> > > switch.  The other limitation today is being able to specify the CPU mask
> > > assigned to each OSD which would require config addition.
> > > 
> > 
> > To clarify, DPDK requires a process to declare the amount of memory
> > (hugepages)
> > and which CPU cores the process will use up front. If you want to run
> > multiple
> > DPDK-based processes on the same system, you just have to make sure there
> > are
> > enough hugepages and that the cores you specify don't overlap. That PR is
> > just
> > making it so you can configure these values. I just wanted to clarify that
> > there
> > isn't any deeper technical problem with running multiple DPDK processes on
> > the
> > same system - you all probably knew that but it's best to be clear. 
> > 
> > Also, DPDK uses hugepages because it's the only good way to get "pinned"
> > memory
> > in userspace that userspace drivers can DMA into and out of. That's because
> > the
> > kernel doesn't page out or move around hugepages (hugepages also happen to
> > be
> > more efficient TLB-wise given that the data buffers are often large
> > transfers).
> > There is some work on vfio-pci in the kernel that may provide a better
> > solution
> > in the long term, but I'm not totally up to speed on that. Because data must
> > reside in hugepages currently, all buffers sent to the SPDK backend for
> > BlueStore are copied from wherever they are into a buffer from a pool
> > allocated
> > out of hugepages. It would be better if all data buffers were originally
> > allocated from hugepage memory, but that's a bigger change to Ceph of
> > course.
> > Note that incoming packets from DPDK will also reside in hugepages upon DMA
> > from
> > the NIC, which would be convenient except that almost all NVMe devices today
> > don't support fully flexible scatter-gather specification of buffers and you
> > end
> > up forced to copy simply to satisfy the alignment requirements of the DMA
> > engine. Some day though!
> 
> Yeah, we definitely want to get there eventually.  When Ceph sends data 
> over the wire it is preceded by a header that includes an alignment 
> so that (with TCP currently) we read data off the socket into 
> properly aligned memory.  That way we can eventually do O_DIRECT writes 
> with it.  If it's possible to direct what memory the DPDK data comes into 
> we can hopefully do something similar here...  The rest of Ceph's 
> bufferlist library should be flexible enough to enable zero-copy.
> 
> > 
> > Sorry to be so long-winded, but I'm happy to help with SPDK.
> 
> That's great to hear--this was very helpful for me!
> 
> Thanks-
> sage
> 
> 
> 
> 
> > 
> > 
> > Thanks,
> > Ben
> > 
> > > 
> > > Tushar 
> > > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org 
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
> > > > Sent: Tuesday, November 8, 2016 7:45 PM
> > > > To: LIU, Fei <james.liu@alibaba-inc.com>
> > > > Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil 
> > > > <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel 
> > > > <ceph-devel@vger.kernel.org>
> > > > Subject: Re: status of spdk
> > > > 
> > > > Hi, Yehuda and Haomai,
> > > >     DPDK backend may have the same problem. I had tried to use
> > > > haomai's  PR:
> > > > https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed
> > > > to
> > > > start multiple OSDs on the host with only one network card, alse i read
> > > > about the dpdk multi-process support:
> > > > http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did
> > > > not
> > > > find any config  to set multi-process support. Anything wrong or multi-
> > > > process support not been implemented?
> > > > 
> > > > 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> > > > > 
> > > > > 
> > > > > Hi Yehuda and Haomai,
> > > > >    The issue of drives driven by SPDK is not able to be shared by
> > > > > multiple
> > > > > OSDs as kernel NVMe drive since SPDK as a process so far can not be
> > > > > shared
> > > > > across multiple processes like OSDs, right?
> > > > > 
> > > > >    Regards,
> > > > >    James
> > > > > 
> > > > > 
> > > > > 
> > > > > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.ke
> > > > > rnel
> > > > > .org on behalf of yehuda@redhat.com> wrote:
> > > > > 
> > > > >     On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com>
> > > > > wrote:
> > > > >     > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> > > > >     >> I just started looking at spdk, and have a few comments and
> > > > > questions.
> > > > >     >>
> > > > >     >> First, it's not clear to me how we should handle build. At the
> > > > > moment
> > > > >     >> the spdk code resides as a submodule in the ceph tree, but it
> > > > > depends
> > > > >     >> on dpdk, which currently needs to be downloaded separately. We
> > > > > can
> > > > > add
> > > > >     >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That
> > > > > been
> > > > >     >> said, getting it to build was a bit tricky and I think it might
> > > > > be
> > > > >     >> broken with cmake. In order to get it working I resorted to
> > > > > building a
> > > > >     >> system library and use that.
> > > > >     >
> > > > >     > Note that this PR is about to merge
> > > > >     >
> > > > >     >         https://github.com/ceph/ceph/pull/10748
> > > > >     >
> > > > >     > which adds the DPDK submodule, so hopefully this issue will go
> > > > > away
> > > > > when
> > > > >     > that merged or with a follow-on cleanup.
> > > > >     >
> > > > >     >> The way to currently configure an osd to use bluestore with
> > > > > spdk is
> > > > > by
> > > > >     >> creating a symbolic link that replaces the bluestore 'block'
> > > > > device
> > > > > to
> > > > >     >> point to a file that has a name that is prefixed with 'spdk:'.
> > > > >     >> Originally I assumed that the suffix would be the nvme device
> > > > > id,
> > > > > but
> > > > >     >> it seems that it's not really needed, however, the file itself
> > > > > needs
> > > > >     >> to contain the device id (see
> > > > >     >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a
> > > > > couple
> > > > > of
> > > > >     >> minor fixes).
> > > > >     >
> > > > >     > Open a PR for those?
> > > > > 
> > > > >     Sure
> > > > > 
> > > > >     >
> > > > >     >> As I understand it, in order to support multiple osds on the
> > > > > same
> > > > > NVMe
> > > > >     >> device we have a few options. We can leverage NVMe namespaces,
> > > > > but
> > > > >     >> that's not supported on all devices. We can configure bluestore
> > > > > to
> > > > >     >> only use part of the device (device sharding? not sure if it
> > > > > supports
> > > > >     >> it). I think it's best if we could keep bluestore out of the
> > > > > loop
> > > > >     >> there and have the NVMe driver abstract multiple partitions of
> > > > > the
> > > > >     >> NVMe device. The idea is to be able to define multiple
> > > > > partitions
> > > > > on
> > > > >     >> the device (e.g., each partition will be defined by the offset,
> > > > > size,
> > > > >     >> and namespace), and have the osd set to use a specific
> > > > > partition.
> > > > >     >> We'll probably need a special tool to manage it, and
> > > > > potentially
> > > > > keep
> > > > >     >> the partition table information on the device itself. The tool
> > > > > could
> > > > >     >> also manage the creation of the block link. We should probably
> > > > > rethink
> > > > >     >> how the link is structure and what it points at.
> > > > >     >
> > > > >     > I agree that bluestore shouldn't get involved.
> > > > >     >
> > > > >     > Is the NVMe namespaces meant to support multiple processes
> > > > > sharing
> > > > > the
> > > > >     > same hardware device?
> > > > > 
> > > > >     More of a partitioning solution, but yes (as far as I undestand).
> > > > > 
> > > > >     >
> > > > >     > Also, if you do that, is it possible to give one of the
> > > > > namespaces
> > > > > to the
> > > > >     > kernel?  That might solve the bootstrapping problem we 
> > > > > currently have
> > > > > 
> > > > >     Theoretically, but not right now (or ever?). See here:
> > > > > 
> > > > >     https://lists.01.org/pipermail/spdk/2016-July/000073.html
> > > > > 
> > > > >     > where we have nowhere to put the $osd_data filesystem with the
> > > > > device
> > > > >     > metadata.  (This is admittedly not necessarily a blocking
> > > > > issue.  Putting
> > > > >     > those dirs on / wouldn't be the end of the world; it just means
> > > > > cards
> > > > >     > can't be easily moved between boxes.)
> > > > >     >
> > > > > 
> > > > >     Maybe we can use bluestore for these too ;) that been said, there
> > > > >     might be some kind of a loopback solution that could work, but not
> > > > >     sure if it won't create major bottlenecks that we'd want to avoid.
> > > > > 
> > > > >     Yehuda
> > > > >     --
> > > > >     To unsubscribe from this list: send the line "unsubscribe ceph-
> > > > > devel"
> > > > > in
> > > > >     the body of a message to majordomo@vger.kernel.org
> > > > >     More majordomo info at  
> > > > > http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the
> > > body of a message to majordomo@vger.kernel.org More majordomo info
> > > at  http://
> > > vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > N?????r??y??????X??ǧv???)޺{.n?????z?]z????ay?\x1dʇڙ??j\a??f???h?????\x1e?w???
> > \f???j:+v???w????????\a????zZ+???????j"????i

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-11-10 23:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-08 23:31 status of spdk Yehuda Sadeh-Weinraub
2016-11-08 23:40 ` Sage Weil
2016-11-09  0:06   ` Yehuda Sadeh-Weinraub
2016-11-09  0:21     ` LIU, Fei
2016-11-09  2:45       ` Dong Wu
2016-11-09 20:53         ` Moreno, Orlando
2016-11-09 20:58           ` Sage Weil
2016-11-09 21:00             ` Gohad, Tushar
2016-11-09 21:10             ` Gohad, Tushar
2016-11-10 22:39               ` Walker, Benjamin
2016-11-10 22:59                 ` Sage Weil
2016-11-10 23:54                   ` Walker, Benjamin
2016-11-09  4:59       ` Haomai Wang
2016-11-09  5:02         ` LIU, Fei
2016-11-09  5:09           ` Liu, Changpeng
2016-11-09  5:23             ` LIU, Fei
2016-11-09  4:49   ` Haomai Wang
2016-11-09  4:45 ` Haomai Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.