* Ceph journal
@ 2012-10-31 21:18 Gandalf Corvotempesta
  2012-10-31 21:24 ` Tren Blackburn
  0 siblings, 1 reply; 14+ messages in thread
From: Gandalf Corvotempesta @ 2012-10-31 21:18 UTC (permalink / raw)
  To: ceph-devel

In a multi-replica cluster (for example, replica = 3), is it safe to put
the journal on a tmpfs?
As far as I understand, with the journal enabled all writes go to the
journal first and are written to disk later.
If a node hangs while data is still only in the journal (and the journal
is not on a persistent disk), some data could be lost.

In a multi-replica environment, the other nodes should still be able to
write the same data to disk, right? In that case, using a journal on a
tmpfs should be safe enough.

Am I missing something?
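
For reference, this is roughly what I have in mind in ceph.conf (just a
sketch -- the tmpfs mount point and the journal size are placeholders):

    [osd]
        osd journal = /mnt/tmpfs/osd.$id/journal    ; journal file on a tmpfs mount
        osd journal size = 1024                     ; journal size in MB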


* Re: Ceph journal
  2012-10-31 21:18 Ceph journal Gandalf Corvotempesta
@ 2012-10-31 21:24 ` Tren Blackburn
  2012-10-31 21:32   ` Stefan Kleijkers
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Tren Blackburn @ 2012-10-31 21:24 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

On Wed, Oct 31, 2012 at 2:18 PM, Gandalf Corvotempesta
<gandalf.corvotempesta@gmail.com> wrote:
> In a multi-replica cluster (for example, replica = 3), is it safe to put
> the journal on a tmpfs?
> As far as I understand, with the journal enabled all writes go to the
> journal first and are written to disk later.
> If a node hangs while data is still only in the journal (and the journal
> is not on a persistent disk), some data could be lost.
>
> In a multi-replica environment, the other nodes should still be able to
> write the same data to disk, right? In that case, using a journal on a
> tmpfs should be safe enough.

Unless you're using btrfs which writes to the journal and osd fs
concurrently, if you lose the journal device (such as due to a
reboot), you've lost the osd device, requiring it to be remade and
re-added.
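
If I remember right, that behaviour maps to the filestore journal mode
options -- something like this (only a sketch, please double-check it
before relying on it):

    [osd]
        filestore journal parallel = true      ; btrfs: journal and data written concurrently
        ; filestore journal writeahead = true  ; non-btrfs (xfs/ext4): journal is written first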

This is what I understand at least. If I'm wrong one of the devs will
strike me down I'm sure ;)

t.


* Re: Ceph journal
  2012-10-31 21:24 ` Tren Blackburn
@ 2012-10-31 21:32   ` Stefan Kleijkers
  2012-10-31 21:54   ` Sage Weil
  2012-10-31 21:58   ` Gandalf Corvotempesta
  2 siblings, 0 replies; 14+ messages in thread
From: Stefan Kleijkers @ 2012-10-31 21:32 UTC (permalink / raw)
  To: Tren Blackburn; +Cc: Gandalf Corvotempesta, ceph-devel

Hello,

On 10/31/2012 10:24 PM, Tren Blackburn wrote:
> On Wed, Oct 31, 2012 at 2:18 PM, Gandalf Corvotempesta
> <gandalf.corvotempesta@gmail.com> wrote:
>> In a multi-replica cluster (for example, replica = 3), is it safe to put
>> the journal on a tmpfs?
>> As far as I understand, with the journal enabled all writes go to the
>> journal first and are written to disk later.
>> If a node hangs while data is still only in the journal (and the journal
>> is not on a persistent disk), some data could be lost.
>>
>> In a multi-replica environment, the other nodes should still be able to
>> write the same data to disk, right? In that case, using a journal on a
>> tmpfs should be safe enough.
> Unless you're using btrfs which writes to the journal and osd fs
> concurrently, if you lose the journal device (such as due to a
> reboot), you've lost the osd device, requiring it to be remade and
> re-added.
>
> This is what I understand at least. If I'm wrong one of the devs will
> strike me down I'm sure ;)

There's an option to recreate the journal, so you don't lose the OSD.
Of course you lose the data that was in the journal.
It's possible to have the journal on a tmpfs device, but of course it's
not 100% safe: if you lose all three nodes holding the replicas you can
lose data. Then again, with more replicas the chance of losing them all
gets smaller. It's a trade-off between speed, reliability and space.
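
The recreate step is roughly this (from memory, so double-check the
syntax; osd.0 is just an example):

    # stop the OSD first, then build a fresh, empty journal for it;
    # anything that was only in the old journal is gone
    ceph-osd -i 0 --mkjournal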

Stefan



* Re: Ceph journal
  2012-10-31 21:24 ` Tren Blackburn
  2012-10-31 21:32   ` Stefan Kleijkers
@ 2012-10-31 21:54   ` Sage Weil
  2012-10-31 21:58   ` Gandalf Corvotempesta
  2 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2012-10-31 21:54 UTC (permalink / raw)
  To: Tren Blackburn; +Cc: Gandalf Corvotempesta, ceph-devel

On Wed, 31 Oct 2012, Tren Blackburn wrote:
> On Wed, Oct 31, 2012 at 2:18 PM, Gandalf Corvotempesta
> <gandalf.corvotempesta@gmail.com> wrote:
> > In a multi-replica cluster (for example, replica = 3), is it safe to put
> > the journal on a tmpfs?
> > As far as I understand, with the journal enabled all writes go to the
> > journal first and are written to disk later.
> > If a node hangs while data is still only in the journal (and the journal
> > is not on a persistent disk), some data could be lost.
> >
> > In a multi-replica environment, the other nodes should still be able to
> > write the same data to disk, right? In that case, using a journal on a
> > tmpfs should be safe enough.
> 
> Unless you're using btrfs which writes to the journal and osd fs
> concurrently, if you lose the journal device (such as due to a
> reboot), you've lost the osd device, requiring it to be remade and
> re-added.

This is correct.  For non-btrfs file systems we rely on the journal for 
basic consistency.

sage


* Re: Ceph journal
  2012-10-31 21:24 ` Tren Blackburn
  2012-10-31 21:32   ` Stefan Kleijkers
  2012-10-31 21:54   ` Sage Weil
@ 2012-10-31 21:58   ` Gandalf Corvotempesta
  2012-10-31 22:04     ` Stefan Kleijkers
  2 siblings, 1 reply; 14+ messages in thread
From: Gandalf Corvotempesta @ 2012-10-31 21:58 UTC (permalink / raw)
  To: Tren Blackburn; +Cc: ceph-devel

2012/10/31 Tren Blackburn <tren@eotnetworks.com>:
> Unless you're using btrfs which writes to the journal and osd fs
> concurrently, if you lose the journal device (such as due to a
> reboot), you've lost the osd device, requiring it to be remade and
> re-added.

I don't understand.
Losing a journal will result in losing the whole OSD?

AFAIK, Ceph writes to the journal and then returns an "OK".
After that, the journal is written (in the background) to disk, so
losing a journal should mean losing that portion of data, not
the whole OSD.

Now, in the case of 3 replicated nodes, will Ceph write the same data at
the same time to all three journals? If so, losing a single
journal/OSD should not result in data loss, because the same data is
still on the other 2 nodes. In that case, it should be possible to use a
tmpfs journal and rely on replication for redundancy.


* Re: Ceph journal
  2012-10-31 21:58   ` Gandalf Corvotempesta
@ 2012-10-31 22:04     ` Stefan Kleijkers
  2012-10-31 22:07       ` Gandalf Corvotempesta
  2012-11-01 21:18       ` Gandalf Corvotempesta
  0 siblings, 2 replies; 14+ messages in thread
From: Stefan Kleijkers @ 2012-10-31 22:04 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Tren Blackburn, ceph-devel

Hello,

On 10/31/2012 10:58 PM, Gandalf Corvotempesta wrote:
> 2012/10/31 Tren Blackburn <tren@eotnetworks.com>:
>> Unless you're using btrfs which writes to the journal and osd fs
>> concurrently, if you lose the journal device (such as due to a
>> reboot), you've lost the osd device, requiring it to be remade and
>> re-added.
> I don't understand.
> Losing a journal will result in losing the whole OSD?
>
> AFAIK, Ceph writes to the journal and then returns an "OK".
> After that, the journal is written (in the background) to disk, so
> losing a journal should mean losing that portion of data, not
> the whole OSD.
>
> Now, in the case of 3 replicated nodes, will Ceph write the same data at
> the same time to all three journals? If so, losing a single
> journal/OSD should not result in data loss, because the same data is
> still on the other 2 nodes. In that case, it should be possible to use a
> tmpfs journal and rely on replication for redundancy.

As far as I know, this is correct. You get an ACK (on the write) back
after it has landed on ALL three journals (and/or OSDs in the case of
BTRFS in parallel mode). So if you lose one node, you still have the
data on two more nodes and they will commit it to disk. After the
missing node/OSD recovers, it will get the data from one of the other
nodes. So you won't lose any data.

Stefan


* Re: Ceph journal
  2012-10-31 22:04     ` Stefan Kleijkers
@ 2012-10-31 22:07       ` Gandalf Corvotempesta
  2012-10-31 22:55         ` Sébastien Han
  2012-11-01 21:18       ` Gandalf Corvotempesta
  1 sibling, 1 reply; 14+ messages in thread
From: Gandalf Corvotempesta @ 2012-10-31 22:07 UTC (permalink / raw)
  To: Stefan Kleijkers; +Cc: Tren Blackburn, ceph-devel

2012/10/31 Stefan Kleijkers <stefan@kleijkers.nl>:
> As far as I know, this is correct. You get an ACK (on the write) back after
> it has landed on ALL three journals (and/or OSDs in the case of BTRFS in
> parallel mode). So if you lose one node, you still have the data on two more
> nodes and they will commit it to disk. After the missing node/OSD recovers,
> it will get the data from one of the other nodes. So you won't lose any data.

Sounds perfect; this would allow me to avoid SSD disks and use all 12
disks in a DELL R515 as OSDs.


* Re: Ceph journal
  2012-10-31 22:07       ` Gandalf Corvotempesta
@ 2012-10-31 22:55         ` Sébastien Han
  0 siblings, 0 replies; 14+ messages in thread
From: Sébastien Han @ 2012-10-31 22:55 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Stefan Kleijkers, Tren Blackburn, ceph-devel

Hi,

Personally, I wouldn't take the risk of losing transactions. If a client
writes into a journal, assuming it's the first write, and the server
crashes for whatever reason, you run a high risk of inconsistent data,
because you have just lost whatever was in the journal.
Tmpfs is the cheapest way to get better performance, but it's definitely
not the most reliable. Keep in mind that you don't really want to watch
your load/data rebalancing across the cluster while recovering from a
failure...
As a last resort, I would use the root filesystem to store the journals,
if it's decently fast.
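
Something like this, for example (just a sketch, the path and size are
arbitrary):

    [osd]
        osd journal = /var/lib/ceph/journal/osd.$id/journal   ; journal file on the root filesystem
        osd journal size = 10240                              ; MB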

My 2 cents...

Cheers!

--
Bien cordialement.
Sébastien HAN.



On Wed, Oct 31, 2012 at 11:07 PM, Gandalf Corvotempesta
<gandalf.corvotempesta@gmail.com> wrote:
>
> 2012/10/31 Stefan Kleijkers <stefan@kleijkers.nl>:
> > As far as I know, this is correct. You get an ACK (on the write) back after
> > it has landed on ALL three journals (and/or OSDs in the case of BTRFS in
> > parallel mode). So if you lose one node, you still have the data on two more
> > nodes and they will commit it to disk. After the missing node/OSD recovers,
> > it will get the data from one of the other nodes. So you won't lose any data.
>
> Sounds perfect; this would allow me to avoid SSD disks and use all 12
> disks in a DELL R515 as OSDs.


* Re: Ceph journal
  2012-10-31 22:04     ` Stefan Kleijkers
  2012-10-31 22:07       ` Gandalf Corvotempesta
@ 2012-11-01 21:18       ` Gandalf Corvotempesta
  2012-11-01 21:27         ` Mark Nelson
  1 sibling, 1 reply; 14+ messages in thread
From: Gandalf Corvotempesta @ 2012-11-01 21:18 UTC (permalink / raw)
  To: Stefan Kleijkers; +Cc: Tren Blackburn, ceph-devel

2012/10/31 Stefan Kleijkers <stefan@kleijkers.nl>:
> As far as I know, this is correct. You get an ACK (on the write) back after
> it has landed on ALL three journals (and/or OSDs in the case of BTRFS in
> parallel mode). So if you lose one node, you still have the data on two more
> nodes and they will commit it to disk. After the missing node/OSD recovers,
> it will get the data from one of the other nodes. So you won't lose any data.

In this case I suppose that Ceph's write speed is bound by the
journal's write speed and never by the OSD disks.

Let's assume a journal of 150GB, capable of writing at 200MB/s, on a
2Gbit/s network (LACP between two gigabit ports), no replication between
OSDs, and a very, very slow SATA disk (5400 RPM, for example, much
slower than the journal). Just a single OSD.
Ceph will write at 200MB/s and flush the journal to disk in the
background, right?

So I can assume the journal is a buffer and RBD will write only to it.


* Re: Ceph journal
  2012-11-01 21:18       ` Gandalf Corvotempesta
@ 2012-11-01 21:27         ` Mark Nelson
  2012-11-01 21:33           ` Gandalf Corvotempesta
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Nelson @ 2012-11-01 21:27 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Stefan Kleijkers, Tren Blackburn, ceph-devel

On 11/01/2012 04:18 PM, Gandalf Corvotempesta wrote:
> 2012/10/31 Stefan Kleijkers <stefan@kleijkers.nl>:
>> As far as I know, this is correct. You get an ACK (on the write) back after
>> it has landed on ALL three journals (and/or OSDs in the case of BTRFS in
>> parallel mode). So if you lose one node, you still have the data on two more
>> nodes and they will commit it to disk. After the missing node/OSD recovers,
>> it will get the data from one of the other nodes. So you won't lose any data.
>
> In this case I suppose that Ceph's write speed is bound by the
> journal's write speed and never by the OSD disks.
>

Eventually you will need to write all of that data out to disk and 
writes to the journal will have to stop to allow the underlying disk to 
catch up.  In cases like that you will often see performance going along
speedily and then all of a sudden hit long pauses and possibly chaotic
performance characteristics.

> Let's assume a journal of 150GB, capable of writing at 200MB/s, on a
> 2Gbit/s network (LACP between two gigabit ports), no replication between
> OSDs, and a very, very slow SATA disk (5400 RPM, for example, much
> slower than the journal). Just a single OSD.
> Ceph will write at 200MB/s and flush the journal to disk in the
> background, right?

It will do that for a while, based on how you've tweaked the flush 
intervals and various journal settings to determine how much data ceph 
will allow to hang out in the journal while still accepting new requests.
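
The knobs I'm referring to are roughly these (a sketch only -- the values
here are illustrative, not recommendations):

    [osd]
        filestore min sync interval = 0.01     ; shortest gap between flushes to the backing fs (seconds)
        filestore max sync interval = 5        ; force a flush at least this often (seconds)
        journal max write bytes = 10485760     ; largest single write to the journal
        journal queue max bytes = 33554432     ; how much data may queue for the journal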

>
> So I can assume the journal is a buffer and RBD will write only to it.



* Re: Ceph journal
  2012-11-01 21:27         ` Mark Nelson
@ 2012-11-01 21:33           ` Gandalf Corvotempesta
  2012-11-03 17:29             ` Gregory Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Gandalf Corvotempesta @ 2012-11-01 21:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Kleijkers, Tren Blackburn, ceph-devel

2012/11/1 Mark Nelson <mark.nelson@inktank.com>:
> It will do that for a while, based on how you've tweaked the flush intervals
> and various journal settings to determine how much data ceph will allow to
> hang out in the journal while still accepting new requests.

Is Ceph not able to write to both the journal and the disks simultaneously?
For example, by using an SSD for the operating system and the journal, we
would be able to have no less than 100GB of journal, which is a large
amount of data written at SSD speed.

When Ceph has to flush this data to the disks, will it stop accepting new
writes to the journal?


* Re: Ceph journal
  2012-11-01 21:33           ` Gandalf Corvotempesta
@ 2012-11-03 17:29             ` Gregory Farnum
  2012-11-04 11:48               ` Gandalf Corvotempesta
  2012-11-05 13:06               ` Jean-Daniel BUSSY
  0 siblings, 2 replies; 14+ messages in thread
From: Gregory Farnum @ 2012-11-03 17:29 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Nelson, Stefan Kleijkers, Tren Blackburn, ceph-devel

On Thu, Nov 1, 2012 at 10:33 PM, Gandalf Corvotempesta
<gandalf.corvotempesta@gmail.com> wrote:
> 2012/11/1 Mark Nelson <mark.nelson@inktank.com>:
>> It will do that for a while, based on how you've tweaked the flush intervals
>> and various journal settings to determine how much data ceph will allow to
>> hang out in the journal while still accepting new requests.
>
> Is Ceph not able to write to both the journal and the disks simultaneously?
> For example, by using an SSD for the operating system and the journal, we
> would be able to have no less than 100GB of journal, which is a large
> amount of data written at SSD speed.
>
> When Ceph has to flush this data to the disks, will it stop accepting new
> writes to the journal?

No, of course it will flush data to the disk at the same time as it
will take writes to the journal. However, if you have a 1GB journal
that writes at 200MB/s and a backing disk that writes at 100MB/s, and
you then push 200MB/s through long enough that the journal fills up,
then you will slow down to writing at 100MB/s because that's as fast
as Ceph can fill up the backing store, and the journal is no longer
buffering.
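
As a back-of-the-envelope check with those example numbers (purely
illustrative):

    # 1 GB journal filling at the (200 - 100) MB/s ingest surplus:
    echo $(( 1024 / (200 - 100) ))   # ~10 seconds of buffering, then writes drop to ~100 MB/s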
-Greg


* Re: Ceph journal
  2012-11-03 17:29             ` Gregory Farnum
@ 2012-11-04 11:48               ` Gandalf Corvotempesta
  2012-11-05 13:06               ` Jean-Daniel BUSSY
  1 sibling, 0 replies; 14+ messages in thread
From: Gandalf Corvotempesta @ 2012-11-04 11:48 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Mark Nelson, Stefan Kleijkers, Tren Blackburn, ceph-devel

2012/11/3 Gregory Farnum <greg@inktank.com>:
> No, of course it will flush data to the disk at the same time as it
> will take writes to the journal. However, if you have a 1GB journal
> that writes at 200MB/s and a backing disk that writes at 100MB/s, and
> you then push 200MB/s through long enough that the journal fills up,
> then you will slow down to writing at 100MB/s because that's as fast
> as Ceph can fill up the backing store, and the journal is no longer
> buffering.

OK, now it's clear. But with a 100GB SSD, the journal should be big
enough to keep absorbing writes while the disks catch up.


* Re: Ceph journal
  2012-11-03 17:29             ` Gregory Farnum
  2012-11-04 11:48               ` Gandalf Corvotempesta
@ 2012-11-05 13:06               ` Jean-Daniel BUSSY
  1 sibling, 0 replies; 14+ messages in thread
From: Jean-Daniel BUSSY @ 2012-11-05 13:06 UTC (permalink / raw)
  To: Gregory Farnum, mark.nelson, gandalf.corvotempesta,
	Sébastien Han, ceph-devel

Interesting.
I am thinking about using a tmpfs journal in production if the
behavior of the cluster is the same as described by Gregory Farnum.
If applications experience long pauses as described by Mark Nelson, then
it's quite critical.
I hope the journal in tmpfs really acts as a buffer and doesn't just
commit to disk aggressively when full (I guess that is what would
create the long pauses).
During my tests with 10G of journal in tmpfs I have seen fairly
linear performance.
My concern is that write performance is ~100MB/s while it should
be 200MB/s, shouldn't it?
Compared to a journal on disk it's still a 2x performance improvement
(~55MB/s on disk).

I will run longer tests to see if I can reproduce the pauses and
report back here.
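
Probably with something along these lines (pool name, duration and
thread count are just placeholders):

    # sustained write load for 10 minutes against a test pool,
    # watching for throughput dips while the journal drains
    rados bench -p testpool 600 write -t 16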

-JD

BUSSY Jean-Daniel
Computer Engineer | CyberAgent
Mobile: +81 090-3317-1337
Email: silversurfer972@gmail.com
Mobile mail: jakku972@docomo.ne.jp


On Sun, Nov 4, 2012 at 2:29 AM, Gregory Farnum <greg@inktank.com> wrote:
> On Thu, Nov 1, 2012 at 10:33 PM, Gandalf Corvotempesta
> <gandalf.corvotempesta@gmail.com> wrote:
>> 2012/11/1 Mark Nelson <mark.nelson@inktank.com>:
>>> It will do that for a while, based on how you've tweaked the flush intervals
>>> and various journal settings to determine how much data ceph will allow to
>>> hang out in the journal while still accepting new requests.
>>
>> Is Ceph not able to write to both the journal and the disks simultaneously?
>> For example, by using an SSD for the operating system and the journal, we
>> would be able to have no less than 100GB of journal, which is a large
>> amount of data written at SSD speed.
>>
>> When Ceph has to flush this data to the disks, will it stop accepting new
>> writes to the journal?
>
> No, of course it will flush data to the disk at the same time as it
> will take writes to the journal. However, if you have a 1GB journal
> that writes at 200MB/s and a backing disk that writes at 100MB/s, and
> you then push 200MB/s through long enough that the journal fills up,
> then you will slow down to writing at 100MB/s because that's as fast
> as Ceph can fill up the backing store, and the journal is no longer
> buffering.
> -Greg


