* Re: Fwd: [newstore (again)] how disable double write WAL
       [not found] <9D046674-EA8B-4CB5-B049-3CF665D4ED64@aevoo.fr>
@ 2015-11-24 20:42 ` Sage Weil
       [not found]   ` <CA+gn+znHyioZhOvuidN1pvMgRMOMvjbjcues_+uayYVadetz=A@mail.gmail.com>
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2015-11-24 20:42 UTC (permalink / raw)
  To: Sébastien VALSEMEY
  Cc: Vish Maram-SSI, Ceph Development, David CASIER, Benoît LORIOT

[-- Attachment #1: Type: TEXT/PLAIN, Size: 31354 bytes --]

On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> Hello Vish,
> 
> Please accept my apologies for the delay in my answer.
> Following the conversation you had with my colleague David, here are 
> some more details about our work:
> 
> We are working on Filestore / Newstore optimizations by studying how we 
> could set ourselves free from using the journal.
> 
> It is very important to work with SSDs, but it is also mandatory to 
> combine them with regular magnetic platter disks. This is why we are 
> combining metadata storage on flash with data storage on disk.

This is pretty common, and something we will support natively with 
newstore.
 
> Our main goal is to have control over performance, which is quite 
> difficult with NewStore and needs fundamental hacks with FileStore.

Can you clarify what you mean by "quite difficult with NewStore"?

FWIW, the latest bleeding edge code is currently at 
github.com/liewegas/wip-bluestore.

sage


> Is Samsung working on ARM boards with embedded flash and a SATA port, in 
> order to allow us to work on a hybrid approach? What is your line of 
> work with Ceph?
> 
> How can we work together?
> 
> Regards,
> Sébastien
> 
> > Begin forwarded message:
> > 
> > From: David Casier <david.casier@aevoo.fr>
> > Date: 12 October 2015 20:52:26 UTC+2
> > To: Sage Weil <sage@newdream.net>, Ceph Development <ceph-devel@vger.kernel.org>
> > Cc: Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>, benoit.loriot@aevoo.fr, Denis Saget <geodni@gmail.com>, "luc.petetin" <luc.petetin@aevoo.fr>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > 
> > Ok,
> > Great.
> > 
> > With these settings:
> > //
> > newstore_max_dir_size = 4096
> > newstore_sync_io = true
> > newstore_sync_transaction = true
> > newstore_sync_submit_transaction = true
> > newstore_sync_wal_apply = true
> > newstore_overlay_max = 0
> > //
> > 
> > And direct I/O in the benchmark tool (fio).
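> > 
> > (For reference, a direct-I/O random-write run of the kind described above could be launched roughly like this; this is only a sketch, and the target path, queue depth and size are assumptions rather than the original job:)
> > 
> > fio --name=randwrite-256k --filename=/dev/rbd0 \
> >     --rw=randwrite --bs=256k --direct=1 \
> >     --ioengine=libaio --iodepth=16 --size=4G --runtime=120 --time_based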
> > 
> > I see that the HDD is 100% busy and there is no transfer from /db to /fragments after stopping the benchmark: great!
> > 
> > But when I launch a benchmark with random 256k blocks, I see random blocks between 32k and 256k on the HDD. Any idea?
> > 
> > Throughput to the HDD is about 8 MB/s when it could be higher with larger blocks (~30 MB/s),
> > and 70 MB/s without fsync (hard drive cache disabled).
> > 
> > Other questions:
> > newstore_sync_io -> true = fsync immediately, false = fsync later (thread fsync_wq)?
> > newstore_sync_transaction -> true = synchronous commit in the DB?
> > newstore_sync_submit_transaction -> if false then kv_queue (only if newstore_sync_transaction=false)?
> > newstore_sync_wal_apply -> if false then the WAL is applied later (thread wal_wq)?
> > 
> > Is that right?
> > 
> > Is there a way to use a battery-backed cache (sync the DB but not the data)?
> > 
> > Thanks for everything!
> > 
> > On 10/12/2015 03:01 PM, Sage Weil wrote:
> >> On Mon, 12 Oct 2015, David Casier wrote:
> >>> Hello everybody,
> >>> Is the fragment stored in rocksdb before being written to "/fragments"?
> >>> I separated "/db" and "/fragments", but during the benchmark everything is
> >>> written to "/db".
> >>> I changed the "newstore_sync_*" options without success.
> >>> 
> >>> Is there any way to write all metadata to "/db" and all data to "/fragments"?
> >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >> But if you are overwriting an existing object, doing write-ahead logging
> >> is usually unavoidable because we need to make the update atomic (and the
> >> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> >> mitigates this somewhat for larger writes by limiting fragment size, but
> >> for small IOs this is pretty much always going to be the case.  For small
> >> IOs, though, putting things in db/ is generally better since we can
> >> combine many small ios into a single (rocksdb) journal/wal write.  And
> >> often leave them there (via the 'overlay' behavior).
> >> 
> >> sage
> >> 
> > 
> > 
> > -- 
> > ________________________________________________________
> > 
> > Regards,
> > 
> > David CASIER
> > DCConsulting SARL
> > 
> > 
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> > 
> > Direct line: 01 75 98 53 85
> > Email: david.casier@aevoo.fr
> > ________________________________________________________
> > Begin forwarded message:
> > 
> > From: David Casier <david.casier@aevoo.fr>
> > Date: 2 November 2015 20:02:37 UTC+1
> > To: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@ssi.samsung.com>
> > Cc: benoit LORIOT <benoit.loriot@aevoo.fr>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > 
> > Hi Vish,
> > In FileStore, data and metadata are stored in files, with FS xattrs and omap.
> > NewStore works with RocksDB.
> > There are a lot of configuration options in RocksDB, but not all of them are implemented.
> > 
> > The best way, for me, is not to use the logs, relying on a secure cache (for example a Samsung 845DC SSD).
> > I don't think it is necessary to defer I/O if the metadata is well optimised.
> > 
> > The problem with RocksDB is that it is not possible to control the I/O block size.
> > 
> > We will resume work on NewStore soon.
> > 
> > On 10/29/2015 05:30 PM, Vish (Vishwanath) Maram-SSI wrote:
> >> Thanks David for the reply.
> >>  
> >> Yeah, we just wanted to know how different it is from Filestore and how we can contribute to it. My motive is to first understand the design of Newstore and find the performance bottlenecks so that we can try looking into them.
> >>  
> >> It would be helpful if you could share your ideas for using Newstore and your configuration. What plans do you have for contributions, to help us understand and see if we can work together?
> >>  
> >> Thanks,
> >> -Vish
> >> 
> >> From: David Casier [mailto:david.casier@aevoo.fr] 
> >> Sent: Thursday, October 29, 2015 4:41 AM
> >> To: Vish (Vishwanath) Maram-SSI
> >> Cc: benoit LORIOT; Sébastien VALSEMEY
> >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >>  
> >> Hi Vish,
> >> It's OK.
> >> 
> >> We have a lot of different configurations in our newstore tests.
> >> 
> >> What is your goal with it?
> >> 
> >> On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> >> Hi David,
> >>  
> >> Sorry for sending you the mail directly.
> >>  
> >> This is Vishwanath Maram from Samsung; I have started to play around with Newstore and am observing some issues when running FIO.
> >>  
> >> Can you please share the Ceph configuration file you used to run the I/Os with FIO?
> >>  
> >> Thanks,
> >> -Vish
> >>  
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of David Casier
> >> Sent: Monday, October 12, 2015 11:52 AM
> >> To: Sage Weil; Ceph Development
> >> Cc: Sébastien VALSEMEY; benoit.loriot@aevoo.fr; Denis Saget; luc.petetin
> >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >>  
> >> Ok,
> >> Great.
> >>  
> >> With these settings:
> >> //
> >> newstore_max_dir_size = 4096
> >> newstore_sync_io = true
> >> newstore_sync_transaction = true
> >> newstore_sync_submit_transaction = true
> >> newstore_sync_wal_apply = true
> >> newstore_overlay_max = 0
> >> //
> >>  
> >> And direct I/O in the benchmark tool (fio).
> >> 
> >> I see that the HDD is 100% busy and there is no transfer from /db to 
> >> /fragments after stopping the benchmark: great!
> >> 
> >> But when I launch a benchmark with random 256k blocks, I see random blocks 
> >> between 32k and 256k on the HDD. Any idea?
> >> 
> >> Throughput to the HDD is about 8 MB/s when it could be higher with larger 
> >> blocks (~30 MB/s),
> >> and 70 MB/s without fsync (hard drive cache disabled).
> >> 
> >> Other questions:
> >> newstore_sync_io -> true = fsync immediately, false = fsync later (thread 
> >> fsync_wq)?
> >> newstore_sync_transaction -> true = synchronous commit in the DB?
> >> newstore_sync_submit_transaction -> if false then kv_queue (only if 
> >> newstore_sync_transaction=false)?
> >> newstore_sync_wal_apply -> if false then the WAL is applied later (thread wal_wq)?
> >> 
> >> Is that right?
> >> 
> >> Is there a way to use a battery-backed cache (sync the DB but not the data)?
> >> 
> >> Thanks for everything!
> >>  
> >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> >> On Mon, 12 Oct 2015, David Casier wrote:
> >> Hello everybody,
> >> Is the fragment stored in rocksdb before being written to "/fragments"?
> >> I separated "/db" and "/fragments", but during the benchmark everything is written
> >> to "/db".
> >> I changed the "newstore_sync_*" options without success.
> >> 
> >> Is there any way to write all metadata to "/db" and all data to "/fragments"?
> >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >> But if you are overwriting an existing object, doing write-ahead logging
> >> is usually unavoidable because we need to make the update atomic (and the
> >> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> >> mitigates this somewhat for larger writes by limiting fragment size, but
> >> for small IOs this is pretty much always going to be the case.  For small
> >> IOs, though, putting things in db/ is generally better since we can
> >> combine many small ios into a single (rocksdb) journal/wal write.  And
> >> often leave them there (via the 'overlay' behavior).
> >>  
> >> sage
> >>  
> >>  
> >>  
> >>  
> >> 
> >> -- 
> >> ________________________________________________________ 
> >> 
> >> Cordialement,
> >> 
> >> David CASIER
> >> 
> >> 
> >> 4 Trait d'Union
> >> 77127 LIEUSAINT
> >> 
> >> Ligne directe: 01 75 98 53 85
> >> Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> >> ________________________________________________________
> > 
> > 
> > -- 
> > ________________________________________________________ 
> > 
> > Cordialement,
> > 
> > David CASIER
> > 
> >  
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> > 
> > Ligne directe: 01 75 98 53 85
> > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > ________________________________________________________
> > Begin forwarded message:
> > 
> > From: Sage Weil <sage@newdream.net>
> > Date: 12 October 2015 21:33:52 UTC+2
> > To: David Casier <david.casier@aevoo.fr>
> > Cc: Ceph Development <ceph-devel@vger.kernel.org>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>, benoit.loriot@aevoo.fr, Denis Saget <geodni@gmail.com>, "luc.petetin" <luc.petetin@aevoo.fr>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > 
> > Hi David-
> > 
> > On Mon, 12 Oct 2015, David Casier wrote:
> >> Ok,
> >> Great.
> >> 
> >> With these settings:
> >> //
> >> newstore_max_dir_size = 4096
> >> newstore_sync_io = true
> >> newstore_sync_transaction = true
> >> newstore_sync_submit_transaction = true
> > 
> > Is this a hard disk?  Those settings probably don't make sense since it 
> > does every IO synchronously, blocking the submitting IO path...
> > 
> >> newstore_sync_wal_apply = true
> >> newstore_overlay_max = 0
> >> //
> >> 
> >> And direct IO in the benchmark tool (fio)
> >> 
> >> I see that the HDD is 100% busy and there is no transfer from /db to
> >> /fragments after stopping the benchmark: great!
> >> 
> >> But when I launch a benchmark with random 256k blocks, I see random blocks
> >> between 32k and 256k on the HDD. Any idea?
> > 
> > Random IOs have to be write ahead logged in rocksdb, which has its own IO 
> > pattern.  Since you made everything sync above I think it'll depend on 
> > how many osd threads get batched together at a time.. maybe.  Those 
> > settings aren't something I've really tested, and probably only make 
> > sense with very fast NVMe devices.
> > 
> >> Throughput to the HDD is about 8 MB/s when it could be higher with larger blocks (~30 MB/s),
> >> and 70 MB/s without fsync (hard drive cache disabled).
> >> 
> >> Other questions:
> >> newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> >> fsync_wq)?
> > 
> > yes
> > 
> >> newstore_sync_transaction -> true = sync in DB ?
> > 
> > synchronously do the rocksdb commit too
> > 
> >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> >> newstore_sync_transaction=false) ?
> > 
> > yeah.. there is an annoying rocksdb behavior that makes an async 
> > transaction submit block if a sync one is in progress, so this queues them 
> > up and explicitly batches them.
> > 
> >> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > 
> > the txn commit completion threads can do the wal work synchronously.. this 
> > is only a good idea if it's doing aio (which it generally is).
> > 
> >> Is that right?
> >> 
> >> Is there a way to use a battery-backed cache (sync the DB but not the data)?
> > 
> > ?
> > s
> > 
> >> 
> >> Thanks for everything !
> >> 
> >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> >>> On Mon, 12 Oct 2015, David Casier wrote:
> >>>> Hello everybody,
> >>>> Is the fragment stored in rocksdb before being written to "/fragments"?
> >>>> I separated "/db" and "/fragments", but during the benchmark everything is
> >>>> written to "/db".
> >>>> I changed the "newstore_sync_*" options without success.
> >>>> 
> >>>> Is there any way to write all metadata to "/db" and all data to
> >>>> "/fragments"?
> >>> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >>> But if you are overwriting an existing object, doing write-ahead logging
> >>> is usually unavoidable because we need to make the update atomic (and the
> >>> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> >>> mitigates this somewhat for larger writes by limiting fragment size, but
> >>> for small IOs this is pretty much always going to be the case.  For small
> >>> IOs, though, putting things in db/ is generally better since we can
> >>> combine many small ios into a single (rocksdb) journal/wal write.  And
> >>> often leave them there (via the 'overlay' behavior).
> >>> 
> >>> sage
> >>> 
> >> 
> >> 
> >> -- 
> >> ________________________________________________________
> >> 
> >> Regards,
> >> 
> >> David CASIER
> >> DCConsulting SARL
> >> 
> >> 
> >> 4 Trait d'Union
> >> 77127 LIEUSAINT
> >> 
> >> Direct line: 01 75 98 53 85
> >> Email: david.casier@aevoo.fr
> >> ________________________________________________________
> >> 
> >> 
> > Begin forwarded message:
> > 
> > From: David Casier <david.casier@aevoo.fr>
> > Date: 29 October 2015 12:41:22 UTC+1
> > To: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@ssi.samsung.com>
> > Cc: benoit LORIOT <benoit.loriot@aevoo.fr>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > 
> > Hi Vish,
> > It's OK.
> > 
> > We have a lot of different configurations in our newstore tests.
> > 
> > What is your goal with it?
> > 
> > On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> >> Hi David,
> >> 
> >> Sorry for sending you the mail directly.
> >> 
> >> This is Vishwanath Maram from Samsung; I have started to play around with Newstore and am observing some issues when running FIO.
> >> 
> >> Can you please share the Ceph configuration file you used to run the I/Os with FIO?
> >> 
> >> Thanks,
> >> -Vish
> >> 
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of David Casier
> >> Sent: Monday, October 12, 2015 11:52 AM
> >> To: Sage Weil; Ceph Development
> >> Cc: Sébastien VALSEMEY; benoit.loriot@aevoo.fr; Denis Saget; luc.petetin
> >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >> 
> >> Ok,
> >> Great.
> >> 
> >> With these settings:
> >> //
> >> newstore_max_dir_size = 4096
> >> newstore_sync_io = true
> >> newstore_sync_transaction = true
> >> newstore_sync_submit_transaction = true
> >> newstore_sync_wal_apply = true
> >> newstore_overlay_max = 0
> >> //
> >> 
> >> And direct I/O in the benchmark tool (fio).
> >> 
> >> I see that the HDD is 100% busy and there is no transfer from /db to 
> >> /fragments after stopping the benchmark: great!
> >> 
> >> But when I launch a benchmark with random 256k blocks, I see random blocks 
> >> between 32k and 256k on the HDD. Any idea?
> >> 
> >> Throughput to the HDD is about 8 MB/s when it could be higher with larger 
> >> blocks (~30 MB/s),
> >> and 70 MB/s without fsync (hard drive cache disabled).
> >> 
> >> Other questions:
> >> newstore_sync_io -> true = fsync immediately, false = fsync later (thread 
> >> fsync_wq)?
> >> newstore_sync_transaction -> true = synchronous commit in the DB?
> >> newstore_sync_submit_transaction -> if false then kv_queue (only if 
> >> newstore_sync_transaction=false)?
> >> newstore_sync_wal_apply -> if false then the WAL is applied later (thread wal_wq)?
> >> 
> >> Is that right?
> >> 
> >> Is there a way to use a battery-backed cache (sync the DB but not the data)?
> >> 
> >> Thanks for everything!
> >> 
> >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> >>> On Mon, 12 Oct 2015, David Casier wrote:
> >>>> Hello everybody,
> >>>> Is the fragment stored in rocksdb before being written to "/fragments"?
> >>>> I separated "/db" and "/fragments", but during the benchmark everything is written
> >>>> to "/db".
> >>>> I changed the "newstore_sync_*" options without success.
> >>>> 
> >>>> Is there any way to write all metadata to "/db" and all data to "/fragments"?
> >>> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >>> But if you are overwriting an existing object, doing write-ahead logging
> >>> is usually unavoidable because we need to make the update atomic (and the
> >>> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> >>> mitigates this somewhat for larger writes by limiting fragment size, but
> >>> for small IOs this is pretty much always going to be the case.  For small
> >>> IOs, though, putting things in db/ is generally better since we can
> >>> combine many small ios into a single (rocksdb) journal/wal write.  And
> >>> often leave them there (via the 'overlay' behavior).
> >>> 
> >>> sage
> >>> 
> >> 
> > 
> > 
> > -- 
> > ________________________________________________________ 
> > 
> > Cordialement,
> > 
> > David CASIER
> > 
> >  
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> > 
> > Ligne directe: 01 75 98 53 85
> > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > ________________________________________________________
> > Begin forwarded message:
> > 
> > From: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@ssi.samsung.com>
> > Date: 29 October 2015 17:30:56 UTC+1
> > To: David Casier <david.casier@aevoo.fr>
> > Cc: benoit LORIOT <benoit.loriot@aevoo.fr>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>
> > Subject: RE: Fwd: [newstore (again)] how disable double write WAL
> > 
> > Thanks David for the reply.
> >  
> > Yeah, we just wanted to know how different it is from Filestore and how we can contribute to it. My motive is to first understand the design of Newstore and find the performance bottlenecks so that we can try looking into them.
> >  
> > It would be helpful if you could share your ideas for using Newstore and your configuration. What plans do you have for contributions, to help us understand and see if we can work together?
> >  
> > Thanks,
> > -Vish
> > 
> > From: David Casier [mailto:david.casier@aevoo.fr] 
> > Sent: Thursday, October 29, 2015 4:41 AM
> > To: Vish (Vishwanath) Maram-SSI
> > Cc: benoit LORIOT; Sébastien VALSEMEY
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >  
> > Hi Vish,
> > It's OK.
> > 
> > We have a lot of different configurations in our newstore tests.
> > 
> > What is your goal with it?
> > 
> > On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > Hi David,
> >  
> > Sorry for sending you the mail directly.
> >  
> > This is Vishwanath Maram from Samsung; I have started to play around with Newstore and am observing some issues when running FIO.
> >  
> > Can you please share the Ceph configuration file you used to run the I/Os with FIO?
> >  
> > Thanks,
> > -Vish
> >  
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of David Casier
> > Sent: Monday, October 12, 2015 11:52 AM
> > To: Sage Weil; Ceph Development
> > Cc: Sébastien VALSEMEY; benoit.loriot@aevoo.fr; Denis Saget; luc.petetin
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >  
> > Ok,
> > Great.
> >  
> > With these settings:
> > //
> > newstore_max_dir_size = 4096
> > newstore_sync_io = true
> > newstore_sync_transaction = true
> > newstore_sync_submit_transaction = true
> > newstore_sync_wal_apply = true
> > newstore_overlay_max = 0
> > //
> >  
> > And direct I/O in the benchmark tool (fio).
> > 
> > I see that the HDD is 100% busy and there is no transfer from /db to 
> > /fragments after stopping the benchmark: great!
> > 
> > But when I launch a benchmark with random 256k blocks, I see random blocks 
> > between 32k and 256k on the HDD. Any idea?
> > 
> > Throughput to the HDD is about 8 MB/s when it could be higher with larger 
> > blocks (~30 MB/s),
> > and 70 MB/s without fsync (hard drive cache disabled).
> > 
> > Other questions:
> > newstore_sync_io -> true = fsync immediately, false = fsync later (thread 
> > fsync_wq)?
> > newstore_sync_transaction -> true = synchronous commit in the DB?
> > newstore_sync_submit_transaction -> if false then kv_queue (only if 
> > newstore_sync_transaction=false)?
> > newstore_sync_wal_apply -> if false then the WAL is applied later (thread wal_wq)?
> > 
> > Is that right?
> > 
> > Is there a way to use a battery-backed cache (sync the DB but not the data)?
> > 
> > Thanks for everything!
> >  
> > On 10/12/2015 03:01 PM, Sage Weil wrote:
> > On Mon, 12 Oct 2015, David Casier wrote:
> > Hello everybody,
> > Is the fragment stored in rocksdb before being written to "/fragments"?
> > I separated "/db" and "/fragments", but during the benchmark everything is written
> > to "/db".
> > I changed the "newstore_sync_*" options without success.
> >  
> > Is there any way to write all metadata to "/db" and all data to "/fragments"?
> > You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > But if you are overwriting an existing object, doing write-ahead logging
> > is usually unavoidable because we need to make the update atomic (and the
> > underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > mitigates this somewhat for larger writes by limiting fragment size, but
> > for small IOs this is pretty much always going to be the case.  For small
> > IOs, though, putting things in db/ is generally better since we can
> > combine many small ios into a single (rocksdb) journal/wal write.  And
> > often leave them there (via the 'overlay' behavior).
> >  
> > sage
> >  
> >  
> >  
> >  
> > 
> > -- 
> > ________________________________________________________ 
> > 
> > Cordialement,
> > 
> > David CASIER
> > 
> > 
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> > 
> > Ligne directe: 01 75 98 53 85
> > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > ________________________________________________________
> > Begin forwarded message:
> > 
> > From: David Casier <david.casier@aevoo.fr>
> > Date: 14 October 2015 22:03:38 UTC+2
> > To: Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>, benoit.loriot@aevoo.fr
> > Cc: Denis Saget <geodni@gmail.com>, "luc.petetin" <luc.petetin@aevoo.fr>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > 
> > Good evening gentlemen, 
> > I have just been through my first real Ceph fire.
> > Loic Dachary backed me up well on this one.
> > 
> > I can tell you one thing: however well you think you master the product, it is during an incident that you realise how many factors you need to know by heart.
> > So, no panic; I really take this evening's experience as a success and as an excellent boost.
> > 
> > Explanation:
> >  - LI played a bit too much with the crushmap (I will go into the fine technical details another day)
> >  - Update and restart of the OSDs
> >  - The OSDs no longer knew where the data was
> >  - Rebuilt the crushmap by hand and off it went.
> > 
> > Nothing really serious in itself, and a big plus (++++) for our image at LI (I would have lost another 1 to 2 hours without Loic, who did not get sidetracked).
> > 
> > Conclusion: 
> > We are going to work together on stress tests, a bit like RedHat validations: one platform, I break it, you repair it.
> > You will have as much time as you need to find the problem (I have sometimes spent a few days on certain things).
> > 
> > Objectives: 
> >  - Master a checklist of verifications to run 
> >  - Replay it every week if there are many mistakes
> >  - Every month if there are a few mistakes
> >  - Every 3 months if it is well mastered 
> >  - ...
> > 
> > We have to be on top of things, and some actions must become reflexes (checking the crushmap, knowing how to find the data without the processes, ...).
> > Especially since the customer has to be reassured in case of an incident (or not).
> > 
> > And frankly, Ceph is really fascinating!
> > 
> >   On 10/12/2015 09:33 PM, Sage Weil wrote:
> >> Hi David-
> >> 
> >> On Mon, 12 Oct 2015, David Casier wrote:
> >>> Ok,
> >>> Great.
> >>> 
> >>> With these settings:
> >>> //
> >>> newstore_max_dir_size = 4096
> >>> newstore_sync_io = true
> >>> newstore_sync_transaction = true
> >>> newstore_sync_submit_transaction = true
> >> Is this a hard disk?  Those settings probably don't make sense since it 
> >> does every IO synchronously, blocking the submitting IO path...
> >> 
> >>> newstore_sync_wal_apply = true
> >>> newstore_overlay_max = 0
> >>> //
> >>> 
> >>> And direct IO in the benchmark tool (fio)
> >>> 
> >>> I see that the HDD is 100% busy and there is no transfer from /db to
> >>> /fragments after stopping the benchmark: great!
> >>> 
> >>> But when I launch a benchmark with random 256k blocks, I see random blocks
> >>> between 32k and 256k on the HDD. Any idea?
> >> Random IOs have to be write ahead logged in rocksdb, which has its own IO 
> >> pattern.  Since you made everything sync above I think it'll depend on 
> >> how many osd threads get batched together at a time.. maybe.  Those 
> >> settings aren't something I've really tested, and probably only make 
> >> sense with very fast NVMe devices.
> >> 
> >>> Throughput to the HDD is about 8 MB/s when it could be higher with larger blocks (~30 MB/s),
> >>> and 70 MB/s without fsync (hard drive cache disabled).
> >>> 
> >>> Other questions:
> >>> newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> >>> fsync_wq)?
> >> yes
> >> 
> >>> newstore_sync_transaction -> true = sync in DB ?
> >> synchronously do the rocksdb commit too
> >> 
> >>> newstore_sync_submit_transaction -> if false then kv_queue (only if
> >>> newstore_sync_transaction=false) ?
> >> yeah.. there is an annoying rocksdb behavior that makes an async 
> >> transaction submit block if a sync one is in progress, so this queues them 
> >> up and explicitly batches them.
> >> 
> >>> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> >> the txn commit completion threads can do the wal work synchronously.. this 
> >> is only a good idea if it's doing aio (which it generally is).
> >> 
> >>> Is that right?
> >>> 
> >>> Is there a way to use a battery-backed cache (sync the DB but not the data)?
> >> ?
> >> s
> >> 
> >>> Thanks for everything !
> >>> 
> >>> On 10/12/2015 03:01 PM, Sage Weil wrote:
> >>>> On Mon, 12 Oct 2015, David Casier wrote:
> >>>>> Hello everybody,
> >>>>> fragment is stored in rocksdb before being written to "/fragments" ?
> >>>>> I separed "/db" and "/fragments" but during the bench, everything is
> >>>>> writing
> >>>>> to "/db"
> >>>>> I changed options "newstore_sync_*" without success.
> >>>>> 
> >>>>> Is there any way to write all metadata in "/db" and all data in
> >>>>> "/fragments" ?
> >>>> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >>>> But if you are overwriting an existing object, doing write-ahead logging
> >>>> is usually unavoidable because we need to make the update atomic (and the
> >>>> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> >>>> mitigates this somewhat for larger writes by limiting fragment size, but
> >>>> for small IOs this is pretty much always going to be the case.  For small
> >>>> IOs, though, putting things in db/ is generally better since we can
> >>>> combine many small ios into a single (rocksdb) journal/wal write.  And
> >>>> often leave them there (via the 'overlay' behavior).
> >>>> 
> >>>> sage
> >>>> 
> >>> 
> >>> -- 
> >>> ________________________________________________________
> >>> 
> >>> Regards,
> >>> 
> >>> David CASIER
> >>> DCConsulting SARL
> >>> 
> >>> 
> >>> 4 Trait d'Union
> >>> 77127 LIEUSAINT
> >>> 
> >>> Direct line: 01 75 98 53 85
> >>> Email: david.casier@aevoo.fr
> >>> ________________________________________________________
> > 
> > 
> > -- 
> > ________________________________________________________ 
> > 
> > Cordialement,
> > 
> > David CASIER
> > DCConsulting SARL
> > 
> >  
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> > 
> > Ligne directe: 01 75 98 53 85
> > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > ________________________________________________________
> 
> 


* Fwd: Fwd: [newstore (again)] how disable double write WAL
       [not found]   ` <CA+gn+znHyioZhOvuidN1pvMgRMOMvjbjcues_+uayYVadetz=A@mail.gmail.com>
@ 2015-12-01 20:34     ` David Casier
  2015-12-01 22:02       ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2015-12-01 20:34 UTC (permalink / raw)
  To: Ceph Development

FYI
---------- Forwarded message ----------
From: David Casier <david.casier@aevoo.fr>
Date: 2015-12-01 21:32 GMT+01:00
Subject: Re: Fwd: [newstore (again)] how disable double write WAL
To: Sage Weil <sage@newdream.net>
Cc: Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>, Vish Maram-SSI
<vishwanath.m@ssi.samsung.com>, Ceph Development
<ceph-devel@vger.kernel.org>, Benoît LORIOT <benoit.loriot@aevoo.fr>,
pascal.billery-schneider@laposte.net


Hi Sage,
With a standard disk (4 to 6 TB) and a small flash drive, it's easy
to create an ext4 FS with the metadata on flash.

Example with sdg1 on flash and sdb on the HDD:

size_of() {
  # Device size in 512-byte sectors (the unit dmsetup expects).
  blockdev --getsize "$1"
}

mkdmsetup() {
  _ssd=/dev/$1
  _hdd=/dev/$2
  _size_of_ssd=$(size_of "$_ssd")
  # Two-segment linear table: the SSD is mapped first, so the ext4 metadata
  # packed at the start of the filesystem lands on flash; the HDD follows it.
  echo "0 $_size_of_ssd linear $_ssd 0
$_size_of_ssd $(size_of "$_hdd") linear $_hdd 0" | dmsetup create dm-${1}-${2}
}

mkdmsetup sdg1 sdb

mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
  -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 \
  -i $((1024*512)) /dev/mapper/dm-sdg1-sdb

With that, all the ext4 metadata blocks are on the SSD.
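
A quick way to sanity-check that the metadata really sits on the flash part of the concatenated device (only a sketch, using the device names from the example above):

# The SSD covers the start of the mapped device, so metadata blocks with low
# block numbers are on flash.
dmsetup table dm-sdg1-sdb                   # shows the two linear segments
dumpe2fs /dev/mapper/dm-sdg1-sdb | grep -E 'Block bitmap at|Inode table at' | head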

If the omap is also on the SSD, there is almost no metadata left on the HDD.
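
For the omap part, one common approach (a sketch; the paths and the systemd unit assume a default FileStore OSD layout, and the OSD must be stopped first) is to move the omap directory onto a flash-backed filesystem and leave a symlink behind:

systemctl stop ceph-osd@0
mkdir -p /mnt/osd0-omap && mount /dev/sdg2 /mnt/osd0-omap   # hypothetical SSD partition
rsync -a /var/lib/ceph/osd/ceph-0/current/omap/ /mnt/osd0-omap/
mv /var/lib/ceph/osd/ceph-0/current/omap /var/lib/ceph/osd/ceph-0/current/omap.old
ln -s /mnt/osd0-omap /var/lib/ceph/osd/ceph-0/current/omap
systemctl start ceph-osd@0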

Consequence: Ceph performance (with a hack on filestore, running without the
journal and with direct I/O) is almost the same as the performance of the HDD.

With a cache tier, it's very cool!
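
Setting up such a cache tier uses the standard tiering commands, e.g. (a sketch; the pool names and PG count are placeholders, not our actual configuration):

ceph osd pool create cache-pool 128
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay rbd cache-pool
ceph osd pool set cache-pool hit_set_type bloom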

That is why we are working on a hybrid HDD / flash approach on ARM or Intel.

With newstore, it's much more difficult to control the I/O profile,
because RocksDB embeds its own intelligence.

In the (near) future, we will create a portal presenting our hardware
solution under the CERN OHL license.

(My lack of fluency in English explains the latency of my answers.)

2015-11-24 21:42 GMT+01:00 Sage Weil <sage@newdream.net>:
>
> On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> > Hello Vish,
> >
> > Please apologize for the delay in my answer.
> > Following the conversation you had with my colleague David, here are
> > some more details about our work :
> >
> > We are working on Filestore / Newstore optimizations by studying how we
> > could set ourselves free from using the journal.
> >
> > It is very important to work with SSD, but it is also mandatory to
> > combine it with regular magnetic platter disks. This is why we are
> > combining metadata storing on flash with data storing on disk.
>
> This is pretty common, and something we will support natively with
> newstore.
>
> > Our main goal is to have the control on performance. Which is quite
> > difficult with NewStore, and needs fundamental hacks with FileStore.
>
> Can you clarify what you mean by "quite difficult with NewStore"?
>
> FWIW, the latest bleeding edge code is currently at
> github.com/liewegas/wip-bluestore.
>
> sage
>
>
> > Is Samsung working on ARM boards with embedded flash and a SATA port, in
> > order to allow us to work on a hybrid approach? What is your line of
> > work with Ceph?
> >
> > How can we work together ?
> >
> > Regards,
> > Sébastien
> >
> > > Début du message réexpédié :
> > >
> > > De: David Casier <david.casier@aevoo.fr>
> > > Date: 12 octobre 2015 20:52:26 UTC+2
> > > À: Sage Weil <sage@newdream.net>, Ceph Development <ceph-devel@vger.kernel.org>
> > > Cc: Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>, benoit.loriot@aevoo.fr, Denis Saget <geodni@gmail.com>, "luc.petetin" <luc.petetin@aevoo.fr>
> > > Objet: Rép : Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Ok,
> > > Great.
> > >
> > > With these  settings :
> > > //
> > > newstore_max_dir_size = 4096
> > > newstore_sync_io = true
> > > newstore_sync_transaction = true
> > > newstore_sync_submit_transaction = true
> > > newstore_sync_wal_apply = true
> > > newstore_overlay_max = 0
> > > //
> > >
> > > And direct IO in the benchmark tool (fio)
> > >
> > > I see that the HDD is 100% charged and there are notransfer of /db to /fragments after stopping benchmark : Great !
> > >
> > > But when i launch a bench with random blocs of 256k, i see random blocs between 32k and 256k on HDD. Any idea ?
> > >
> > > Debits to the HDD are about 8MBps when they could be higher with larger blocs (~30MBps)
> > > And 70 MBps without fsync (hard drive cache disabled).
> > >
> > > Other questions :
> > > newstore_sync_io -> true = fsync immediatly, false = fsync later (Thread fsync_wq) ?
> > > newstore_sync_transaction -> true = sync in DB ?
> > > newstore_sync_submit_transaction -> if false then kv_queue (only if newstore_sync_transaction=false) ?
> > > newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > >
> > > Is it true ?
> > >
> > > Way for cache with battery (sync DB and no sync data) ?
> > >
> > > Thanks for everything !
> > >
> > > On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >> On Mon, 12 Oct 2015, David Casier wrote:
> > >>> Hello everybody,
> > >>> fragment is stored in rocksdb before being written to "/fragments" ?
> > >>> I separed "/db" and "/fragments" but during the bench, everything is writing
> > >>> to "/db"
> > >>> I changed options "newstore_sync_*" without success.
> > >>>
> > >>> Is there any way to write all metadata in "/db" and all data in "/fragments" ?
> > >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >> But if you are overwriting an existing object, doing write-ahead logging
> > >> is usually unavoidable because we need to make the update atomic (and the
> > >> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > >> mitigates this somewhat for larger writes by limiting fragment size, but
> > >> for small IOs this is pretty much always going to be the case.  For small
> > >> IOs, though, putting things in db/ is generally better since we can
> > >> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >> often leave them there (via the 'overlay' behavior).
> > >>
> > >> sage
> > >>
> > >
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Cordialement,
> > >
> > > *David CASIER
> > > DCConsulting SARL
> > >
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > **Ligne directe: _01 75 98 53 85_
> > > Email: _david.casier@aevoo.fr_
> > > * ________________________________________________________
> > > Début du message réexpédié :
> > >
> > > De: David Casier <david.casier@aevoo.fr>
> > > Date: 2 novembre 2015 20:02:37 UTC+1
> > > À: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@ssi.samsung.com>
> > > Cc: benoit LORIOT <benoit.loriot@aevoo.fr>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>
> > > Objet: Rép : Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi Vish,
> > > In FileStore, data and metadata are stored in files, with xargs FS and omap.
> > > NewStore works with RocksDB.
> > > There are a lot of configuration in RocksDB but all options not implemented.
> > >
> > > The best way, for me, is not to use the logs, with secure cache (for example SSD 845DC).
> > > I don't think that is necessary to report I/O with a good metadata optimisation.
> > >
> > > The problem with RocksDB is that not possible to control I/O blocs size.
> > >
> > > We will resume work on NewStore soon.
> > >
> > > On 10/29/2015 05:30 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Thanks David for the reply.
> > >>
> > >> Yeah We just wanted to know how different is it from Filestore and how do we contribute for this? My motive is to first understand the design of Newstore and get the Performance loopholes so that we can try looking into it.
> > >>
> > >> It would be helpful if you can share what is your idea from your side to use Newstore and configuration? What plans you are having for contributions to help us understand and see if we can work together.
> > >>
> > >> Thanks,
> > >> -Vish
> > >>   <>
> > >> From: David Casier [mailto:david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>]
> > >> Sent: Thursday, October 29, 2015 4:41 AM
> > >> To: Vish (Vishwanath) Maram-SSI
> > >> Cc: benoit LORIOT; Sébastien VALSEMEY
> > >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >>
> > >> Hi Vish,
> > >> It's OK.
> > >>
> > >> We have a lot of different configuration with newstore tests.
> > >>
> > >> What is your goal with ?
> > >>
> > >> On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Hi David,
> > >>
> > >> Sorry for sending you the mail directly.
> > >>
> > >> This is Vishwanath Maram from Samsung and started to play around with Newstore and observing some issues with running FIO.
> > >>
> > >> Can you please share your Ceph Configuration file which you have used to run the IO's using FIO?
> > >>
> > >> Thanks,
> > >> -Vish
> > >>
> > >> -----Original Message-----
> > >> From: ceph-devel-owner@vger.kernel.org <mailto:ceph-devel-owner@vger.kernel.org> [mailto:ceph-devel-owner@vger.kernel.org <mailto:ceph-devel-owner@vger.kernel.org>] On Behalf Of David Casier
> > >> Sent: Monday, October 12, 2015 11:52 AM
> > >> To: Sage Weil; Ceph Development
> > >> Cc: Sébastien VALSEMEY; benoit.loriot@aevoo.fr <mailto:benoit.loriot@aevoo.fr>; Denis Saget; luc.petetin
> > >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >>
> > >> Ok,
> > >> Great.
> > >>
> > >> With these  settings :
> > >> //
> > >> newstore_max_dir_size = 4096
> > >> newstore_sync_io = true
> > >> newstore_sync_transaction = true
> > >> newstore_sync_submit_transaction = true
> > >> newstore_sync_wal_apply = true
> > >> newstore_overlay_max = 0
> > >> //
> > >>
> > >> And direct IO in the benchmark tool (fio)
> > >>
> > >> I see that the HDD is 100% charged and there are notransfer of /db to
> > >> /fragments after stopping benchmark : Great !
> > >>
> > >> But when i launch a bench with random blocs of 256k, i see random blocs
> > >> between 32k and 256k on HDD. Any idea ?
> > >>
> > >> Debits to the HDD are about 8MBps when they could be higher with larger
> > >> blocs (~30MBps)
> > >> And 70 MBps without fsync (hard drive cache disabled).
> > >>
> > >> Other questions :
> > >> newstore_sync_io -> true = fsync immediatly, false = fsync later (Thread
> > >> fsync_wq) ?
> > >> newstore_sync_transaction -> true = sync in DB ?
> > >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> > >> newstore_sync_transaction=false) ?
> > >> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > >>
> > >> Is it true ?
> > >>
> > >> Way for cache with battery (sync DB and no sync data) ?
> > >>
> > >> Thanks for everything !
> > >>
> > >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >> On Mon, 12 Oct 2015, David Casier wrote:
> > >> Hello everybody,
> > >> fragment is stored in rocksdb before being written to "/fragments" ?
> > >> I separed "/db" and "/fragments" but during the bench, everything is writing
> > >> to "/db"
> > >> I changed options "newstore_sync_*" without success.
> > >>
> > >> Is there any way to write all metadata in "/db" and all data in "/fragments" ?
> > >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >> But if you are overwriting an existing object, doing write-ahead logging
> > >> is usually unavoidable because we need to make the update atomic (and the
> > >> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > >> mitigates this somewhat for larger writes by limiting fragment size, but
> > >> for small IOs this is pretty much always going to be the case.  For small
> > >> IOs, though, putting things in db/ is generally better since we can
> > >> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >> often leave them there (via the 'overlay' behavior).
> > >>
> > >> sage
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> ________________________________________________________
> > >>
> > >> Cordialement,
> > >>
> > >> David CASIER
> > >>
> > >>
> > >> 4 Trait d'Union
> > >> 77127 LIEUSAINT
> > >>
> > >> Ligne directe: 01 75 98 53 85
> > >> Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > >> ________________________________________________________
> > >
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Cordialement,
> > >
> > > David CASIER
> > >
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Ligne directe: 01 75 98 53 85
> > > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > > ________________________________________________________
> > > Début du message réexpédié :
> > >
> > > De: Sage Weil <sage@newdream.net>
> > > Date: 12 octobre 2015 21:33:52 UTC+2
> > > À: David Casier <david.casier@aevoo.fr>
> > > Cc: Ceph Development <ceph-devel@vger.kernel.org>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>, benoit.loriot@aevoo.fr, Denis Saget <geodni@gmail.com>, "luc.petetin" <luc.petetin@aevoo.fr>
> > > Objet: Rép : Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi David-
> > >
> > > On Mon, 12 Oct 2015, David Casier wrote:
> > >> Ok,
> > >> Great.
> > >>
> > >> With these  settings :
> > >> //
> > >> newstore_max_dir_size = 4096
> > >> newstore_sync_io = true
> > >> newstore_sync_transaction = true
> > >> newstore_sync_submit_transaction = true
> > >
> > > Is this a hard disk?  Those settings probably don't make sense since it
> > > does every IO synchronously, blocking the submitting IO path...
> > >
> > >> newstore_sync_wal_apply = true
> > >> newstore_overlay_max = 0
> > >> //
> > >>
> > >> And direct IO in the benchmark tool (fio)
> > >>
> > >> I see that the HDD is 100% charged and there are notransfer of /db to
> > >> /fragments after stopping benchmark : Great !
> > >>
> > >> But when i launch a bench with random blocs of 256k, i see random blocs
> > >> between 32k and 256k on HDD. Any idea ?
> > >
> > > Random IOs have to be write ahead logged in rocksdb, which has its own IO
> > > pattern.  Since you made everything sync above I think it'll depend on
> > > how many osd threads get batched together at a time.. maybe.  Those
> > > settings aren't something I've really tested, and probably only make
> > > sense with very fast NVMe devices.
> > >
> > >> Debits to the HDD are about 8MBps when they could be higher with larger blocs> (~30MBps)
> > >> And 70 MBps without fsync (hard drive cache disabled).
> > >>
> > >> Other questions :
> > >> newstore_sync_io -> true = fsync immediatly, false = fsync later (Thread
> > >> fsync_wq) ?
> > >
> > > yes
> > >
> > >> newstore_sync_transaction -> true = sync in DB ?
> > >
> > > synchronously do the rocksdb commit too
> > >
> > >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> > >> newstore_sync_transaction=false) ?
> > >
> > > yeah.. there is an annoying rocksdb behavior that makes an async
> > > transaction submit block if a sync one is in progress, so this queues them
> > > up and explicitly batches them.
> > >
> > >> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > >
> > > the txn commit completion threads can do the wal work synchronously.. this
> > > is only a good idea if it's doing aio (which it generally is).
> > >
> > >> Is it true ?
> > >>
> > >> Way for cache with battery (sync DB and no sync data) ?
> > >
> > > ?
> > > s
> > >
> > >>
> > >> Thanks for everything !
> > >>
> > >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >>> On Mon, 12 Oct 2015, David Casier wrote:
> > >>>> Hello everybody,
> > >>>> fragment is stored in rocksdb before being written to "/fragments" ?
> > >>>> I separed "/db" and "/fragments" but during the bench, everything is
> > >>>> writing
> > >>>> to "/db"
> > >>>> I changed options "newstore_sync_*" without success.
> > >>>>
> > >>>> Is there any way to write all metadata in "/db" and all data in
> > >>>> "/fragments" ?
> > >>> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >>> But if you are overwriting an existing object, doing write-ahead logging
> > >>> is usually unavoidable because we need to make the update atomic (and the
> > >>> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > >>> mitigates this somewhat for larger writes by limiting fragment size, but
> > >>> for small IOs this is pretty much always going to be the case.  For small
> > >>> IOs, though, putting things in db/ is generally better since we can
> > >>> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >>> often leave them there (via the 'overlay' behavior).
> > >>>
> > >>> sage
> > >>>
> > >>
> > >>
> > >> --
> > >> ________________________________________________________
> > >>
> > >> Cordialement,
> > >>
> > >> *David CASIER
> > >> DCConsulting SARL
> > >>
> > >>
> > >> 4 Trait d'Union
> > >> 77127 LIEUSAINT
> > >>
> > >> **Ligne directe: _01 75 98 53 85_
> > >> Email: _david.casier@aevoo.fr_
> > >> * ________________________________________________________
> > >>
> > >>
> > > Début du message réexpédié :
> > >
> > > De: David Casier <david.casier@aevoo.fr>
> > > Date: 29 octobre 2015 12:41:22 UTC+1
> > > À: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@ssi.samsung.com>
> > > Cc: benoit LORIOT <benoit.loriot@aevoo.fr>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>
> > > Objet: Rép : Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi Vish,
> > > It's OK.
> > >
> > > We have a lot of different configuration with newstore tests.
> > >
> > > What is your goal with ?
> > >
> > > On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Hi David,
> > >>
> > >> Sorry for sending you the mail directly.
> > >>
> > >> This is Vishwanath Maram from Samsung and started to play around with Newstore and observing some issues with running FIO.
> > >>
> > >> Can you please share your Ceph Configuration file which you have used to run the IO's using FIO?
> > >>
> > >> Thanks,
> > >> -Vish
> > >>
> > >> -----Original Message-----
> > >> From: ceph-devel-owner@vger.kernel.org <mailto:ceph-devel-owner@vger.kernel.org> [mailto:ceph-devel-owner@vger.kernel.org <mailto:ceph-devel-owner@vger.kernel.org>] On Behalf Of David Casier
> > >> Sent: Monday, October 12, 2015 11:52 AM
> > >> To: Sage Weil; Ceph Development
> > >> Cc: Sébastien VALSEMEY; benoit.loriot@aevoo.fr <mailto:benoit.loriot@aevoo.fr>; Denis Saget; luc.petetin
> > >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >>
> > >> Ok,
> > >> Great.
> > >>
> > >> With these  settings :
> > >> //
> > >> newstore_max_dir_size = 4096
> > >> newstore_sync_io = true
> > >> newstore_sync_transaction = true
> > >> newstore_sync_submit_transaction = true
> > >> newstore_sync_wal_apply = true
> > >> newstore_overlay_max = 0
> > >> //
> > >>
> > >> And direct IO in the benchmark tool (fio)
> > >>
> > >> I see that the HDD is 100% charged and there are notransfer of /db to
> > >> /fragments after stopping benchmark : Great !
> > >>
> > >> But when i launch a bench with random blocs of 256k, i see random blocs
> > >> between 32k and 256k on HDD. Any idea ?
> > >>
> > >> Debits to the HDD are about 8MBps when they could be higher with larger
> > >> blocs (~30MBps)
> > >> And 70 MBps without fsync (hard drive cache disabled).
> > >>
> > >> Other questions :
> > >> newstore_sync_io -> true = fsync immediatly, false = fsync later (Thread
> > >> fsync_wq) ?
> > >> newstore_sync_transaction -> true = sync in DB ?
> > >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> > >> newstore_sync_transaction=false) ?
> > >> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > >>
> > >> Is it true ?
> > >>
> > >> Way for cache with battery (sync DB and no sync data) ?
> > >>
> > >> Thanks for everything !
> > >>
> > >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >>> On Mon, 12 Oct 2015, David Casier wrote:
> > >>>> Hello everybody,
> > >>>> fragment is stored in rocksdb before being written to "/fragments" ?
> > >>>> I separed "/db" and "/fragments" but during the bench, everything is writing
> > >>>> to "/db"
> > >>>> I changed options "newstore_sync_*" without success.
> > >>>>
> > >>>> Is there any way to write all metadata in "/db" and all data in "/fragments" ?
> > >>> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >>> But if you are overwriting an existing object, doing write-ahead logging
> > >>> is usually unavoidable because we need to make the update atomic (and the
> > >>> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > >>> mitigates this somewhat for larger writes by limiting fragment size, but
> > >>> for small IOs this is pretty much always going to be the case.  For small
> > >>> IOs, though, putting things in db/ is generally better since we can
> > >>> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >>> often leave them there (via the 'overlay' behavior).
> > >>>
> > >>> sage
> > >>>
> > >>
> > >
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Cordialement,
> > >
> > > David CASIER
> > >
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Ligne directe: 01 75 98 53 85
> > > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > > ________________________________________________________
> > > Début du message réexpédié :
> > >
> > > De: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@ssi.samsung.com>
> > > Date: 29 octobre 2015 17:30:56 UTC+1
> > > À: David Casier <david.casier@aevoo.fr>
> > > Cc: benoit LORIOT <benoit.loriot@aevoo.fr>, Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>
> > > Objet: RE: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Thanks David for the reply.
> > >
> > > Yeah We just wanted to know how different is it from Filestore and how do we contribute for this? My motive is to first understand the design of Newstore and get the Performance loopholes so that we can try looking into it.
> > >
> > > It would be helpful if you can share what is your idea from your side to use Newstore and configuration? What plans you are having for contributions to help us understand and see if we can work together.
> > >
> > > Thanks,
> > > -Vish
> > >   <>
> > > From: David Casier [mailto:david.casier@aevoo.fr]
> > > Sent: Thursday, October 29, 2015 4:41 AM
> > > To: Vish (Vishwanath) Maram-SSI
> > > Cc: benoit LORIOT; Sébastien VALSEMEY
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi Vish,
> > > It's OK.
> > >
> > > We have a lot of different configuration with newstore tests.
> > >
> > > What is your goal with ?
> > >
> > > On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > > Hi David,
> > >
> > > Sorry for sending you the mail directly.
> > >
> > > This is Vishwanath Maram from Samsung and started to play around with Newstore and observing some issues with running FIO.
> > >
> > > Can you please share your Ceph Configuration file which you have used to run the IO's using FIO?
> > >
> > > Thanks,
> > > -Vish
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org <mailto:ceph-devel-owner@vger.kernel.org> [mailto:ceph-devel-owner@vger.kernel.org <mailto:ceph-devel-owner@vger.kernel.org>] On Behalf Of David Casier
> > > Sent: Monday, October 12, 2015 11:52 AM
> > > To: Sage Weil; Ceph Development
> > > Cc: Sébastien VALSEMEY; benoit.loriot@aevoo.fr <mailto:benoit.loriot@aevoo.fr>; Denis Saget; luc.petetin
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Ok,
> > > Great.
> > >
> > > With these  settings :
> > > //
> > > newstore_max_dir_size = 4096
> > > newstore_sync_io = true
> > > newstore_sync_transaction = true
> > > newstore_sync_submit_transaction = true
> > > newstore_sync_wal_apply = true
> > > newstore_overlay_max = 0
> > > //
> > >
> > > And direct IO in the benchmark tool (fio)
> > >
> > > I see that the HDD is 100% charged and there are notransfer of /db to
> > > /fragments after stopping benchmark : Great !
> > >
> > > But when i launch a bench with random blocs of 256k, i see random blocs
> > > between 32k and 256k on HDD. Any idea ?
> > >
> > > Debits to the HDD are about 8MBps when they could be higher with larger
> > > blocs (~30MBps)
> > > And 70 MBps without fsync (hard drive cache disabled).
> > >
> > > Other questions :
> > > newstore_sync_io -> true = fsync immediatly, false = fsync later (Thread
> > > fsync_wq) ?
> > > newstore_sync_transaction -> true = sync in DB ?
> > > newstore_sync_submit_transaction -> if false then kv_queue (only if
> > > newstore_sync_transaction=false) ?
> > > newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > >
> > > Is it true ?
> > >
> > > Way for cache with battery (sync DB and no sync data) ?
> > >
> > > Thanks for everything !
> > >
> > > On 10/12/2015 03:01 PM, Sage Weil wrote:
> > > On Mon, 12 Oct 2015, David Casier wrote:
> > > Hello everybody,
> > > fragment is stored in rocksdb before being written to "/fragments" ?
> > > I separed "/db" and "/fragments" but during the bench, everything is writing
> > > to "/db"
> > > I changed options "newstore_sync_*" without success.
> > >
> > > Is there any way to write all metadata in "/db" and all data in "/fragments" ?
> > > You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > > But if you are overwriting an existing object, doing write-ahead logging
> > > is usually unavoidable because we need to make the update atomic (and the
> > > underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > > mitigates this somewhat for larger writes by limiting fragment size, but
> > > for small IOs this is pretty much always going to be the case.  For small
> > > IOs, though, putting things in db/ is generally better since we can
> > > combine many small ios into a single (rocksdb) journal/wal write.  And
> > > often leave them there (via the 'overlay' behavior).
> > >
> > > sage
> > >
> > >
> > >
> > >
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Cordialement,
> > >
> > > David CASIER
> > >
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Ligne directe: 01 75 98 53 85
> > > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > > ________________________________________________________
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.casier@aevoo.fr>
> > > Date: 14 October 2015 22:03:38 UTC+2
> > > To: Sébastien VALSEMEY <sebastien.valsemey@aevoo.fr>, benoit.loriot@aevoo.fr
> > > Cc: Denis Saget <geodni@gmail.com>, "luc.petetin" <luc.petetin@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Good evening gentlemen,
> > > I have just been through the first real Ceph fire.
> > > Loic Dachary backed me up well on this one.
> > >
> > > I can tell you one thing: however well we think we master the product, it is during an incident that we realise how many factors we need to know by heart.
> > > So, no panic; I really take this evening's experience as a success and as an excellent boost.
> > >
> > > Explanation:
> > >  - LI played a little too much with the crushmap (I will go into the fine technical details another day)
> > >  - Update and restart of the OSDs
> > >  - The OSDs no longer knew where the data was
> > >  - Rebuilt the crushmap by hand and off it went.
> > >
> > > Nothing too serious in itself, and a big plus (++++) for our image with LI (I would have lost another 1 to 2 hours without Loic, who did not get side-tracked)
> > >
> > > Conclusion:
> > > We are going to work together on stress tests, a bit like RedHat validations: one platform, I break it, you repair it.
> > > You will have as much time as you need to find the problem (I have sometimes spent a few days on certain things).
> > >
> > > Objectives:
> > >  - Master a checklist of verifications to run
> > >  - Replay it every week if there are many mistakes
> > >  - Every month if there are a few mistakes
> > >  - Every 3 months once it is well mastered
> > >  - ...
> > >
> > > We need to be on top of this, and certain things must become reflexes (checking the crushmap, knowing how to find the data without the processes, ...).
> > > Above all, the customer must be reassured in the event of an incident (or not).
> > >
> > > And frankly, Ceph really is fascinating!
> > >
> > >   On 10/12/2015 09:33 PM, Sage Weil wrote:
> > >> Hi David-
> > >>
> > >> On Mon, 12 Oct 2015, David Casier wrote:
> > >>> Ok,
> > >>> Great.
> > >>>
> > >>> With these  settings :
> > >>> //
> > >>> newstore_max_dir_size = 4096
> > >>> newstore_sync_io = true
> > >>> newstore_sync_transaction = true
> > >>> newstore_sync_submit_transaction = true
> > >> Is this a hard disk?  Those settings probably don't make sense since it
> > >> does every IO synchronously, blocking the submitting IO path...
> > >>
> > >>> newstore_sync_wal_apply = true
> > >>> newstore_overlay_max = 0
> > >>> //
> > >>>
> > >>> And direct IO in the benchmark tool (fio)
> > >>>
> > >>> I see that the HDD is 100% charged and there are notransfer of /db to
> > >>> /fragments after stopping benchmark : Great !
> > >>>
> > >>> But when i launch a bench with random blocs of 256k, i see random blocs
> > >>> between 32k and 256k on HDD. Any idea ?
> > >> Random IOs have to be write ahead logged in rocksdb, which has its own IO
> > >> pattern.  Since you made everything sync above I think it'll depend on
> > >> how many osd threads get batched together at a time.. maybe.  Those
> > >> settings aren't something I've really tested, and probably only make
> > >> sense with very fast NVMe devices.
> > >>
> > >>> Debits to the HDD are about 8MBps when they could be higher with larger blocs> (~30MBps)
> > >>> And 70 MBps without fsync (hard drive cache disabled).
> > >>>
> > >>> Other questions :
> > >>> newstore_sync_io -> true = fsync immediatly, false = fsync later (Thread
> > >>> fsync_wq) ?
> > >> yes
> > >>
> > >>> newstore_sync_transaction -> true = sync in DB ?
> > >> synchronously do the rocksdb commit too
> > >>
> > >>> newstore_sync_submit_transaction -> if false then kv_queue (only if
> > >>> newstore_sync_transaction=false) ?
> > >> yeah.. there is an annoying rocksdb behavior that makes an async
> > >> transaction submit block if a sync one is in progress, so this queues them
> > >> up and explicitly batches them.
> > >>
> > >>> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > >> the txn commit completion threads can do the wal work synchronously.. this
> > >> is only a good idea if it's doing aio (which it generally is).
> > >>
> > >>> Is it true ?
> > >>>
> > >>> Way for cache with battery (sync DB and no sync data) ?
> > >> ?
> > >> s
> > >>
> > >>> Thanks for everything !
> > >>>
> > >>> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >>>> On Mon, 12 Oct 2015, David Casier wrote:
> > >>>>> Hello everybody,
> > >>>>> fragment is stored in rocksdb before being written to "/fragments" ?
> > >>>>> I separed "/db" and "/fragments" but during the bench, everything is
> > >>>>> writing
> > >>>>> to "/db"
> > >>>>> I changed options "newstore_sync_*" without success.
> > >>>>>
> > >>>>> Is there any way to write all metadata in "/db" and all data in
> > >>>>> "/fragments" ?
> > >>>> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >>>> But if you are overwriting an existing object, doing write-ahead logging
> > >>>> is usually unavoidable because we need to make the update atomic (and the
> > >>>> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > >>>> mitigates this somewhat for larger writes by limiting fragment size, but
> > >>>> for small IOs this is pretty much always going to be the case.  For small
> > >>>> IOs, though, putting things in db/ is generally better since we can
> > >>>> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >>>> often leave them there (via the 'overlay' behavior).
> > >>>>
> > >>>> sage
> > >>>>
> > >>>
> > >>> --
> > >>> ________________________________________________________
> > >>>
> > >>> Cordialement,
> > >>>
> > >>> *David CASIER
> > >>> DCConsulting SARL
> > >>>
> > >>>
> > >>> 4 Trait d'Union
> > >>> 77127 LIEUSAINT
> > >>>
> > >>> **Ligne directe: _01 75 98 53 85_
> > >>> Email: _david.casier@aevoo.fr_
> > >>> * ________________________________________________________
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > >>> the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
> > >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
> > >>>
> > >>>
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > >> the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
> > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
> > >
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Cordialement,
> > >
> > > David CASIER
> > > DCConsulting SARL
> > >
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Ligne directe: 01 75 98 53 85
> > > Email: david.casier@aevoo.fr <mailto:david.casier@aevoo.fr>
> > > ________________________________________________________
> >
> >




-- 

________________________________________________________

Cordialement,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Ligne directe: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________



-- 

________________________________________________________

Cordialement,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Ligne directe: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2015-12-01 20:34     ` Fwd: " David Casier
@ 2015-12-01 22:02       ` Sage Weil
  2015-12-04 20:12         ` Ric Wheeler
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2015-12-01 22:02 UTC (permalink / raw)
  To: David Casier; +Cc: Ceph Development

Hi David,

On Tue, 1 Dec 2015, David Casier wrote:
> Hi Sage,
> With a standard disk (4 to 6 TB), and a small flash drive, it's easy
> to create an ext4 FS with metadata on flash
> 
> Example with sdg1 on flash and sdb on hdd :
> 
> size_of() {
>   blockdev --getsize $1
> }
> 
> mkdmsetup() {
>   _ssd=/dev/$1
>   _hdd=/dev/$2
>   _size_of_ssd=$(size_of $_ssd)
>   echo """0 $_size_of_ssd linear $_ssd 0
>   $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
> }
> 
> mkdmsetup sdg1 sdb
> 
> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
> -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
> $((1024*512)) /dev/mapper/dm-sdg1-sdb
> 
> With that, all meta_blocks are on the SSD
> 
> If the omaps are on the SSD, there is almost no metadata on the HDD
>
> Consequence: Ceph performance (with a hacked filestore, without journal and
> with direct IO) is almost the same as the performance of the HDD.
> 
> With cache-tier, it's very cool !

Cool!  I know XFS lets you do that with the journal, but I'm not sure if 
you can push the fs metadata onto a different device too.. I'm guessing 
not?

> That is why we are working on a hybrid approach HDD / Flash on ARM or Intel
> 
> With newstore, it's much more difficult to control the I/O profile,
> because RocksDB embeds its own intelligence.

This is coincidentally what I've been working on today.  So far I've just 
added the ability to put the rocksdb WAL on a second device, but it's 
super easy to push rocksdb data there as well (and have it spill over onto 
the larger, slower device if it fills up).  Or to put the rocksdb WAL on a 
third device (e.g., expensive NVMe or NVRAM).

See this ticket for the ceph-disk tooling that's needed:

	http://tracker.ceph.com/issues/13942

I expect this will be more flexible and perform better than the ext4 
metadata option, but we'll need to test on your hardware to confirm!
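
For the sake of discussion, a rough sketch of how this could look from the
deployment side once that tooling exists (the --block.db / --block.wal flag
names and the devices below are assumptions for illustration, not the final
interface tracked in the ticket):

# OSD data on the big HDD, rocksdb data (with spill-over) on an SSD
# partition, and the rocksdb WAL on an even faster NVMe/NVRAM partition
ceph-disk prepare --bluestore /dev/sdb \
        --block.db /dev/sdc1 \
        --block.wal /dev/nvme0n1p1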

sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2015-12-01 22:02       ` Sage Weil
@ 2015-12-04 20:12         ` Ric Wheeler
  2015-12-04 20:20           ` Eric Sandeen
  2015-12-08  4:46           ` Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: Ric Wheeler @ 2015-12-04 20:12 UTC (permalink / raw)
  To: Sage Weil, David Casier
  Cc: Ceph Development, Dave Chinner, Brian Foster, Eric Sandeen

On 12/01/2015 05:02 PM, Sage Weil wrote:
> Hi David,
>
> On Tue, 1 Dec 2015, David Casier wrote:
>> Hi Sage,
>> With a standard disk (4 to 6 TB), and a small flash drive, it's easy
>> to create an ext4 FS with metadata on flash
>>
>> Example with sdg1 on flash and sdb on hdd :
>>
>> size_of() {
>>    blockdev --getsize $1
>> }
>>
>> mkdmsetup() {
>>    _ssd=/dev/$1
>>    _hdd=/dev/$2
>>    _size_of_ssd=$(size_of $_ssd)
>>    echo """0 $_size_of_ssd linear $_ssd 0
>>    $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>> }
>>
>> mkdmsetup sdg1 sdb
>>
>> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
>> -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
>> $((1024*512)) /dev/mapper/dm-sdg1-sdb
>>
>> With that, all meta_blocks are on the SSD
>>
>> If omap are on SSD, there are almost no metadata on HDD
>>
>> Consequence : performance Ceph (with hack on filestore without journal
>> and directIO) are almost same that performance of the HDD.
>>
>> With cache-tier, it's very cool !
> Cool!  I know XFS lets you do that with the journal, but I'm not sure if
> you can push the fs metadata onto a different device too.. I'm guessing
> not?
>
>> That is why we are working on a hybrid approach HDD / Flash on ARM or Intel
>>
>> With newstore, it's much more difficult to control the I/O profil.
>> Because rocksDB embedded its own intelligence
> This is coincidentally what I've been working on today.  So far I've just
> added the ability to put the rocksdb WAL on a second device, but it's
> super easy to push rocksdb data there as well (and have it spill over onto
> the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
> third device (e.g., expensive NVMe or NVRAM).
>
> See this ticket for the ceph-disk tooling that's needed:
>
> 	http://tracker.ceph.com/issues/13942
>
> I expect this will be more flexible and perform better than the ext4
> metadata option, but we'll need to test on your hardware to confirm!
>
> sage

I think that XFS "realtime" subvolumes are the thing that does this -  the 
second volume contains only the data (no metadata).

Seem to recall that it is popular historically with video appliances, etc but it 
is not commonly used.

Some of the XFS crew cc'ed above would have more information on this,

Ric



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2015-12-04 20:12         ` Ric Wheeler
@ 2015-12-04 20:20           ` Eric Sandeen
  2015-12-08  4:46           ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Eric Sandeen @ 2015-12-04 20:20 UTC (permalink / raw)
  To: Ric Wheeler, Sage Weil, David Casier
  Cc: Ceph Development, Dave Chinner, Brian Foster

On 12/4/15 2:12 PM, Ric Wheeler wrote:
> On 12/01/2015 05:02 PM, Sage Weil wrote:
>> Hi David,
>>
>> On Tue, 1 Dec 2015, David Casier wrote:
>>> Hi Sage,
>>> With a standard disk (4 to 6 TB), and a small flash drive, it's easy
>>> to create an ext4 FS with metadata on flash
>>>
>>> Example with sdg1 on flash and sdb on hdd :
>>>
>>> size_of() {
>>>    blockdev --getsize $1
>>> }
>>>
>>> mkdmsetup() {
>>>    _ssd=/dev/$1
>>>    _hdd=/dev/$2
>>>    _size_of_ssd=$(size_of $_ssd)
>>>    echo """0 $_size_of_ssd linear $_ssd 0
>>>    $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>>> }
>>>
>>> mkdmsetup sdg1 sdb
>>>
>>> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
>>> -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
>>> $((1024*512)) /dev/mapper/dm-sdg1-sdb
>>>
>>> With that, all meta_blocks are on the SSD
>>>
>>> If omap are on SSD, there are almost no metadata on HDD
>>>
>>> Consequence : performance Ceph (with hack on filestore without journal
>>> and directIO) are almost same that performance of the HDD.
>>>
>>> With cache-tier, it's very cool !
>> Cool!  I know XFS lets you do that with the journal, but I'm not sure if
>> you can push the fs metadata onto a different device too.. I'm guessing
>> not?
>>
>>> That is why we are working on a hybrid approach HDD / Flash on ARM or Intel
>>>
>>> With newstore, it's much more difficult to control the I/O profil.
>>> Because rocksDB embedded its own intelligence
>> This is coincidentally what I've been working on today.  So far I've just
>> added the ability to put the rocksdb WAL on a second device, but it's
>> super easy to push rocksdb data there as well (and have it spill over onto
>> the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
>> third device (e.g., expensive NVMe or NVRAM).
>>
>> See this ticket for the ceph-disk tooling that's needed:
>>
>>     http://tracker.ceph.com/issues/13942
>>
>> I expect this will be more flexible and perform better than the ext4
>> metadata option, but we'll need to test on your hardware to confirm!
>>
>> sage
> 
> I think that XFS "realtime" subvolumes are the thing that does this -  the second volume contains only the data (no metadata).
> 
> Seem to recall that it is popular historically with video appliances, etc but it is not commonly used.
> 
> Some of the XFS crew cc'ed above would have more information on this,

The realtime subvolume puts all data on a separate volume, and uses a different
allocator; it is more for streaming type applications, in general.  And it's
not enabled in RHEL - and not heavily tested at this point, I think.

-Eric

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2015-12-04 20:12         ` Ric Wheeler
  2015-12-04 20:20           ` Eric Sandeen
@ 2015-12-08  4:46           ` Dave Chinner
  2016-02-15 15:18             ` David Casier
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2015-12-08  4:46 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Sage Weil, David Casier, Ceph Development, Brian Foster, Eric Sandeen

On Fri, Dec 04, 2015 at 03:12:25PM -0500, Ric Wheeler wrote:
> On 12/01/2015 05:02 PM, Sage Weil wrote:
> >Hi David,
> >
> >On Tue, 1 Dec 2015, David Casier wrote:
> >>Hi Sage,
> >>With a standard disk (4 to 6 TB), and a small flash drive, it's easy
> >>to create an ext4 FS with metadata on flash
> >>
> >>Example with sdg1 on flash and sdb on hdd :
> >>
> >>size_of() {
> >>   blockdev --getsize $1
> >>}
> >>
> >>mkdmsetup() {
> >>   _ssd=/dev/$1
> >>   _hdd=/dev/$2
> >>   _size_of_ssd=$(size_of $_ssd)
> >>   echo """0 $_size_of_ssd linear $_ssd 0
> >>   $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
> >>}

So this is just a linear concatenation that relies on ext4 putting
all its metadata at the front of the filesystem?

> >>
> >>mkdmsetup sdg1 sdb
> >>
> >>mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
> >>-E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
> >>$((1024*512)) /dev/mapper/dm-sdg1-sdb
> >>
> >>With that, all meta_blocks are on the SSD

IIRC, it's the "packed_meta_blocks=1" that does this.

This is something that is pretty trivial to do with XFS, too,
by use of the inode32 allocation mechanism. That reserves the
first TB of space for inodes and other metadata allocations,
so if you span the first TB with SSDs, you get almost all the
metadata on the SSDs, and all the data in the higher AGs. With the
undocumented log location mkfs option, you can also put the log at
the start of AG 0, which means that it would sit on the SSD, too,
without needing an external log device.

SGI even had a mount option hack to limit this allocator behaviour
to a block limit lower than 1TB so they could limit the metadata AG
regions to, say, the first 200GB.
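
As a concrete (untested) sketch of that idea, reusing the dm-sdg1-sdb linear
concat from above with the SSD mapped first (the device names and the mount
point are just assumptions carried over from the earlier example):

# inode32 keeps inodes and most other metadata allocations in the low AGs,
# i.e. on the SSD-backed front of the concat; the filesystem has to be
# large enough (multi-TB) for inode32 to change the allocator behaviour
mkfs.xfs -f /dev/mapper/dm-sdg1-sdb
mount -o inode32 /dev/mapper/dm-sdg1-sdb /var/lib/ceph/osd/ceph-0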

> >This is coincidentally what I've been working on today.  So far I've just
> >added the ability to put the rocksdb WAL on a second device, but it's
> >super easy to push rocksdb data there as well (and have it spill over onto
> >the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
> >third device (e.g., expensive NVMe or NVRAM).

I have old bits and pieces from 7-8 years ago that would allow some
application control of allocation policy to allow things like this
to be done, but I left SGI before it was anything more than just a
proof of concept....

> >See this ticket for the ceph-disk tooling that's needed:
> >
> >	http://tracker.ceph.com/issues/13942
> >
> >I expect this will be more flexible and perform better than the ext4
> >metadata option, but we'll need to test on your hardware to confirm!
> >
> >sage
> 
> I think that XFS "realtime" subvolumes are the thing that does this
> -  the second volume contains only the data (no metadata).
> 
> Seem to recall that it is popular historically with video
> appliances, etc but it is not commonly used.

Because it's a single threaded allocator. It's not suited to highly
concurrent applications, just applications that require large
extents allocated in a deterministic manner.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2015-12-08  4:46           ` Dave Chinner
@ 2016-02-15 15:18             ` David Casier
  2016-02-15 16:21               ` Eric Sandeen
  2016-02-16  3:35               ` Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: David Casier @ 2016-02-15 15:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster, Eric Sandeen

Hi Dave,
1 TB is very large for an SSD.
Example with only 10 GiB:
https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
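
For what it's worth, a rough way to check how much of the metadata traffic
actually lands on the SSD leg of such a setup (a sketch only; the device
names and the OSD path are assumptions taken from the earlier example):

# watch both legs of the concat while running a metadata-heavy workload
iostat -x -p sdg,sdb 1 &
find /var/lib/ceph/osd/ceph-0/current -type f | wc -l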

2015-12-08 5:46 GMT+01:00 Dave Chinner <dchinner@redhat.com>:
> On Fri, Dec 04, 2015 at 03:12:25PM -0500, Ric Wheeler wrote:
>> On 12/01/2015 05:02 PM, Sage Weil wrote:
>> >Hi David,
>> >
>> >On Tue, 1 Dec 2015, David Casier wrote:
>> >>Hi Sage,
>> >>With a standard disk (4 to 6 TB), and a small flash drive, it's easy
>> >>to create an ext4 FS with metadata on flash
>> >>
>> >>Example with sdg1 on flash and sdb on hdd :
>> >>
>> >>size_of() {
>> >>   blockdev --getsize $1
>> >>}
>> >>
>> >>mkdmsetup() {
>> >>   _ssd=/dev/$1
>> >>   _hdd=/dev/$2
>> >>   _size_of_ssd=$(size_of $_ssd)
>> >>   echo """0 $_size_of_ssd linear $_ssd 0
>> >>   $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>> >>}
>
> So this is just a linear concatenation that relies on ext4 putting
> all it's metadata at the front of the filesystem?
>
>> >>
>> >>mkdmsetup sdg1 sdb
>> >>
>> >>mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
>> >>-E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
>> >>$((1024*512)) /dev/mapper/dm-sdg1-sdb
>> >>
>> >>With that, all meta_blocks are on the SSD
>
> IIRC, it's the "packed_meta_blocks=1" that does this.
>
> THis is something that is pretty trivial to do with XFS, too,
> by use of the inode32 allocation mechanism. That reserves the
> first TB of space for inodes and other metadata allocations,
> so if you span the first TB with SSDs, you get almost all the
> metadata on the SSDs, and all the data in the higher AGs. With the
> undocumented log location mkfs option, you can also put hte log at
> the start og AG 0 which means that would sit on the SSD, too,
> without needing an external log device.
>
> SGI even had a mount option hack to limit this allocator behaviour
> to a block limit lower than 1TB so they could limit the metadata AG
> regions to, say, the first 200GB.
>
>> >This is coincidentally what I've been working on today.  So far I've just
>> >added the ability to put the rocksdb WAL on a second device, but it's
>> >super easy to push rocksdb data there as well (and have it spill over onto
>> >the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
>> >third device (e.g., expensive NVMe or NVRAM).
>
> I have old bits and pieces from 7-8 years ago that would allow some
> application control of allocation policy to allow things like this
> to be done, but I left SGI before it was anything mor ethan just a
> proof of concept....
>
>> >See this ticket for the ceph-disk tooling that's needed:
>> >
>> >     http://tracker.ceph.com/issues/13942
>> >
>> >I expect this will be more flexible and perform better than the ext4
>> >metadata option, but we'll need to test on your hardware to confirm!
>> >
>> >sage
>>
>> I think that XFS "realtime" subvolumes are the thing that does this
>> -  the second volume contains only the data (no metadata).
>>
>> Seem to recall that it is popular historically with video
>> appliances, etc but it is not commonly used.
>
> Because it's a single threaded allocator. It's not suited to highly
> concurrent applications, just applications that require large
> extents allocated in a deterministic manner.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> dchinner@redhat.com



-- 

________________________________________________________

Cordialement,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Ligne directe: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-15 15:18             ` David Casier
@ 2016-02-15 16:21               ` Eric Sandeen
  2016-02-16  3:35               ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Eric Sandeen @ 2016-02-15 16:21 UTC (permalink / raw)
  To: David Casier, Dave Chinner
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster

On 2/15/16 9:18 AM, David Casier wrote:
> Hi Dave,
> 1TB is very wide for SSD.
> Exemple with only 10GiB :
> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/

It wouldn't be too hard to modify the inode32 restriction to a lower
threshold, I think, if it would really be useful.

On the other hand, 10GiB seems awfully small.  What are realistic
sizes for this usecase?

-Eric

 
> 2015-12-08 5:46 GMT+01:00 Dave Chinner <dchinner@redhat.com>:
>> On Fri, Dec 04, 2015 at 03:12:25PM -0500, Ric Wheeler wrote:
>>> On 12/01/2015 05:02 PM, Sage Weil wrote:
>>>> Hi David,
>>>>
>>>> On Tue, 1 Dec 2015, David Casier wrote:
>>>>> Hi Sage,
>>>>> With a standard disk (4 to 6 TB), and a small flash drive, it's easy
>>>>> to create an ext4 FS with metadata on flash
>>>>>
>>>>> Example with sdg1 on flash and sdb on hdd :
>>>>>
>>>>> size_of() {
>>>>>   blockdev --getsize $1
>>>>> }
>>>>>
>>>>> mkdmsetup() {
>>>>>   _ssd=/dev/$1
>>>>>   _hdd=/dev/$2
>>>>>   _size_of_ssd=$(size_of $_ssd)
>>>>>   echo """0 $_size_of_ssd linear $_ssd 0
>>>>>   $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>>>>> }
>>
>> So this is just a linear concatenation that relies on ext4 putting
>> all it's metadata at the front of the filesystem?
>>
>>>>>
>>>>> mkdmsetup sdg1 sdb
>>>>>
>>>>> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
>>>>> -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
>>>>> $((1024*512)) /dev/mapper/dm-sdg1-sdb
>>>>>
>>>>> With that, all meta_blocks are on the SSD
>>
>> IIRC, it's the "packed_meta_blocks=1" that does this.
>>
>> THis is something that is pretty trivial to do with XFS, too,
>> by use of the inode32 allocation mechanism. That reserves the
>> first TB of space for inodes and other metadata allocations,
>> so if you span the first TB with SSDs, you get almost all the
>> metadata on the SSDs, and all the data in the higher AGs. With the
>> undocumented log location mkfs option, you can also put hte log at
>> the start og AG 0 which means that would sit on the SSD, too,
>> without needing an external log device.
>>
>> SGI even had a mount option hack to limit this allocator behaviour
>> to a block limit lower than 1TB so they could limit the metadata AG
>> regions to, say, the first 200GB.
>>
>>>> This is coincidentally what I've been working on today.  So far I've just
>>>> added the ability to put the rocksdb WAL on a second device, but it's
>>>> super easy to push rocksdb data there as well (and have it spill over onto
>>>> the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
>>>> third device (e.g., expensive NVMe or NVRAM).
>>
>> I have old bits and pieces from 7-8 years ago that would allow some
>> application control of allocation policy to allow things like this
>> to be done, but I left SGI before it was anything mor ethan just a
>> proof of concept....
>>
>>>> See this ticket for the ceph-disk tooling that's needed:
>>>>
>>>>     http://tracker.ceph.com/issues/13942
>>>>
>>>> I expect this will be more flexible and perform better than the ext4
>>>> metadata option, but we'll need to test on your hardware to confirm!
>>>>
>>>> sage
>>>
>>> I think that XFS "realtime" subvolumes are the thing that does this
>>> -  the second volume contains only the data (no metadata).
>>>
>>> Seem to recall that it is popular historically with video
>>> appliances, etc but it is not commonly used.
>>
>> Because it's a single threaded allocator. It's not suited to highly
>> concurrent applications, just applications that require large
>> extents allocated in a deterministic manner.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com
> 
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-15 15:18             ` David Casier
  2016-02-15 16:21               ` Eric Sandeen
@ 2016-02-16  3:35               ` Dave Chinner
  2016-02-16  8:14                 ` David Casier
                                   ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Dave Chinner @ 2016-02-16  3:35 UTC (permalink / raw)
  To: David Casier
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster, Eric Sandeen

On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
> Hi Dave,
> 1TB is very wide for SSD.

It fills from the bottom, so you don't need 1TB to make it work
in a similar manner to the ext4 hack being described.

> Exemple with only 10GiB :
> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/

It's a nice toy, but it's not something that is going to scale reliably
for production.  That caveat at the end:

	"With this model, filestore rearrange the tree very
	frequently : + 40 I/O every 32 objects link/unlink."

Indicates how bad the IO patterns will be when modifying the
directory structure, and says to me that it's not a useful
optimisation at all when you might be creating several thousand
files/s on a filesystem. That will end up IO bound, SSD or not.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-16  3:35               ` Dave Chinner
@ 2016-02-16  8:14                 ` David Casier
  2016-02-16  8:39                   ` David Casier
  2016-02-18 17:54                 ` David Casier
  2016-02-19 17:06                 ` Eric Sandeen
  2 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2016-02-16  8:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster,
	Eric Sandeen, Benoît LORIOT

Hi,
Are all inodes, xattrs and extent metadata stored at the beginning of the disk
with XFS inode32?

2016-02-16 4:35 GMT+01:00 Dave Chinner <dchinner@redhat.com>:
> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>> Hi Dave,
>> 1TB is very wide for SSD.
>
> It fills from the bottom, so you don't need 1TB to make it work
> in a similar manner to the ext4 hack being described.
>
>> Exemple with only 10GiB :
>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>
> It's a nice toy, but it's not something that is going scale reliably
> for production.  That caveat at the end:
>
>         "With this model, filestore rearrange the tree very
>         frequently : + 40 I/O every 32 objects link/unlink."
>
> Indicates how bad the IO patterns will be when modifying the
> directory structure, and says to me that it's not a useful
> optimisation at all when you might be creating several thousand
> files/s on a filesystem. That will end up IO bound, SSD or not.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> dchinner@redhat.com



-- 

________________________________________________________

Cordialement,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Ligne directe: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-16  8:14                 ` David Casier
@ 2016-02-16  8:39                   ` David Casier
  2016-02-19  5:26                     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2016-02-16  8:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster,
	Eric Sandeen, Benoît LORIOT

        "With this model, filestore rearrange the tree very
        frequently : + 40 I/O every 32 objects link/unlink."
That is a consequence of the parameters:
filestore_merge_threshold = 2
filestore_split_multiple = 1

not of the ext4 customization.

The large number of objects in FileStore requires indirect access and
more IOPS for every directory.

If the root of the inode B+tree is a single block, we have the same problem with XFS

2016-02-16 9:14 GMT+01:00 David Casier <david.casier@aevoo.fr>:
> Hi,
> All inodes, xattrs and extent are stored at the beginning of the disk
> with inode32 XFS ?
>
> 2016-02-16 4:35 GMT+01:00 Dave Chinner <dchinner@redhat.com>:
>> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>>> Hi Dave,
>>> 1TB is very wide for SSD.
>>
>> It fills from the bottom, so you don't need 1TB to make it work
>> in a similar manner to the ext4 hack being described.
>>
>>> Exemple with only 10GiB :
>>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>>
>> It's a nice toy, but it's not something that is going scale reliably
>> for production.  That caveat at the end:
>>
>>         "With this model, filestore rearrange the tree very
>>         frequently : + 40 I/O every 32 objects link/unlink."
>>
>> Indicates how bad the IO patterns will be when modifying the
>> directory structure, and says to me that it's not a useful
>> optimisation at all when you might be creating several thousand
>> files/s on a filesystem. That will end up IO bound, SSD or not.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com
>
>
>
> --
>
> ________________________________________________________
>
> Cordialement,
>
> David CASIER
>
>
> 3B Rue Taylor, CS20004
> 75481 PARIS Cedex 10 Paris
>
> Ligne directe: 01 75 98 53 85
> Email: david.casier@aevoo.fr
> ________________________________________________________



-- 

________________________________________________________

Cordialement,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Ligne directe: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-16  3:35               ` Dave Chinner
  2016-02-16  8:14                 ` David Casier
@ 2016-02-18 17:54                 ` David Casier
  2016-02-19 17:06                 ` Eric Sandeen
  2 siblings, 0 replies; 29+ messages in thread
From: David Casier @ 2016-02-18 17:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster, Eric Sandeen

You are right, Dave.
It is better with XFS: all inodes + attrs are stored at the start of the
drive with inode32.
~40 K IOPS with ext4 and only 10 K IOPS with XFS.

Good.

2016-02-16 4:35 GMT+01:00 Dave Chinner <dchinner@redhat.com>:
> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>> Hi Dave,
>> 1TB is very wide for SSD.
>
> It fills from the bottom, so you don't need 1TB to make it work
> in a similar manner to the ext4 hack being described.
>
>> Exemple with only 10GiB :
>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>
> It's a nice toy, but it's not something that is going scale reliably
> for production.  That caveat at the end:
>
>         "With this model, filestore rearrange the tree very
>         frequently : + 40 I/O every 32 objects link/unlink."
>
> Indicates how bad the IO patterns will be when modifying the
> directory structure, and says to me that it's not a useful
> optimisation at all when you might be creating several thousand
> files/s on a filesystem. That will end up IO bound, SSD or not.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> dchinner@redhat.com



-- 

________________________________________________________

Cordialement,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Ligne directe: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-16  8:39                   ` David Casier
@ 2016-02-19  5:26                     ` Dave Chinner
  2016-02-19 11:28                       ` Blair Bethwaite
  2016-02-22 12:01                       ` Sage Weil
  0 siblings, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2016-02-19  5:26 UTC (permalink / raw)
  To: David Casier
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster,
	Eric Sandeen, Benoît LORIOT

On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>         "With this model, filestore rearrange the tree very
>         frequently : + 40 I/O every 32 objects link/unlink."
> It is the consequence of parameters :
> filestore_merge_threshold = 2
> filestore_split_multiple = 1
> 
> Not of ext4 customization.

It's a function of the directory structure you are using to work
around the scalability deficiencies of the ext4 directory structure.
i.e. the root cause is that you are working around an ext4 problem.

> The large amount of objects in FileStore require indirect access and
> more IOPS for every directory.
> 
> If root of inode B+tree is a simple block, we have the same problem with XFS

Only if you use the same 32-entries per directory constraint. Get
rid of that constraint, start thinking about storing tens of
thousands of files per directory instead. i.e. let the directory
structure handle IO optimisation as the number of entries grow, not
impose artificial limits that prevent them from working efficiently.

Put simply, XFS is more efficient in terms of the average physical
IO per random inode lookup with shallow, wide directory structures
than it will be with a narrow, deep setup that is optimised to work
around the shortcomings of ext3/ext4.

When you use deep directory structures to index millions of files,
you have to assume that any random lookup will require directory
inode IO. When you use wide, shallow directories you can almost
guarantee that the directory inodes will remain cached in memory
because they are so frequently traversed. Hence we never need to do
IO for directory inodes in a wide, shallow config, and so that IO
can be ignored.

So let's assume, for ease of maths, we have 40 byte dirent
structures (~24 byte file names). That means a single 4k directory
block can index approximately 60-70 entries. More than this, and XFS
switches to a more scalable multi-block ("leaf", then "node") format.

When XFS moves to a multi-block structure, the first block of the
directory is converted to a name hash btree that allows finding any
directory entry in one further IO.  The hash index is made up of 8
byte entries, so for a 4k block it can index 500 entries in a single
IO.  IOWs, a random, cold cache lookup across 500 directory entries
can be done in 2 IOs.

Now lets add a second level to that hash btree - we have 500 hash
index leaf blocks that can be reached in 2 IOs, so now we can reach
25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
entries.

It should be noted that the length of the directory entries doesn't
affect this lookup scalability because the index is based on 4 byte
name hashes. Hence it has the same scalability characterisitics
regardless of the name lengths; it is only affect by changes in
directory block size.

If we consider your current "1 IO per directory" config using a 32
entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs and with
4 IOs it's 1 million entries. This is assuming we can fit 32 entries
in the inode core, which we should be able to do for the nodes of
the tree, but the leaves with the file entries are probably going to
have full object names and so are likely to be in block format. I've
ignored this and assume the leaf directories pointing to the objects
are also inline.

IOWs, by the time we get to needing 4 IOs to reach the file store
leaf directories (i.e. > ~30,000 files in the object store), a
single XFS directory is going to have the same or better IO efficiency
than your fixed configuration.

And we can make XFS even better - with an 8k directory block size, 2
IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
reach a billion entries.

So, in summary, the number of entries that can be indexed in a
given number of IOs:

IO count		1	2	3	4
32 entry wide		32	1k	32k	1m
4k dir block		70	500	25k	2.5m
8k dir block		150	1k	1m	1000m

And the number of directories required for a given number of
files if we limit XFS directories to 3 internal IOs:

file count		1k	10k	100k	1m	10m	100m
32 entry wide		32	320	3200	32k	320k	3.2m
4k dir block		1	1	5	50	500	5k
8k dir block		1	1	1	1	11	101

So, as you can see, once you make the directory structure shallow
and wide, you can reach many more entries in the same number of IOs
and there is much lower inode/dentry cache footprint when you do so.
IOWs, on XFS you design the hierarchy to provide the necessary
lookup/modification concurrency, as IO scalability as file counts
rise is already efficiently handled by the filesystem's directory
structure.

Doing this means the file store does not need to rebalance every 32
create/unlink operations. Nor do you need to be concerned about
maintaining a working set of directory inodes in cache under memory
pressure - the directory entries become the hottest items in the
cache and so will never get reclaimed.
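
To make that concrete, a sketch of the sort of setup this suggests (the
numbers below are illustrative assumptions only, not tested recommendations):

# 8k directory blocks so the per-directory hash index fans out further
mkfs.xfs -f -n size=8192 /dev/sdb1
# and let filestore keep far more objects per leaf directory before
# splitting, e.g. in ceph.conf under [osd]:
#   filestore_merge_threshold = 40
#   filestore_split_multiple = 8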

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-19  5:26                     ` Dave Chinner
@ 2016-02-19 11:28                       ` Blair Bethwaite
  2016-02-19 12:57                         ` Mark Nelson
  2016-02-22 12:01                       ` Sage Weil
  1 sibling, 1 reply; 29+ messages in thread
From: Blair Bethwaite @ 2016-02-19 11:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Casier, Ric Wheeler, Sage Weil, Ceph Development,
	Brian Foster, Eric Sandeen, Benoît LORIOT

Interesting observations Dave. Given XFS is Ceph's current production
standard it makes me wonder why the default filestore configs split
leaf directories at only 320 objects. We've seen first hand that it
doesn't take long before this starts hurting performance in a big way.
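
For reference, my understanding of the defaults involved (treat these as
assumptions to verify against your Ceph version):

#   filestore_merge_threshold = 10
#   filestore_split_multiple = 2
# a leaf directory splits at roughly
#   16 * filestore_split_multiple * filestore_merge_threshold = 320 objects,
# which is where the 320 figure comes from.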

Cheers,

On 19 February 2016 at 16:26, Dave Chinner <dchinner@redhat.com> wrote:
> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>>         "With this model, filestore rearrange the tree very
>>         frequently : + 40 I/O every 32 objects link/unlink."
>> It is the consequence of parameters :
>> filestore_merge_threshold = 2
>> filestore_split_multiple = 1
>>
>> Not of ext4 customization.
>
> It's a function of the directory structure you are using to work
> around the scalability deficiencies of the ext4 directory structure.
> i.e. the root cause is that you are working around an ext4 problem.
>
>> The large amount of objects in FileStore require indirect access and
>> more IOPS for every directory.
>>
>> If root of inode B+tree is a simple block, we have the same problem with XFS
>
> Only if you use the same 32-entries per directory constraint. Get
> rid of that constraint, start thinking about storing tens of
> thousands of files per directory instead. i.e. let the directory
> structure handle IO optimisation as the number of entries grow, not
> impose artificial limits that prevent them from working efficiently.
>
> Put simply, XFS is more efficient in terms of the average physical
> IO per random inode lookup with shallow, wide directory structures
> than it will be with a narrow, deep setup that is optimised to work
> around the shortcomings of ext3/ext4.
>
> When you use deep directory structures to inde millions of files,
> you have to assume that any random lookup will require directory
> inode IO. When you use wide, shallow directories you can almost
> guarantee that the directory inodes will remain cached in memory
> because the are so frequently traversed. hence we never need to do
> IO for directory inodes in a wide, shallow config, and so that IO
> can be ignored.
>
> So let's assume, for ease of maths, we have 40 byte dirent
> structures (~24 byte file names). That means a single 4k directory
> block can index aproximately 60-70 entries. More than this, and XFs
> switches to a more scalable multi-block ("leaf", then "node") format.
>
> When XFs moves to a multi-block structure, the first block of the
> directory is converted to a name hash btree that allows finding any
> directory entry in one further IO.  The hash index is made up of 8
> byte entries, so for a 4k block it can index 500 entries in a single
> IO.  IOWs, a random, cold cache lookup across 500 directory entries
> can be done in 2 IOs.
>
> Now lets add a second level to that hash btree - we have 500 hash
> index leaf blocks that can be reached in 2 IOs, so now we can reach
> 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
> entries.
>
> It should be noted that the length of the directory entries doesn't
> affect this lookup scalability because the index is based on 4 byte
> name hashes. Hence it has the same scalability characterisitics
> regardless of the name lengths; it is only affect by changes in
> directory block size.
>
> If we consider your current "1 IO per directory" config using a 32
> entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs and with
> 4 IOs it's 1 million entries. This is assuming we can fit 32 entries
> in the inode core, which we shoul dbe able to do for the nodes of
> the tree, but the leaves with the file entries are probably going to
> have full object names and so are likely to be in block format. I've
> ignored this and assume the leaf directories pointing to the objects
> are also inline.
>
> IOWs, by the time we get to needing 4 IOs to reach the file store
> leaf directories (i.e. > ~30,000 files in the object store), a
> single XFS directory is going to have the same or better IO efficiency
> than your configuration fixed confiugration.
>
> And we can make XFS even better - with an 8k directory block size, 2
> IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
> reach a billion entries.
>
> So, in summary, the number of entries that can be indexed in a
> given number of IOs:
>
> IO count                1       2       3       4
> 32 entry wide           32      1k      32k     1m
> 4k dir block            70      500     25k     2.5m
> 8k dir block            150     1k      1m      1000m
>
> And the number of directories required for a given number of
> files if we limit XFS directories to 3 internal IOs:
>
> file count              1k      10k     100k    1m      10m     100m
> 32 entry wide           32      320     3200    32k     320k    3.2g
> 4k dir block            1       1       5       50      500     5k
> 8k dir block            1       1       1       1       11      101
>
> So, as you can see, once you make the directory structure shallow
> and wide, you can reach many more entries in the same number of IOs
> and there is much lower inode/dentry cache footprint when you do so.
> IOWs, on XFS you design the heirachy to provide the necessary
> lookup/modification concurrency as IO scalibility as file counts
> rise is already efficeintly handled by the filesystem's directory
> structure.
>
> Doing this means the file store does not need to rebalance every 32
> create/unlink operations. Nor do you need to be concerned about
> maintaining a working set of directory inodes in cache under memory
> pressure - there directory entries become the hotest items in the
> cache and so will never get reclaimed.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> dchinner@redhat.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-19 11:28                       ` Blair Bethwaite
@ 2016-02-19 12:57                         ` Mark Nelson
  0 siblings, 0 replies; 29+ messages in thread
From: Mark Nelson @ 2016-02-19 12:57 UTC (permalink / raw)
  To: Blair Bethwaite, Dave Chinner
  Cc: David Casier, Ric Wheeler, Sage Weil, Ceph Development,
	Brian Foster, Eric Sandeen, Benoît LORIOT

There's a long standing bugzilla entry for this:

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

See Kefu and Sam's comments about scrubbing.  That's basically the only 
blocker AFAIK.

Mark

On 02/19/2016 05:28 AM, Blair Bethwaite wrote:
> Interesting observations Dave. Given XFS is Ceph's current production
> standard it makes me wonder why the default filestore configs split
> leaf directories at only 320 objects. We've seen first hand that it
> doesn't take long before this starts hurting performance in a big way.
>
> Cheers,
>
> On 19 February 2016 at 16:26, Dave Chinner <dchinner@redhat.com> wrote:
>> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>>>          "With this model, filestore rearrange the tree very
>>>          frequently : + 40 I/O every 32 objects link/unlink."
>>> It is the consequence of parameters :
>>> filestore_merge_threshold = 2
>>> filestore_split_multiple = 1
>>>
>>> Not of ext4 customization.
>>
>> It's a function of the directory structure you are using to work
>> around the scalability deficiencies of the ext4 directory structure.
>> i.e. the root cause is that you are working around an ext4 problem.
>>
>>> The large amount of objects in FileStore require indirect access and
>>> more IOPS for every directory.
>>>
>>> If root of inode B+tree is a simple block, we have the same problem with XFS
>>
>> Only if you use the same 32-entries per directory constraint. Get
>> rid of that constraint, start thinking about storing tens of
>> thousands of files per directory instead. i.e. let the directory
>> structure handle IO optimisation as the number of entries grow, not
>> impose artificial limits that prevent them from working efficiently.
>>
>> Put simply, XFS is more efficient in terms of the average physical
>> IO per random inode lookup with shallow, wide directory structures
>> than it will be with a narrow, deep setup that is optimised to work
>> around the shortcomings of ext3/ext4.
>>
>> When you use deep directory structures to inde millions of files,
>> you have to assume that any random lookup will require directory
>> inode IO. When you use wide, shallow directories you can almost
>> guarantee that the directory inodes will remain cached in memory
>> because the are so frequently traversed. hence we never need to do
>> IO for directory inodes in a wide, shallow config, and so that IO
>> can be ignored.
>>
>> So let's assume, for ease of maths, we have 40 byte dirent
>> structures (~24 byte file names). That means a single 4k directory
>> block can index aproximately 60-70 entries. More than this, and XFs
>> switches to a more scalable multi-block ("leaf", then "node") format.
>>
>> When XFs moves to a multi-block structure, the first block of the
>> directory is converted to a name hash btree that allows finding any
>> directory entry in one further IO.  The hash index is made up of 8
>> byte entries, so for a 4k block it can index 500 entries in a single
>> IO.  IOWs, a random, cold cache lookup across 500 directory entries
>> can be done in 2 IOs.
>>
>> Now lets add a second level to that hash btree - we have 500 hash
>> index leaf blocks that can be reached in 2 IOs, so now we can reach
>> 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
>> entries.
>>
>> It should be noted that the length of the directory entries doesn't
>> affect this lookup scalability because the index is based on 4 byte
>> name hashes. Hence it has the same scalability characterisitics
>> regardless of the name lengths; it is only affect by changes in
>> directory block size.
>>
>> If we consider your current "1 IO per directory" config using a 32
>> entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs and with
>> 4 IOs it's 1 million entries. This is assuming we can fit 32 entries
>> in the inode core, which we shoul dbe able to do for the nodes of
>> the tree, but the leaves with the file entries are probably going to
>> have full object names and so are likely to be in block format. I've
>> ignored this and assume the leaf directories pointing to the objects
>> are also inline.
>>
>> IOWs, by the time we get to needing 4 IOs to reach the file store
>> leaf directories (i.e. > ~30,000 files in the object store), a
>> single XFS directory is going to have the same or better IO efficiency
>> than your configuration fixed confiugration.
>>
>> And we can make XFS even better - with an 8k directory block size, 2
>> IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
>> reach a billion entries.
>>
>> So, in summary, the number of entries that can be indexed in a
>> given number of IOs:
>>
>> IO count                1       2       3       4
>> 32 entry wide           32      1k      32k     1m
>> 4k dir block            70      500     25k     2.5m
>> 8k dir block            150     1k      1m      1000m
>>
>> And the number of directories required for a given number of
>> files if we limit XFS directories to 3 internal IOs:
>>
>> file count              1k      10k     100k    1m      10m     100m
>> 32 entry wide           32      320     3200    32k     320k    3.2g
>> 4k dir block            1       1       5       50      500     5k
>> 8k dir block            1       1       1       1       11      101
>>
>> So, as you can see, once you make the directory structure shallow
>> and wide, you can reach many more entries in the same number of IOs
>> and there is much lower inode/dentry cache footprint when you do so.
>> IOWs, on XFS you design the heirachy to provide the necessary
>> lookup/modification concurrency as IO scalibility as file counts
>> rise is already efficeintly handled by the filesystem's directory
>> structure.
>>
>> Doing this means the file store does not need to rebalance every 32
>> create/unlink operations. Nor do you need to be concerned about
>> maintaining a working set of directory inodes in cache under memory
>> pressure - there directory entries become the hotest items in the
>> cache and so will never get reclaimed.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-16  3:35               ` Dave Chinner
  2016-02-16  8:14                 ` David Casier
  2016-02-18 17:54                 ` David Casier
@ 2016-02-19 17:06                 ` Eric Sandeen
  2016-02-21 10:56                   ` David Casier
  2 siblings, 1 reply; 29+ messages in thread
From: Eric Sandeen @ 2016-02-19 17:06 UTC (permalink / raw)
  To: Dave Chinner, David Casier
  Cc: Ric Wheeler, Sage Weil, Ceph Development, Brian Foster



On 2/15/16 9:35 PM, Dave Chinner wrote:
> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>> Hi Dave,
>> 1TB is very wide for SSD.
> 
> It fills from the bottom, so you don't need 1TB to make it work
> in a similar manner to the ext4 hack being described.

I'm not sure it will work for smaller filesystems, though - we essentially
ignore the inode32 mount option for sufficiently small filesystems.

i.e. if inode numbers > 32 bits can't exist, we don't change the allocator,
at least not until the filesystem (possibly) gets grown later.

So for inode32 to impact behavior, it needs to be on a filesystem 
of sufficient size (at least 1 or 2T, depending on block size, inode
size, etc). Otherwise it will have no effect today.
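
(Back of the envelope: 2^32 inode numbers x 256-byte inodes is ~1 TiB of
inode-addressable space, which is roughly where that 1-2T figure comes from.
A quick sketch -- the device name is hypothetical:

  mkfs.xfs -f /dev/sdX1              # say, a ~500G partition
  mount -o inode32 /dev/sdX1 /mnt    # the mount option is accepted...
  xfs_info /mnt                      # ...but the fs is too small for >32-bit
                                     # inode numbers, so nothing changes
)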

Dave, I wonder if we need another mount option to essentially mean
"invoke the inode32 allocator regardless of filesystem size?"

-Eric

>> Exemple with only 10GiB :
>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
> 
> It's a nice toy, but it's not something that is going scale reliably
> for production.  That caveat at the end:
> 
> 	"With this model, filestore rearrange the tree very
> 	frequently : + 40 I/O every 32 objects link/unlink."
> 
> Indicates how bad the IO patterns will be when modifying the
> directory structure, and says to me that it's not a useful
> optimisation at all when you might be creating several thousand
> files/s on a filesystem. That will end up IO bound, SSD or not.
> 
> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-19 17:06                 ` Eric Sandeen
@ 2016-02-21 10:56                   ` David Casier
  2016-02-22 15:56                     ` Eric Sandeen
  0 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2016-02-21 10:56 UTC (permalink / raw)
  To: sandeen
  Cc: Dave Chinner, Ric Wheeler, Sage Weil, Ceph Development, Brian Foster

I ran a simple test with XFS:

dm-sdf6-sdg1 :
------------------------------------------------------
||  sdf6 : SSD part  ||  sdg1 : HDD (4TB)            ||
------------------------------------------------------

[root@aotest ~]# mkfs.xfs -f -i maxpct=0.2 /dev/mapper/dm-sdf6-sdg1
[root@aotest ~]# mount -o inode32 /dev/mapper/dm-sdf6-sdg1 /mnt

8 directories with 16, 32, ..., 128 sub-directories and 16, 32, ..., 128
files (82 bytes each)
1 xattr per dir and 3 xattrs per file (user.cephosd...)

3,800,000 files and directories in total
16 GiB was written to the SSD
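
(Roughly how the tree was generated -- the paths and xattr names below are
illustrative, not the exact script; this is the 16-objects-per-dir case,
repeated for 32, 48, ..., 128:

  for d in $(seq 1 16); do
    mkdir -p /mnt/bench/dir$d
    setfattr -n user.cephosd.dir -v 1 /mnt/bench/dir$d
    for f in $(seq 1 16); do
      head -c 82 /dev/urandom > /mnt/bench/dir$d/obj$f
      setfattr -n user.cephosd.a -v x /mnt/bench/dir$d/obj$f
      setfattr -n user.cephosd.b -v x /mnt/bench/dir$d/obj$f
      setfattr -n user.cephosd.c -v x /mnt/bench/dir$d/obj$f
    done
  done
)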

find | wc -l :
-------------------------------------
| Objects per dir |  % IOPS on SSD  |
-------------------------------------
|        16       |        99       |
|        32       |       100       |
|        48       |        93       |
|        64       |        88       |
|        80       |        88       |
|        96       |        86       |
|       112       |        87       |
|       128       |        88       |
-------------------------------------

find -exec getfattr '{}' \; :
-------------------------------------
| Objects per dir |  % IOPS on SSD  |
-------------------------------------
|        16       |        96       |
|        32       |        97       |
|        48       |        96       |
|        64       |        95       |
|        80       |        94       |
|        96       |        93       |
|       112       |        94       |
|       128       |        95       |
-------------------------------------
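
(The "% IOPS on SSD" column is from watching both devices during each run --
roughly something like:

  iostat -x 1 sdf sdg    # compare r/s on the SSD vs the HDD during the find
)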

It is true that filestore is not designed for Big Data workloads, and the
inode/xattr cache has to do the work.

I hope to see Bluestore in production quickly :)

2016-02-19 18:06 GMT+01:00 Eric Sandeen <esandeen@redhat.com>:
>
>
> On 2/15/16 9:35 PM, Dave Chinner wrote:
>> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>>> Hi Dave,
>>> 1TB is very wide for SSD.
>>
>> It fills from the bottom, so you don't need 1TB to make it work
>> in a similar manner to the ext4 hack being described.
>
> I'm not sure it will work for smaller filesystems, though - we essentially
> ignore the inode32 mount option for sufficiently small filesystems.
>
> i.e. if inode numbers > 32 bits can't exist, we don't change the allocator,
> at least not until the filesystem (possibly) gets grown later.
>
> So for inode32 to impact behavior, it needs to be on a filesystem
> of sufficient size (at least 1 or 2T, depending on block size, inode
> size, etc). Otherwise it will have no effect today.
>
> Dave, I wonder if we need another mount option to essentially mean
> "invoke the inode32 allocator regardless of filesystem size?"
>
> -Eric
>
>>> Exemple with only 10GiB :
>>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>>
>> It's a nice toy, but it's not something that is going scale reliably
>> for production.  That caveat at the end:
>>
>>       "With this model, filestore rearrange the tree very
>>       frequently : + 40 I/O every 32 objects link/unlink."
>>
>> Indicates how bad the IO patterns will be when modifying the
>> directory structure, and says to me that it's not a useful
>> optimisation at all when you might be creating several thousand
>> files/s on a filesystem. That will end up IO bound, SSD or not.
>>
>> Cheers,
>>
>> Dave.
>>



-- 

________________________________________________________

Regards,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Direct line: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-19  5:26                     ` Dave Chinner
  2016-02-19 11:28                       ` Blair Bethwaite
@ 2016-02-22 12:01                       ` Sage Weil
  2016-02-22 17:09                         ` David Casier
  1 sibling, 1 reply; 29+ messages in thread
From: Sage Weil @ 2016-02-22 12:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Casier, Ric Wheeler, Ceph Development, Brian Foster,
	Eric Sandeen, Benoît LORIOT

On Fri, 19 Feb 2016, Dave Chinner wrote:
> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
> >         "With this model, filestore rearrange the tree very
> >         frequently : + 40 I/O every 32 objects link/unlink."
> > It is the consequence of parameters :
> > filestore_merge_threshold = 2
> > filestore_split_multiple = 1
> > 
> > Not of ext4 customization.
> 
> It's a function of the directory structure you are using to work
> around the scalability deficiencies of the ext4 directory structure.
> i.e. the root cause is that you are working around an ext4 problem.

If only it were just that :(.  The other problem is that we need in-order 
enumeration of files/objects (with a particular sort order we define) and 
POSIX doesn't give us that.  Small directories let us read the whole thing 
and sort in memory.

If there is a 'good' directory size that tends to have a small/minimal 
number of IOs for listing all files it may make sense to change the 
defaults (picked semi-randomly several years back), but beyond that there 
isn't much to do here except wait for the replacement for this whole 
module that doesn't try to map our namespace onto POSIX's.
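
(For reference, those are the knobs quoted above; the split point works out
to roughly 16 * filestore_split_multiple * filestore_merge_threshold objects
per directory, so e.g. in ceph.conf:

  [osd]
  filestore merge threshold = 40
  filestore split multiple = 8

would let directories grow to ~5120 objects before splitting.  The numbers
here are purely illustrative, not a recommendation.)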

Optimizations to any of this FileStore code will see limited mileage since 
it'll be deprecated shortly anyway...

sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-21 10:56                   ` David Casier
@ 2016-02-22 15:56                     ` Eric Sandeen
  2016-02-22 16:12                       ` David Casier
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Sandeen @ 2016-02-22 15:56 UTC (permalink / raw)
  To: David Casier
  Cc: Dave Chinner, Ric Wheeler, Sage Weil, Ceph Development, Brian Foster

On 2/21/16 4:56 AM, David Casier wrote:
> I made a simple test with XFS
> 
> dm-sdf6-sdg1 :
> -------------------------------------------------------------------------------------------
> ||  sdf6 : SSD part ||           sdg1 : HDD (4TB)                         ||
> -------------------------------------------------------------------------------------------

If this is in response to my concern about not working on small
filesystems, the above is sufficiently large that inode32
won't be ignored.

> [root@aotest ~]# mkfs.xfs -f -i maxpct=0.2 /dev/mapper/dm-sdf6-sdg1

Hm, why set maxpct?  This does affect how the inode32 allocator
works, but I'm wondering if that's why you set it.  How did you arrive
at 0.2%?  Just want to be sure you understand what you're tuning.
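
One quick way to double-check what value mkfs actually applied, by the way:

  xfs_info /mnt | grep imaxpct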

Thanks,
-Eric

> [root@aotest ~]# mount -o inode32 /dev/mapper/dm-sdf6-sdg1 /mnt
> 
> 8 directory with 16, 32, ..., 128 sub-directory and 16, 32, ..., 128
> files (82 bytes)
> 1 xattr per dir and 3 xattr per file (user.cephosd...)
> 
> 3 800 000 files and directory
> 16 GiB was written on SSD
> 
> ------------------------------------------------------
> ||                 find | wc -l                   ||
> ------------------------------------------------------
> || Objects per dir || % IOPS on SSD ||
> ------------------------------------------------------
> ||           16         ||            99           ||
> ||           32         ||           100          ||
> ||           48         ||            93           ||
> ||           64         ||            88           ||
> ||           80         ||            88           ||
> ||           96         ||            86           ||
> ||          112        ||            87           ||
> ||          128        ||            88           ||
> -----------------------------------------------------
> 
> ------------------------------------------------------
> ||           find -exec getfattr '{}' \;         ||
> ------------------------------------------------------
> || Objects per dir || % IOPS on SSD ||
> ------------------------------------------------------
> ||           16         ||            96           ||
> ||           32         ||            97           ||
> ||           48         ||            96           ||
> ||           64         ||            95           ||
> ||           80         ||            94           ||
> ||           96         ||            93           ||
> ||          112        ||            94           ||
> ||          128        ||            95           ||
> -----------------------------------------------------
> 
> It is true that filestore is not designed to make Big Data and the
> cache must work inode / xattr
> 
> I hope to see quiclky Bluestore in production :)
> 
> 2016-02-19 18:06 GMT+01:00 Eric Sandeen <esandeen@redhat.com>:
>>
>>
>> On 2/15/16 9:35 PM, Dave Chinner wrote:
>>> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>>>> Hi Dave,
>>>> 1TB is very wide for SSD.
>>>
>>> It fills from the bottom, so you don't need 1TB to make it work
>>> in a similar manner to the ext4 hack being described.
>>
>> I'm not sure it will work for smaller filesystems, though - we essentially
>> ignore the inode32 mount option for sufficiently small filesystems.
>>
>> i.e. if inode numbers > 32 bits can't exist, we don't change the allocator,
>> at least not until the filesystem (possibly) gets grown later.
>>
>> So for inode32 to impact behavior, it needs to be on a filesystem
>> of sufficient size (at least 1 or 2T, depending on block size, inode
>> size, etc). Otherwise it will have no effect today.
>>
>> Dave, I wonder if we need another mount option to essentially mean
>> "invoke the inode32 allocator regardless of filesystem size?"
>>
>> -Eric
>>
>>>> Exemple with only 10GiB :
>>>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>>>
>>> It's a nice toy, but it's not something that is going scale reliably
>>> for production.  That caveat at the end:
>>>
>>>       "With this model, filestore rearrange the tree very
>>>       frequently : + 40 I/O every 32 objects link/unlink."
>>>
>>> Indicates how bad the IO patterns will be when modifying the
>>> directory structure, and says to me that it's not a useful
>>> optimisation at all when you might be creating several thousand
>>> files/s on a filesystem. That will end up IO bound, SSD or not.
>>>
>>> Cheers,
>>>
>>> Dave.
>>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-22 15:56                     ` Eric Sandeen
@ 2016-02-22 16:12                       ` David Casier
  2016-02-22 16:16                         ` Eric Sandeen
  0 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2016-02-22 16:12 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Dave Chinner, Ric Wheeler, Sage Weil, Ceph Development, Brian Foster

I carried out these tests very quickly and have not had time to
concentrate fully on XFS.
maxpct=0.2 => 0.2% of 4 TB = 8 GB,
because my existing SSD partitions are small.

If I'm not mistaken, and from what Dave says:
by default, data is written to the first 2^32 inodes of 256 bytes (= 1 TiB).
With maxpct, you set the maximum space used by inodes, as a
percentage of the disk.

2016-02-22 16:56 GMT+01:00 Eric Sandeen <sandeen@redhat.com>:
> On 2/21/16 4:56 AM, David Casier wrote:
>> I made a simple test with XFS
>>
>> dm-sdf6-sdg1 :
>> -------------------------------------------------------------------------------------------
>> ||  sdf6 : SSD part ||           sdg1 : HDD (4TB)                         ||
>> -------------------------------------------------------------------------------------------
>
> If this is in response to my concern about not working on small
> filesystems, the above is sufficiently large that inode32
> won't be ignored.
>
>> [root@aotest ~]# mkfs.xfs -f -i maxpct=0.2 /dev/mapper/dm-sdf6-sdg1
>
> Hm, why set maxpct?  This does affect how the inode32 allocator
> works, but I'm wondering if that's why you set it.  How did you arrive
> at 0.2%?  Just want to be sure you understand what you're tuning.
>
> Thanks,
> -Eric
>
>> [root@aotest ~]# mount -o inode32 /dev/mapper/dm-sdf6-sdg1 /mnt
>>
>> 8 directory with 16, 32, ..., 128 sub-directory and 16, 32, ..., 128
>> files (82 bytes)
>> 1 xattr per dir and 3 xattr per file (user.cephosd...)
>>
>> 3 800 000 files and directory
>> 16 GiB was written on SSD
>>
>> ------------------------------------------------------
>> ||                 find | wc -l                   ||
>> ------------------------------------------------------
>> || Objects per dir || % IOPS on SSD ||
>> ------------------------------------------------------
>> ||           16         ||            99           ||
>> ||           32         ||           100          ||
>> ||           48         ||            93           ||
>> ||           64         ||            88           ||
>> ||           80         ||            88           ||
>> ||           96         ||            86           ||
>> ||          112        ||            87           ||
>> ||          128        ||            88           ||
>> -----------------------------------------------------
>>
>> ------------------------------------------------------
>> ||           find -exec getfattr '{}' \;         ||
>> ------------------------------------------------------
>> || Objects per dir || % IOPS on SSD ||
>> ------------------------------------------------------
>> ||           16         ||            96           ||
>> ||           32         ||            97           ||
>> ||           48         ||            96           ||
>> ||           64         ||            95           ||
>> ||           80         ||            94           ||
>> ||           96         ||            93           ||
>> ||          112        ||            94           ||
>> ||          128        ||            95           ||
>> -----------------------------------------------------
>>
>> It is true that filestore is not designed to make Big Data and the
>> cache must work inode / xattr
>>
>> I hope to see quiclky Bluestore in production :)
>>
>> 2016-02-19 18:06 GMT+01:00 Eric Sandeen <esandeen@redhat.com>:
>>>
>>>
>>> On 2/15/16 9:35 PM, Dave Chinner wrote:
>>>> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>>>>> Hi Dave,
>>>>> 1TB is very wide for SSD.
>>>>
>>>> It fills from the bottom, so you don't need 1TB to make it work
>>>> in a similar manner to the ext4 hack being described.
>>>
>>> I'm not sure it will work for smaller filesystems, though - we essentially
>>> ignore the inode32 mount option for sufficiently small filesystems.
>>>
>>> i.e. if inode numbers > 32 bits can't exist, we don't change the allocator,
>>> at least not until the filesystem (possibly) gets grown later.
>>>
>>> So for inode32 to impact behavior, it needs to be on a filesystem
>>> of sufficient size (at least 1 or 2T, depending on block size, inode
>>> size, etc). Otherwise it will have no effect today.
>>>
>>> Dave, I wonder if we need another mount option to essentially mean
>>> "invoke the inode32 allocator regardless of filesystem size?"
>>>
>>> -Eric
>>>
>>>>> Exemple with only 10GiB :
>>>>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>>>>
>>>> It's a nice toy, but it's not something that is going scale reliably
>>>> for production.  That caveat at the end:
>>>>
>>>>       "With this model, filestore rearrange the tree very
>>>>       frequently : + 40 I/O every 32 objects link/unlink."
>>>>
>>>> Indicates how bad the IO patterns will be when modifying the
>>>> directory structure, and says to me that it's not a useful
>>>> optimisation at all when you might be creating several thousand
>>>> files/s on a filesystem. That will end up IO bound, SSD or not.
>>>>
>>>> Cheers,
>>>>
>>>> Dave.
>>>>
>>
>>
>>
>



-- 

________________________________________________________

Regards,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Direct line: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-22 16:12                       ` David Casier
@ 2016-02-22 16:16                         ` Eric Sandeen
  2016-02-22 17:17                           ` Howard Chu
  2016-02-23  5:20                           ` Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Sandeen @ 2016-02-22 16:16 UTC (permalink / raw)
  To: David Casier
  Cc: Dave Chinner, Ric Wheeler, Sage Weil, Ceph Development, Brian Foster

On 2/22/16 10:12 AM, David Casier wrote:
> I have carried out tests very quickly and I have not had time to
> concentrate fully on XFS.
>  maxpct =0.2 => 0.2% of 4To = 8Go
> Because my existing ssd partitions are small
> 
> If i'm not mistaken, and with what Dave says :
> By default, data is written to 2^32 inodes of 256 bytes (= 1TiB).
> With maxpct, you set the maximum size used by inodes, depending on the
> percentage of disk

Yes, that's reasonable, I just wanted to be sure.  I hadn't seen
it stated that your SSD was that small.

Thanks,
-Eric

> 2016-02-22 16:56 GMT+01:00 Eric Sandeen <sandeen@redhat.com>:
>> On 2/21/16 4:56 AM, David Casier wrote:
>>> I made a simple test with XFS
>>>
>>> dm-sdf6-sdg1 :
>>> -------------------------------------------------------------------------------------------
>>> ||  sdf6 : SSD part ||           sdg1 : HDD (4TB)                         ||
>>> -------------------------------------------------------------------------------------------
>>
>> If this is in response to my concern about not working on small
>> filesystems, the above is sufficiently large that inode32
>> won't be ignored.
>>
>>> [root@aotest ~]# mkfs.xfs -f -i maxpct=0.2 /dev/mapper/dm-sdf6-sdg1
>>
>> Hm, why set maxpct?  This does affect how the inode32 allocator
>> works, but I'm wondering if that's why you set it.  How did you arrive
>> at 0.2%?  Just want to be sure you understand what you're tuning.
>>
>> Thanks,
>> -Eric
>>
>>> [root@aotest ~]# mount -o inode32 /dev/mapper/dm-sdf6-sdg1 /mnt
>>>
>>> 8 directory with 16, 32, ..., 128 sub-directory and 16, 32, ..., 128
>>> files (82 bytes)
>>> 1 xattr per dir and 3 xattr per file (user.cephosd...)
>>>
>>> 3 800 000 files and directory
>>> 16 GiB was written on SSD
>>>
>>> ------------------------------------------------------
>>> ||                 find | wc -l                   ||
>>> ------------------------------------------------------
>>> || Objects per dir || % IOPS on SSD ||
>>> ------------------------------------------------------
>>> ||           16         ||            99           ||
>>> ||           32         ||           100          ||
>>> ||           48         ||            93           ||
>>> ||           64         ||            88           ||
>>> ||           80         ||            88           ||
>>> ||           96         ||            86           ||
>>> ||          112        ||            87           ||
>>> ||          128        ||            88           ||
>>> -----------------------------------------------------
>>>
>>> ------------------------------------------------------
>>> ||           find -exec getfattr '{}' \;         ||
>>> ------------------------------------------------------
>>> || Objects per dir || % IOPS on SSD ||
>>> ------------------------------------------------------
>>> ||           16         ||            96           ||
>>> ||           32         ||            97           ||
>>> ||           48         ||            96           ||
>>> ||           64         ||            95           ||
>>> ||           80         ||            94           ||
>>> ||           96         ||            93           ||
>>> ||          112        ||            94           ||
>>> ||          128        ||            95           ||
>>> -----------------------------------------------------
>>>
>>> It is true that filestore is not designed to make Big Data and the
>>> cache must work inode / xattr
>>>
>>> I hope to see quiclky Bluestore in production :)
>>>
>>> 2016-02-19 18:06 GMT+01:00 Eric Sandeen <esandeen@redhat.com>:
>>>>
>>>>
>>>> On 2/15/16 9:35 PM, Dave Chinner wrote:
>>>>> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>>>>>> Hi Dave,
>>>>>> 1TB is very wide for SSD.
>>>>>
>>>>> It fills from the bottom, so you don't need 1TB to make it work
>>>>> in a similar manner to the ext4 hack being described.
>>>>
>>>> I'm not sure it will work for smaller filesystems, though - we essentially
>>>> ignore the inode32 mount option for sufficiently small filesystems.
>>>>
>>>> i.e. if inode numbers > 32 bits can't exist, we don't change the allocator,
>>>> at least not until the filesystem (possibly) gets grown later.
>>>>
>>>> So for inode32 to impact behavior, it needs to be on a filesystem
>>>> of sufficient size (at least 1 or 2T, depending on block size, inode
>>>> size, etc). Otherwise it will have no effect today.
>>>>
>>>> Dave, I wonder if we need another mount option to essentially mean
>>>> "invoke the inode32 allocator regardless of filesystem size?"
>>>>
>>>> -Eric
>>>>
>>>>>> Exemple with only 10GiB :
>>>>>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>>>>>
>>>>> It's a nice toy, but it's not something that is going scale reliably
>>>>> for production.  That caveat at the end:
>>>>>
>>>>>       "With this model, filestore rearrange the tree very
>>>>>       frequently : + 40 I/O every 32 objects link/unlink."
>>>>>
>>>>> Indicates how bad the IO patterns will be when modifying the
>>>>> directory structure, and says to me that it's not a useful
>>>>> optimisation at all when you might be creating several thousand
>>>>> files/s on a filesystem. That will end up IO bound, SSD or not.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dave.
>>>>>
>>>
>>>
>>>
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-22 12:01                       ` Sage Weil
@ 2016-02-22 17:09                         ` David Casier
  2016-02-22 17:16                           ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2016-02-22 17:09 UTC (permalink / raw)
  To: Sage Weil
  Cc: Dave Chinner, Ric Wheeler, Ceph Development, Brian Foster,
	Eric Sandeen, Benoît LORIOT

Hi Sage,
Are you optimistic about the release of Bluestore?

2016-02-22 13:01 GMT+01:00 Sage Weil <sage@newdream.net>:
> On Fri, 19 Feb 2016, Dave Chinner wrote:
>> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>> >         "With this model, filestore rearrange the tree very
>> >         frequently : + 40 I/O every 32 objects link/unlink."
>> > It is the consequence of parameters :
>> > filestore_merge_threshold = 2
>> > filestore_split_multiple = 1
>> >
>> > Not of ext4 customization.
>>
>> It's a function of the directory structure you are using to work
>> around the scalability deficiencies of the ext4 directory structure.
>> i.e. the root cause is that you are working around an ext4 problem.
>
> If only it were just that :(.  The other problem is that we need in-order
> enumeration of files/objects (with a particular sort order we define) and
> POSIX doesn't give us that.  Small directories let us read the whole thing
> and sort in memory.
>
> If there is a 'good' directory size that tends to have a small/minimal
> number of IOs for listing all files it may make sense to change the
> defaults (picked semi-randomly several years back), but beyond that there
> isn't much to do here except wait for the replacement for this whole
> module that doesn't try to map our namespace onto POSIX's.
>
> Optimizations to any of this FileStore code will see limited mileage since
> it'll be deprecated shortly anyway...
>
> sage



-- 

________________________________________________________

Regards,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Direct line: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-22 17:09                         ` David Casier
@ 2016-02-22 17:16                           ` Sage Weil
  0 siblings, 0 replies; 29+ messages in thread
From: Sage Weil @ 2016-02-22 17:16 UTC (permalink / raw)
  To: David Casier
  Cc: Dave Chinner, Ric Wheeler, Ceph Development, Brian Foster,
	Eric Sandeen, Benoît LORIOT

On Mon, 22 Feb 2016, David Casier wrote:
> Hi Sage,
> Are you optimistic about the release of Bluestore ?

Yes.  It'll be part of Jewel, although still not the default and still 
marked experimental (since it's a complete rewrite of the storage layer 
and obviously pretty critical).  The goal is to make it default by the 
next release (kraken, september-ish).
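
For what it's worth, enabling it on a Jewel test cluster will look something
like this (option names as of the current code, and they may still change):

  [osd]
  enable experimental unrecoverable data corrupting features = bluestore rocksdb
  osd objectstore = bluestore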

sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-22 16:16                         ` Eric Sandeen
@ 2016-02-22 17:17                           ` Howard Chu
  2016-02-23  5:20                           ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Howard Chu @ 2016-02-22 17:17 UTC (permalink / raw)
  To: Ceph Development

(Just as an aside on the Subject - in preliminary testing with a
run-of-the-mill 32GB microSD card, LMDB on a raw partition performs synchronous commits 
~8x faster than the same card formatted with an ext4 filesystem. There's a 
huge cost to journaling filesystems that a reliable database engine simply 
doesn't need.)

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
  2016-02-22 16:16                         ` Eric Sandeen
  2016-02-22 17:17                           ` Howard Chu
@ 2016-02-23  5:20                           ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2016-02-23  5:20 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: David Casier, Ric Wheeler, Sage Weil, Ceph Development, Brian Foster

On Mon, Feb 22, 2016 at 10:16:43AM -0600, Eric Sandeen wrote:
> On 2/22/16 10:12 AM, David Casier wrote:
> > I have carried out tests very quickly and I have not had time to
> > concentrate fully on XFS.
> >  maxpct =0.2 => 0.2% of 4To = 8Go
> > Because my existing ssd partitions are small
> > 
> > If i'm not mistaken, and with what Dave says :
> > By default, data is written to 2^32 inodes of 256 bytes (= 1TiB).
> > With maxpct, you set the maximum size used by inodes, depending on the
> > percentage of disk

maxpct doesn't work like that. It's a limit on the count of inodes,
not a limit on their physical location.

And, FWIW, mkfs does not take floating point numbers, so that
mkfs.xfs command line is not doing what you think it's doing. In
fact, it's probably setting maxpct to zero, and the kernel is then
ignoring it because it's invalid.
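
Something like the following -- an integer percentage, then checking what was
actually applied -- avoids that trap (device path taken from your earlier
mail):

  mkfs.xfs -f -i maxpct=1 /dev/mapper/dm-sdf6-sdg1
  mount -o inode32 /dev/mapper/dm-sdf6-sdg1 /mnt
  xfs_info /mnt | grep imaxpct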

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: [newstore (again)] how disable double write WAL
  2015-10-12 18:52     ` David Casier
@ 2015-10-12 19:33       ` Sage Weil
  0 siblings, 0 replies; 29+ messages in thread
From: Sage Weil @ 2015-10-12 19:33 UTC (permalink / raw)
  To: David Casier
  Cc: Ceph Development, Sébastien VALSEMEY, benoit.loriot,
	Denis Saget, luc.petetin

Hi David-

On Mon, 12 Oct 2015, David Casier wrote:
> Ok,
> Great.
> 
> With these  settings :
> //
> newstore_max_dir_size = 4096
> newstore_sync_io = true
> newstore_sync_transaction = true
> newstore_sync_submit_transaction = true

Is this a hard disk?  Those settings probably don't make sense since it 
does every IO synchronously, blocking the submitting IO path...
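
(i.e. for a spinning disk you'd normally leave these off and let the
background threads batch things up -- something like the below, which is
illustrative rather than a tuned recommendation:

  newstore_sync_io = false
  newstore_sync_transaction = false
  newstore_sync_submit_transaction = false
)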

> newstore_sync_wal_apply = true
> newstore_overlay_max = 0
> //
> 
> And direct IO in the benchmark tool (fio)
> 
> I see that the HDD is 100% charged and there are notransfer of /db to
> /fragments after stopping benchmark : Great !
> 
> But when i launch a bench with random blocs of 256k, i see random blocs
> between 32k and 256k on HDD. Any idea ?

Random IOs have to be write ahead logged in rocksdb, which has its own IO 
pattern.  Since you made everything sync above I think it'll depend on 
how many osd threads get batched together at a time.. maybe.  Those 
settings aren't something I've really tested, and probably only make 
sense with very fast NVMe devices.

> Debits to the HDD are about 8MBps when they could be higher with larger blocs> (~30MBps)
> And 70 MBps without fsync (hard drive cache disabled).
> 
> Other questions :
> newstore_sync_io -> true = fsync immediatly, false = fsync later (Thread
> fsync_wq) ?

yes

> newstore_sync_transaction -> true = sync in DB ?

synchronously do the rocksdb commit too

> newstore_sync_submit_transaction -> if false then kv_queue (only if
> newstore_sync_transaction=false) ?

yeah.. there is an annoying rocksdb behavior that makes an async 
transaction submit block if a sync one is in progress, so this queues them 
up and explicitly batches them.

> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?

the txn commit completion threads can do the wal work synchronously.. this 
is only a good idea if it's doing aio (which it generally is).

> Is it true ?
> 
> Way for cache with battery (sync DB and no sync data) ?

?
s

> 
> Thanks for everything !
> 
> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > On Mon, 12 Oct 2015, David Casier wrote:
> > > Hello everybody,
> > > fragment is stored in rocksdb before being written to "/fragments" ?
> > > I separed "/db" and "/fragments" but during the bench, everything is
> > > writing
> > > to "/db"
> > > I changed options "newstore_sync_*" without success.
> > > 
> > > Is there any way to write all metadata in "/db" and all data in
> > > "/fragments" ?
> > You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > But if you are overwriting an existing object, doing write-ahead logging
> > is usually unavoidable because we need to make the update atomic (and the
> > underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > mitigates this somewhat for larger writes by limiting fragment size, but
> > for small IOs this is pretty much always going to be the case.  For small
> > IOs, though, putting things in db/ is generally better since we can
> > combine many small ios into a single (rocksdb) journal/wal write.  And
> > often leave them there (via the 'overlay' behavior).
> > 
> > sage
> > 
> 
> 
> -- 
> ________________________________________________________
> 
> Cordialement,
> 
> *David CASIER
> DCConsulting SARL
> 
> 
> 4 Trait d'Union
> 77127 LIEUSAINT
> 
> **Ligne directe: _01 75 98 53 85_
> Email: _david.casier@aevoo.fr_
> * ________________________________________________________
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: [newstore (again)] how disable double write WAL
  2015-10-12 13:01   ` Sage Weil
@ 2015-10-12 18:52     ` David Casier
  2015-10-12 19:33       ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2015-10-12 18:52 UTC (permalink / raw)
  To: Sage Weil, Ceph Development
  Cc: Sébastien VALSEMEY, benoit.loriot, Denis Saget, luc.petetin

Ok,
Great.

With these settings:
//
newstore_max_dir_size = 4096
newstore_sync_io = true
newstore_sync_transaction = true
newstore_sync_submit_transaction = true
newstore_sync_wal_apply = true
newstore_overlay_max = 0
//

And direct IO in the benchmark tool (fio)
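
(Roughly the kind of fio job in question -- the exact parameters below are
illustrative, not copied from my run:

  fio --name=bench --filename=/path/to/test/file --size=10G --rw=randwrite \
      --bs=256k --direct=1 --ioengine=libaio --iodepth=16
)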

I see that the HDD is 100% loaded and there is no transfer from /db to
/fragments after stopping the benchmark: great!

But when I launch a bench with random blocks of 256k, I see random blocks
between 32k and 256k on the HDD. Any idea?

Throughput to the HDD is about 8 MB/s when it could be higher with larger
blocks (~30 MB/s),
and 70 MB/s without fsync (hard drive cache disabled).

Other questions:
newstore_sync_io -> true = fsync immediately, false = fsync later (fsync_wq
thread)?
newstore_sync_transaction -> true = synchronous commit in the DB?
newstore_sync_submit_transaction -> if false, then kv_queue (only if
newstore_sync_transaction=false)?
newstore_sync_wal_apply = true -> if false, then the WAL is applied later (wal_wq thread)?

Is that right?

Is there a way to use a battery-backed cache (sync the DB but not the data)?

Thanks for everything !

On 10/12/2015 03:01 PM, Sage Weil wrote:
> On Mon, 12 Oct 2015, David Casier wrote:
>> Hello everybody,
>> fragment is stored in rocksdb before being written to "/fragments" ?
>> I separed "/db" and "/fragments" but during the bench, everything is writing
>> to "/db"
>> I changed options "newstore_sync_*" without success.
>>
>> Is there any way to write all metadata in "/db" and all data in "/fragments" ?
> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> But if you are overwriting an existing object, doing write-ahead logging
> is usually unavoidable because we need to make the update atomic (and the
> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> mitigates this somewhat for larger writes by limiting fragment size, but
> for small IOs this is pretty much always going to be the case.  For small
> IOs, though, putting things in db/ is generally better since we can
> combine many small ios into a single (rocksdb) journal/wal write.  And
> often leave them there (via the 'overlay' behavior).
>
> sage
>


-- 
________________________________________________________

Regards,

David CASIER
DCConsulting SARL


4 Trait d'Union
77127 LIEUSAINT

Direct line: 01 75 98 53 85
Email: david.casier@aevoo.fr
________________________________________________________

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: [newstore (again)] how disable double write WAL
  2015-10-12 12:50 ` David Casier
@ 2015-10-12 13:01   ` Sage Weil
  2015-10-12 18:52     ` David Casier
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2015-10-12 13:01 UTC (permalink / raw)
  To: David Casier; +Cc: Ceph Development

On Mon, 12 Oct 2015, David Casier wrote:
> Hello everybody,
> fragment is stored in rocksdb before being written to "/fragments" ?
> I separed "/db" and "/fragments" but during the bench, everything is writing
> to "/db"
> I changed options "newstore_sync_*" without success.
> 
> Is there any way to write all metadata in "/db" and all data in "/fragments" ?

You can set newstore_overlay_max = 0 to avoid most data landing in db/.  
But if you are overwriting an existing object, doing write-ahead logging 
is usually unavoidable because we need to make the update atomic (and the 
underlying posix fs doesn't provide that).  The wip-newstore-frags branch 
mitigates this somewhat for larger writes by limiting fragment size, but 
for small IOs this is pretty much always going to be the case.  For small 
IOs, though, putting things in db/ is generally better since we can 
combine many small ios into a single (rocksdb) journal/wal write.  And 
often leave them there (via the 'overlay' behavior).
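
In ceph.conf terms that is simply:

  [osd]
  newstore overlay max = 0

though, as above, small overwrites will still go through the rocksdb WAL for
atomicity.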

sage


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Fwd: [newstore (again)] how disable double write WAL
       [not found] <561BABB5.2090209@aevoo.fr>
@ 2015-10-12 12:50 ` David Casier
  2015-10-12 13:01   ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: David Casier @ 2015-10-12 12:50 UTC (permalink / raw)
  To: Ceph Development

Hello everybody,
Are fragments stored in rocksdb before being written to "/fragments"?
I separated "/db" and "/fragments", but during the bench everything is
written to "/db".
I changed the "newstore_sync_*" options without success.

Is there any way to write all metadata to "/db" and all data to
"/fragments"?

-- 



^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-02-23  5:20 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <9D046674-EA8B-4CB5-B049-3CF665D4ED64@aevoo.fr>
2015-11-24 20:42 ` Fwd: [newstore (again)] how disable double write WAL Sage Weil
     [not found]   ` <CA+gn+znHyioZhOvuidN1pvMgRMOMvjbjcues_+uayYVadetz=A@mail.gmail.com>
2015-12-01 20:34     ` Fwd: " David Casier
2015-12-01 22:02       ` Sage Weil
2015-12-04 20:12         ` Ric Wheeler
2015-12-04 20:20           ` Eric Sandeen
2015-12-08  4:46           ` Dave Chinner
2016-02-15 15:18             ` David Casier
2016-02-15 16:21               ` Eric Sandeen
2016-02-16  3:35               ` Dave Chinner
2016-02-16  8:14                 ` David Casier
2016-02-16  8:39                   ` David Casier
2016-02-19  5:26                     ` Dave Chinner
2016-02-19 11:28                       ` Blair Bethwaite
2016-02-19 12:57                         ` Mark Nelson
2016-02-22 12:01                       ` Sage Weil
2016-02-22 17:09                         ` David Casier
2016-02-22 17:16                           ` Sage Weil
2016-02-18 17:54                 ` David Casier
2016-02-19 17:06                 ` Eric Sandeen
2016-02-21 10:56                   ` David Casier
2016-02-22 15:56                     ` Eric Sandeen
2016-02-22 16:12                       ` David Casier
2016-02-22 16:16                         ` Eric Sandeen
2016-02-22 17:17                           ` Howard Chu
2016-02-23  5:20                           ` Dave Chinner
     [not found] <561BABB5.2090209@aevoo.fr>
2015-10-12 12:50 ` David Casier
2015-10-12 13:01   ` Sage Weil
2015-10-12 18:52     ` David Casier
2015-10-12 19:33       ` Sage Weil
