From: Sage Weil <sage@newdream.net>
To: Ramesh Chander <Ramesh.Chander@sandisk.com>
Cc: Allen Samuels <Allen.Samuels@sandisk.com>,
	Somnath Roy <Somnath.Roy@sandisk.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: Bluestore different allocator performance Vs FileStore
Date: Fri, 12 Aug 2016 15:26:16 +0000 (UTC)
Message-ID: <alpine.DEB.2.11.1608121521010.29701@piezo.us.to>
In-Reply-To: <alpine.DEB.2.11.1608112115250.17762@piezo.us.to>

On Thu, 11 Aug 2016, Sage Weil wrote:
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > > 
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by BlueFS
> > > > was contained in the snapshot/journal of BlueFS itself, NOT in the KV
> > > > store itself. This requires that upon startup, we replay the BlueFS
> > > > snapshot/journal into the FreeListManager so that it properly records
> > > > the consumption of BlueFS space (since that allocation MAY NOT be
> > > > accurate within the FreeListmanager itself). But that this playback
> > > > need not generate any KVStore operations (since those are duplicates of
> > > > the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case, you
> > > > don't want to commit 't'.
> > > >
> > > > That suggests to me that a version of allocate that doesn't take a
> > > > transaction could easily be created and would have the speed we're
> > > > looking for (and independence from the BitMapAllocator-to-KVStore
> > > > chunking).
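A transaction-free variant could look roughly like the sketch below. The types here are simplified, hypothetical stand-ins (not the real BlueStore classes); the point is just that the in-memory bitmap update can be separated from the optional kv transaction:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, stripped-down stand-ins: a "transaction" is just a
// list of pending kv ops, and the freelist is a plain in-memory
// bitmap with one bit per block.
struct KVTransaction {
  std::vector<std::string> ops;   // pending kv-store writes
};

class BitmapFreelist {
  std::vector<bool> allocated;    // one bit per block
  uint64_t block_size;
public:
  BitmapFreelist(uint64_t total_bytes, uint64_t bs)
    : allocated(total_bytes / bs, false), block_size(bs) {}

  // Original form: flips the in-memory bits *and* queues kv updates.
  void allocate(uint64_t off, uint64_t len, KVTransaction* t) {
    allocate(off, len);
    t->ops.push_back("set_bits " + std::to_string(off) +
                     "~" + std::to_string(len));
  }

  // Transaction-free form: only the in-memory state changes, so no
  // kv-store work is generated -- which is all that's needed when the
  // authoritative record lives elsewhere (e.g. the BlueFS journal).
  void allocate(uint64_t off, uint64_t len) {
    for (uint64_t b = off / block_size; b < (off + len) / block_size; ++b)
      allocated[b] = true;
  }

  bool is_allocated(uint64_t off) const {
    return allocated[off / block_size];
  }
};
```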
> > > 
> > > Oh, I see.  Yeah, you're right--this step isn't really necessary, as long as we
> > > ensure that the auxiliary representation of what bluefs owns
> > > (bluefs_extents in the superblock) is still passed into the Allocator during
> > > initialization.  Having the freelist reflect that this space is "in use"
> > > (by bluefs) and thus off limits to bluestore is simple but not strictly
> > > necessary.
> > > 
> > > I'll work on a PR that avoids this...
> 
> https://github.com/ceph/ceph/pull/10698
> 
> Ramesh, can you give it a try?
> 
> > > > I suspect that we also have long startup times because we're doing the
> > > > same underlying bitmap operations except they come from the BlueFS
> > > > replay code instead of the BlueFS initialization code, but same
> > > > problem with likely the same fix.
> > > 
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist the freelist at
> > > all)... we initialize the in-memory Allocator state from the metadata in the
> > > bluefs log.  I think we should be fine on this end.
> > 
> > Likely that code suffers from the same problem -- a false need to update 
> > the KV Store.  (During the playback, BlueFS extents are converted to 
> > bitmap runs; it's essentially the same lower-level code as the case 
> > we're seeing now, but instead of being driven by an artificial "big 
> > run", it'll be driven from the BlueFS journal replay code.)  But that's 
> > just a guess; I don't have time to track down the actual code right now.
> 
> BlueFS can't touch the freelist (or kv store, ever) since it ultimately 
> backs the kv store and that would be problematic.  We do initialize the 
> bluefs Allocator's in-memory state, but that's it.
> 
> The PR above changes BlueStore::_init_alloc() so that BlueStore's 
> Allocator state is initialized with both the freelist state (from the kv 
> store) *and* the bluefs_extents list (from the bluestore superblock).  
> (From this Allocator's perspective, all of bluefs's space is allocated 
> and can't be used.  BlueFS has its own separate instance to do its 
> internal allocations.)
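In rough, simplified terms that initialization looks like the following (the names and the exact-match extent handling are hypothetical simplifications, not the real Allocator API):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Extent = std::pair<uint64_t, uint64_t>;  // offset, length

// Hypothetical, stripped-down Allocator: tracks free extents only.
class Allocator {
  std::map<uint64_t, uint64_t> free_map;  // offset -> length
public:
  void init_add_free(uint64_t off, uint64_t len) { free_map[off] = len; }
  // Simplified: assumes the removed extent exactly matches a free one.
  void init_rm_free(uint64_t off, uint64_t len) {
    auto it = free_map.find(off);
    assert(it != free_map.end() && it->second == len);
    free_map.erase(it);
  }
  uint64_t free_bytes() const {
    uint64_t n = 0;
    for (const auto& kv : free_map) n += kv.second;
    return n;
  }
};

// Sketch of the _init_alloc idea: free space comes from the kv
// freelist, then bluefs's extents (from the superblock) are carved
// back out so they can never be handed to bluestore.
Allocator init_alloc(const std::vector<Extent>& freelist,
                     const std::vector<Extent>& bluefs_extents) {
  Allocator a;
  for (const auto& [off, len] : freelist)       a.init_add_free(off, len);
  for (const auto& [off, len] : bluefs_extents) a.init_rm_free(off, len);
  return a;
}
```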

Ah, okay, so after our conversation in standup I went and looked at the 
code some more and realized I've been thinking about the 
BitmapFreelistManager and not the BitMapAllocator.  The ~40s is all CPU 
time spent updating in-memory bits, and has nothing to do with pushing 
updates through rocksdb.  Sorry for the confusing conversation.

So... I think there is one thing we can do: change the initialization of 
the allocator state from the freelist so that the assumption is that space 
is free and we tell it what is allocated (currently we assume everything 
is allocated and tell it what is free).  I'm not sure it's worth it, 
though: we'll just make things slower to start up on a full OSD instead of 
slower on an empty OSD.  And it seems like the CPU time really won't be 
significant anyway once the debugging stuff is taken out.
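Schematically, the two initialization strategies differ only in the starting state and which extent list gets walked (simplified, hypothetical types -- the real bitmap code is more involved):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-block bitmap, 1 bit per block, true = allocated.
struct Bitmap {
  std::vector<bool> bits;
  Bitmap(size_t nblocks, bool v) : bits(nblocks, v) {}
  void set_range(size_t b, size_t n, bool v) {
    for (size_t i = b; i < b + n; ++i) bits[i] = v;
  }
};

// Current scheme: start fully allocated, clear the free extents.
// Startup cost scales with the amount of *free* space.
Bitmap init_from_free(size_t nblocks,
                      const std::vector<std::pair<size_t, size_t>>& free_ext) {
  Bitmap bm(nblocks, true);
  for (const auto& [b, n] : free_ext) bm.set_range(b, n, false);
  return bm;
}

// Inverted scheme: start fully free, set the allocated extents.
// Startup cost scales with the amount of *used* space instead.
Bitmap init_from_allocated(size_t nblocks,
                           const std::vector<std::pair<size_t, size_t>>& alloc_ext) {
  Bitmap bm(nblocks, false);
  for (const auto& [b, n] : alloc_ext) bm.set_range(b, n, true);
  return bm;
}
```

Either way the work is proportional to the extents walked, so inverting it only shifts the cost from empty OSDs to full ones.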

I think this PR

	https://github.com/ceph/ceph/pull/10698

is still a good idea, though, since it avoids useless freelist kv work 
during mkfs.

Does that sound right?  Or am I still missing something?

Thanks for your patience!
sage


