From: Somnath Roy <Somnath.Roy@sandisk.com>
To: Sage Weil <sage@newdream.net>, Allen Samuels <Allen.Samuels@sandisk.com>
Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: Bluestore different allocator performance Vs FileStore
Date: Fri, 12 Aug 2016 06:19:47 +0000
Message-ID: <BL2PR02MB2115F059A98E5A148C7D26ACF41F0@BL2PR02MB2115.namprd02.prod.outlook.com>
In-Reply-To: alpine.DEB.2.11.1608112115250.17762@piezo.us.to

Ramesh, one more finding while debugging this:
BitAllocator.cc includes the system <assert.h> (/usr/include/assert.h). This collides with dout() (which I was trying to introduce) and gives a compilation error. In the end, I had to comment out <assert.h> and use the Ceph assert instead.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 11, 2016 8:10 PM
To: 'Sage Weil'; Allen Samuels
Cc: Ramesh Chander; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

Sage,
I tried your PR, but it is not helping much. As the log below shows, each insert_free() call takes ~40 sec to complete, and there are two such calls.

2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free instance 140128595341440 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear bits in 0x6ab7d100000

2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free instance 140127837929472 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear bits in 0x6ab7d100000
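
For reference, the ~40 sec figure can be reproduced from the timestamps above. The following is only a small sketch of that arithmetic; the helper names (to_usec, elapsed_usec, blocks_in) are made up for illustration, and the 4 KiB block size used in the note below is an assumption, not something the log states.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Convert the "HH:MM:SS.uuuuuu" part of a log timestamp to microseconds
// since midnight. Assumes exactly six fractional digits, as in the log above.
int64_t to_usec(const std::string& t) {
  int h = 0, m = 0, s = 0, us = 0;
  std::sscanf(t.c_str(), "%d:%d:%d.%d", &h, &m, &s, &us);
  return ((h * 60LL + m) * 60 + s) * 1000000LL + us;
}

// Elapsed microseconds between two timestamps on the same day.
int64_t elapsed_usec(const std::string& start, const std::string& end) {
  return to_usec(end) - to_usec(start);
}

// Number of whole blocks in an extent of `length` bytes.
// The block size is a parameter because the log does not state it.
uint64_t blocks_in(uint64_t length, uint64_t block_size) {
  return length / block_size;
}
```

elapsed_usec("17:32:48.086111", "17:33:27.843948") gives 39757837 usec for the first call and elapsed_usec("17:33:30.839095", "17:34:10.517809") gives 39678714 usec for the second, so about 79.4 sec in total. With an assumed 4 KiB block, blocks_in(0x6ab7d14f000, 4096) is 1790431567, i.e. roughly 1.8 billion bits to touch for each pass over the range.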

I have also tried the following settings, and they did not help either:

        bluestore_bluefs_min_ratio = .01
        bluestore_freelist_blocks_per_key = 512


I did some debugging to find out which call inside this function is taking the time, and I found this within BitAllocator::free_blocks:

  debug_assert(is_allocated(start_block, num_blocks));

  free_blocks_int(start_block, num_blocks);

Skipping this debug_assert reduced the total time from ~80 sec to ~49 sec, so that's a significant improvement.

Next, I found that debug_assert(is_allocated()) is called from free_blocks_int as well. I blindly commented out all the debug_assert(is_allocated()) calls, and performance became similar to the stupid allocator and FileStore.
I didn't look into is_allocated() any further, as my guess is that we can safely skip this check during mkfs()?
But it would be good to optimize it, since it may add latency in the IO path (?).
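
To illustrate why the check is costly, here is a minimal sketch using a toy bitmap. This is NOT the BitAllocator code; ToyBitmap and the TOY_EXPENSIVE_CHECKS flag are hypothetical names. The point is only that a range check like is_allocated() walks the whole range before the free pass walks it again, and that gating the check behind a compile-time flag removes the first walk.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy bitmap, one bit per block. Not Ceph's BitAllocator.
class ToyBitmap {
 public:
  explicit ToyBitmap(uint64_t nblocks) : bits_(nblocks, false) {}

  void mark_allocated(uint64_t start, uint64_t n) {
    for (uint64_t i = start; i < start + n; ++i) bits_[i] = true;
  }

  // O(n) scan: visits every block in [start, start + n).
  bool is_allocated(uint64_t start, uint64_t n) const {
    for (uint64_t i = start; i < start + n; ++i) {
      if (!bits_[i]) return false;
    }
    return true;
  }

  void free_blocks(uint64_t start, uint64_t n) {
#ifdef TOY_EXPENSIVE_CHECKS
    // Sanity check only; doubles the work for large ranges when enabled.
    assert(is_allocated(start, n));
#endif
    for (uint64_t i = start; i < start + n; ++i) bits_[i] = false;
  }

  uint64_t count_allocated() const {
    uint64_t c = 0;
    for (bool b : bits_) c += b ? 1 : 0;
    return c;
  }

 private:
  std::vector<bool> bits_;
};
```

Built without -DTOY_EXPENSIVE_CHECKS, free_blocks() makes a single pass over the range; with it, the same range is scanned twice. If the real check is nested (called from both free_blocks and free_blocks_int, as described above), the extra passes add up accordingly.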

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: Thursday, August 11, 2016 2:20 PM
To: Allen Samuels
Cc: Ramesh Chander; Somnath Roy; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 1:24 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > clarify as needed.
> > >
> > > I thought that the authoritative indication of space used by
> > > BlueFS was contained in the snapshot/journal of BlueFS itself, NOT
> > > in the KV store itself. This requires that upon startup, we replay
> > > the BlueFS snapshot/journal into the FreeListManager so that it
> > > properly records the consumption of BlueFS space (since that
> > > allocation MAY NOT be accurate within the FreeListmanager itself).
> > > > But that this playback need not generate any KVStore operations
> > > (since those are duplicates of the BlueFS).
> > >
> > > So in the code you cite:
> > >
> > > fm->allocate(0, reserved, t);
> > >
> > > There's no need to commit 't', and in fact, in the general case,
> > > you don't want to commit 't'.
> > >
> > > > That suggests to me that a version of allocate that doesn't take a
> > > > transaction could easily be created and would have the speed we're
> > > > looking for (and independence from the BitMapAllocator to KVStore chunking).
> >
> > Oh, I see.  Yeah, you're right--this step isn't really necessary, as
> > > long as we ensure that the auxiliary representation of what bluefs
> > owns (bluefs_extents in the superblock) is still passed into the
> > Allocator during initialization.  Having the freelist reflect the
> > allocator that this space was "in use" (by bluefs) and thus off
> > limits to bluestore is simple but not strictly necessary.
> >
> > I'll work on a PR that avoids this...

https://github.com/ceph/ceph/pull/10698

Ramesh, can you give it a try?

> > > I suspect that we also have long startup times because we're doing
> > > the same underlying bitmap operations except they come from the
> > > BlueFS replay code instead of the BlueFS initialization code, but
> > > same problem with likely the same fix.
> >
> > BlueFS doesn't touch the FreelistManager (or explicitly persist the
> > freelist at all)... we initialize the in-memory Allocator state from
> > the metadata in the bluefs log.  I think we should be fine on this end.
>
> Likely that code suffers from the same problem -- a false need to
> update the KV Store (During the playback, BlueFS extents are converted
> to bitmap runs, it's essentially the same lower level code as the case
> > we're seeing now, but instead of being driven by an artificial "big
> > run", it'll be driven from the BlueFS Journal replay code). But
> that's just a guess, I don't have time to track down the actual code right now.

BlueFS can't touch the freelist (or kv store, ever) since it ultimately backs the kv store and that would be problematic.  We do initialize the bluefs Allocator's in-memory state, but that's it.

The PR above changes BlueStore::_init_alloc() so that BlueStore's Allocator state is initialized with both the freelist state (from the kv store)
*and* the bluefs_extents list (from the bluestore superblock).  (From this Allocator's perspective, all of bluefs's space is allocated and can't be used.  BlueFS has its own separate instance to do its internal
allocations.)
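
As a rough sketch of that initialization order (not the code in the PR): the in-memory allocator first takes the free extents recorded in the freelist, then carves out whatever bluefs owns. init_add_free is the name seen in the log earlier in the thread; ToyAllocator, Extent, init_rm_free, init_alloc and the map-based representation are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Hypothetical in-memory allocator state: free space as offset -> length.
class ToyAllocator {
 public:
  void init_add_free(uint64_t off, uint64_t len) { free_[off] = len; }

  // Remove [off, off + len) from the free space. This sketch assumes the
  // range lies entirely inside one free extent and ignores it otherwise.
  void init_rm_free(uint64_t off, uint64_t len) {
    auto it = free_.upper_bound(off);
    if (it == free_.begin()) return;   // no free extent starts at or before off
    --it;
    const uint64_t fo = it->first;
    const uint64_t fl = it->second;
    if (off + len > fo + fl) return;   // not fully contained
    free_.erase(it);
    if (off > fo) free_[fo] = off - fo;                               // left remainder
    if (off + len < fo + fl) free_[off + len] = fo + fl - (off + len); // right remainder
  }

  bool is_free(uint64_t off) const {
    auto it = free_.upper_bound(off);
    if (it == free_.begin()) return false;
    --it;
    return off < it->first + it->second;
  }

  uint64_t total_free() const {
    uint64_t t = 0;
    for (const auto& kv : free_) t += kv.second;
    return t;
  }

 private:
  std::map<uint64_t, uint64_t> free_;
};

// Seed from the freelist, then treat bluefs's space as allocated.
// No kv transaction is involved; only in-memory state changes.
void init_alloc(ToyAllocator& alloc,
                const std::vector<Extent>& freelist_free,
                const std::vector<Extent>& bluefs_extents) {
  for (const auto& e : freelist_free) alloc.init_add_free(e.offset, e.length);
  for (const auto& e : bluefs_extents) alloc.init_rm_free(e.offset, e.length);
}
```

The design point this is meant to show is that the bluefs-owned ranges never need to be written into the freelist for the Allocator to treat them as off limits.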

sage
