From mboxrd@z Thu Jan 1 00:00:00 1970
From: Igor Fedotov
Subject: Re: 2 related bluestore questions
Date: Tue, 10 May 2016 17:41:46 +0300
Message-ID:
References: <6168022b-e3c0-b8f2-e8c7-3b4b82f9dc6e@mirantis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-lf0-f52.google.com ([209.85.215.52]:33857 "EHLO
	mail-lf0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752439AbcEJOlu (ORCPT ); Tue, 10 May 2016 10:41:50 -0400
Received: by mail-lf0-f52.google.com with SMTP id m64so17421937lfd.1
	for ; Tue, 10 May 2016 07:41:49 -0700 (PDT)
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Sage Weil
Cc: allen.samuels@sandisk.com, ceph-devel@vger.kernel.org

On 10.05.2016 15:53, Sage Weil wrote:
> On Tue, 10 May 2016, Igor Fedotov wrote:
>> Hi Sage,
>> Please find my comments below.
>>
>> WRT 1. there is an alternative approach that doesn't need a persistent refmap.
>> It works for a non-shared bnode only though. In fact one can build such a refmap
>> from the onode's lextent map pretty easily. It looks like any procedure that
>> requires such a refmap has a logical offset as an input. This provides an
>> appropriate lextent referring to the blob we need the refmap for. What we need to
>> do to build the blob's refmap is to enumerate lextents within a +-max_blob_size
>> range from the original loffset. I suppose we are going to avoid small lextent
>> entries most of the time by merging them, thus such an enumeration should be short
>> enough. Most probably such a refmap build is needed for the background wal procedure
>> (or its replacement - see below), thus it wouldn't affect primary write path
>> performance. And this procedure will require some neighboring lextent
>> enumeration to detect lextents to merge anyway.
>>
>> Actually I don't have a strong opinion on which approach is better.
>> Just a minor point that tracking a persistent refmap is a bit more complex and
>> space consuming.
> Yeah, that's my only real concern--and mostly on the memory allocation
> side, less so on the size of the encoded metadata. Since the alternative
> only works in the non-shared bnode case, I think it'll be simpler to only
> implement one approach for now, and consider optimizing later, since we'd
> have to implement the share-capable approach either way. (For example,
> most blobs will have one reference for their full range; we could probably
> represent this as an empty map with a bit of care.)

So the initial approach is to have the refmap, right?

>> WRT 2. IMO single byte granularity is OK. Initial write request handling
>> can create lextents of any size depending on the input data blocks. But we
>> will try to eliminate small ones during wal processing to have larger extents
>> and better space usage though.
> Ok cool.
>
>> WRT WAL changes. My idea is to replace WAL with a somewhat different extent merge
>> (defragmenter, garbage collector, space optimizer - whatever name of your
>> choice) process. The main difference - the current WAL implementation tracks some
>> user data and thus it's a part of the consistency model (i.e. one has to check
>> if a data block is in the WAL). In my approach data is always consistent without
>> such a service. At the first write handling step we always write data to the
>> store by allocating a new blob and modifying the lextent map. And apply
>> the corresponding checksum using regular means if needed. Thus we always have
>> consistent data in lextent/blob structures. And the defragmenter process is just a
>> cleanup/optimization thread that merges sparse lextents to improve space
>> utilization. To avoid full lextent map enumeration during defragmentation
>> ExtentManager (or whatever entity handles writes) may return some 'hints'
>> where space optimization should be applied. This is to be done at initial
>> write processing.
>> Such a hint is most probably just a logical offset or some
>> interval within the object's logical space. The write handler provides such a hint if it
>> detects (by lextent map inspection) that optimization is required, e.g. in
>> case of a partial lextent overwrite, a big hole punch, sparse small lextents etc.
>> Pending optimization tasks (a list of hints) are maintained by the BlueStore and
>> passed to EM (or another corresponding entity) for processing in the context
>> of a specific thread. Based on such hints the defragmenter locates lextents to
>> merge and does the job: Read/Modify/Write multiple lextents and/or blobs.
>> Optionally this can be done with some delay to handle a write burst within a
>> specific object region. Another point is that the hint list can potentially be
>> tracked without the KV store (some in-memory data structure is enough) as there is
>> no mandatory need for its replay in case of OSD failure - data are always
>> consistent at the store and a failure can lead to some local space
>> inefficiency only. That's a rare case though.
>>
>> What do you think about this approach?
> My concern is that it makes a simple overwrite less IO efficient because
> you have to (1) write a new (temporary-ish) blob, (2) commit the kv
> transaction, and then (3) write an updated/merged blob, then (4) commit
> the kv txn for new blob.

Yes, that's true. But there are some concerns about the WAL case as well:
1) Are you sure that writing a larger KV record (metadata + user data) is
better than a direct data write to the store + a smaller KV (metadata only)
update?
2) Either WAL records will increase or we need to have both WAL and optimizer
simultaneously. Especially for the compressed case. As far as I understand,
currently a WAL record has up to block_size bytes of user data. With blob
introduction this rises to max_blob_size (N*min_alloc_size). Or we'll need
to maintain both WAL and optimizer.
E.g.
there is an lextent 0~256K and an overwrite 1K~254K, block size = 4K.
    For the case without checksum and compression WAL records are 2 * 3K.
    For the checksum case WAL records are
    2 * (max(csum_block_size, block_size) - 1K).
    For the compression case WAL records are
    2 * (max(max_blob_size, block_size) - 1K),
    or we do that temporary blob allocation.
3) WAL apply locks the subsequent read until its completion. I.e. a subsequent
read has to wait until WAL apply is completed (o->flush() call in
_do_read()). In case of the optimizer approach the lock can be postponed as
the optimizer doesn't need to perform the task immediately.

> And if I understand the proposal correctly any
> overwrite is still off-limits because you can't do the overwrite IO
> atomically with the kv commit. Is that right?

Could you please elaborate - not sure I understand the question.

> Making the wal part of the consistency model is more complex, but it means
> we can (1) log our intent to overwrite atomically with the kv txn commit,
> and then (2) do the async overwrite. It will get a bit more complex
> because we'll be doing metadata updates as part of the wal completion, but
> it's not a big step from where we are now, and I think the performance
> benefit will be worth it.

May I have an example of how it's supposed to work please?

> I think we'll still want a gc/cleanup/optimizer async process like you
> describe, but it can be driven by wal hints or whatever other mechanism we
> like.
>
> sage
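
To make the numbers in point 2) easier to check, here is a minimal sketch of the arithmetic, not BlueStore code. It assumes that "1K~254K" means offset~length, that a WAL record has to carry the new user data falling into the partially overwritten first and last units, and that a "unit" is whichever granularity must be rewritten as a whole (block, checksum block or blob). The function name wal_user_bytes and the 64K checksum block size are made up for illustration.

```python
K = 1024

def wal_user_bytes(offset, length, unit):
    """New user data (bytes) that lands in partially overwritten units.

    Hypothetical helper: `unit` is the granularity that has to be
    rewritten as a whole (block, checksum block or blob size).
    """
    end = offset + length
    first_unit = offset // unit
    last_unit = (end - 1) // unit
    # Bytes written into the first unit if the overwrite starts unaligned.
    head = min(end, (first_unit + 1) * unit) - offset if offset % unit else 0
    # Bytes written into the last unit if the overwrite ends unaligned.
    tail = end - max(offset, last_unit * unit) if end % unit else 0
    if first_unit == last_unit:
        # The whole overwrite sits inside one unit: count it once.
        return length if (head or tail) else 0
    return head + tail

# lextent 0~256K, overwrite 1K~254K
plain = wal_user_bytes(1 * K, 254 * K, 4 * K)    # block_size = 4K
csum = wal_user_bytes(1 * K, 254 * K, 64 * K)    # assumed csum_block_size = 64K
print(plain // K, csum // K)
```

With these assumptions the results agree with 2 * 3K and 2 * (max(csum_block_size, block_size) - 1K) as long as the overwrite spans more than one unit; when it fits inside a single unit the sketch counts the overwrite only once.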