From: Igor Fedotov
Subject: Re: 2 related bluestore questions
Date: Tue, 10 May 2016 15:17:58 +0300
Message-ID: <6168022b-e3c0-b8f2-e8c7-3b4b82f9dc6e@mirantis.com>
To: Sage Weil, allen.samuels@sandisk.com
Cc: ceph-devel@vger.kernel.org

Hi Sage,

Please find my comments below.

WRT 1, there is an alternative approach that doesn't need a persistent refmap. It works for non-shared bnodes only, though.

In fact one can build such a refmap from the onode's lextent map pretty easily. It looks like any procedure that requires such a refmap takes a logical offset as an input. This gives us the lextent that refers to the blob we need the refmap for. To build the blob's refmap we then enumerate the lextents within a +-max_blob_size range of the original loffset. I expect we will avoid small lextent entries most of the time by merging them, so such an enumeration should be short enough.

Most probably such a refmap build is needed only for the background WAL procedure (or its replacement, see below), so it wouldn't affect primary write path performance. And that procedure will require some neighboring lextent enumeration to detect lextents to merge anyway.

Actually I don't have a strong opinion on which approach is better. Just a minor point: tracking a persistent refmap is a bit more complex and space consuming.

WRT 2, IMO single-byte granularity is OK.
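Going back to point 1, the on-the-fly refmap construction could look roughly like the sketch below. The types (`Lextent`, `LextentMap`, `RefMap`) and the function name are hypothetical simplifications for illustration, not the actual BlueStore structures:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-ins for BlueStore's structures.
struct Lextent {
    uint64_t blob_id;     // which blob this logical extent points into
    uint64_t blob_offset; // offset within the blob
    uint64_t length;
};

// Logical offset -> lextent, as in the onode's lextent map.
using LextentMap = std::map<uint64_t, Lextent>;

// Blob-internal refmap: blob offset -> reference count. A real refmap
// would track byte ranges; counting per start offset keeps the sketch short.
using RefMap = std::map<uint64_t, uint32_t>;

// Build a refmap for `blob_id` by scanning only the lextents within
// +-max_blob_size of `loffset` (the operation's logical offset), instead
// of maintaining a persistent extent_ref_map_t. This works for non-shared
// bnodes only, where one onode's lextents are the sole referrers.
RefMap build_refmap(const LextentMap& lmap, uint64_t blob_id,
                    uint64_t loffset, uint64_t max_blob_size) {
    RefMap refmap;
    uint64_t lo = loffset > max_blob_size ? loffset - max_blob_size : 0;
    uint64_t hi = loffset + max_blob_size;
    // Find the first lextent that could overlap [lo, hi); the previous
    // one may extend into the range, so step back once if possible.
    auto it = lmap.lower_bound(lo);
    if (it != lmap.begin())
        --it;
    for (; it != lmap.end() && it->first < hi; ++it) {
        if (it->second.blob_id != blob_id)
            continue;
        refmap[it->second.blob_offset] += 1;
    }
    return refmap;
}
```

Since the scan is bounded by the blob size rather than the object size, the cost stays proportional to the number of lextents near the write, which is what makes it viable in the background path.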
The initial write request handling can create lextents of any size, depending on the input data blocks. But we will try to eliminate small ones during WAL processing to get larger extents and better space usage.

WRT the WAL changes. My idea is to replace the WAL with a somewhat different extent merge process (defragmenter, garbage collector, space optimizer, whatever name you prefer). The main difference: the current WAL implementation tracks some user data and is thus part of the consistency model (i.e. one has to check whether a data block is in the WAL). In my approach data is always consistent without such a service. At the first write handling step we always write data to the store by allocating a new blob and modifying the lextent map, and apply the corresponding checksum by regular means if needed. Thus we always have consistent data in the lextent/blob structures, and the defragmenter is just a cleanup/optimization thread that merges sparse lextents to improve space utilization.

To avoid full lextent map enumeration during defragmentation, the ExtentManager (or whatever entity handles writes) may return 'hints' indicating where space optimization should be applied. This happens during initial write processing. Such a hint is most probably just a logical offset or some interval within the object's logical space. The write handler provides a hint when it detects (by lextent map inspection) that optimization is required, e.g. on a partial lextent overwrite, a big hole punch, sparse small lextents, etc. Pending optimization tasks (the list of hints) are maintained by BlueStore and passed to the EM (or another corresponding entity) for processing in the context of a specific thread. Based on such hints the defragmenter locates the lextents to merge and does the job: read/modify/write of multiple lextents and/or blobs. Optionally this can be done with some delay, to accommodate write bursts within a specific object region.
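The hint-list/defragmenter flow above might be sketched as follows. All names and types here are hypothetical simplifications; the real ExtentManager/BlueStore interfaces would differ, and a real defragmenter would also read/modify/write blob data rather than only folding the map:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <iterator>
#include <map>

// Hypothetical simplified types; names are illustrative only.
struct Lextent {
    uint64_t blob_id;
    uint64_t blob_offset;
    uint64_t length;
};
using LextentMap = std::map<uint64_t, Lextent>; // logical offset -> lextent

// In-memory hint list: logical offsets where optimization may pay off.
// No KV persistence needed -- losing it on OSD restart only costs some
// space efficiency, never consistency.
std::deque<uint64_t> pending_hints;

// Write-handler side: record a hint when inspection of the lextent map
// shows e.g. a partial overwrite or a cluster of small lextents.
void note_optimization_hint(uint64_t loffset) {
    pending_hints.push_back(loffset);
}

// Defragmenter side: merge lextents that are adjacent both logically
// and within the same blob, starting near the hinted offset.
void defragment_around(LextentMap& lmap, uint64_t hint) {
    auto it = lmap.lower_bound(hint);
    if (it != lmap.begin())
        --it;
    while (it != lmap.end()) {
        auto next = std::next(it);
        if (next == lmap.end())
            break;
        bool logically_adjacent =
            it->first + it->second.length == next->first;
        bool blob_adjacent =
            it->second.blob_id == next->second.blob_id &&
            it->second.blob_offset + it->second.length ==
                next->second.blob_offset;
        if (logically_adjacent && blob_adjacent) {
            it->second.length += next->second.length;
            lmap.erase(next);
            // stay on `it`; further lextents may now be mergeable
        } else {
            it = next;
        }
    }
}
```

The point of the sketch is the division of labor: the write path only appends cheap hints, and all map walking and merging happens later in the optimizer thread.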
Another point: the hint list can potentially be tracked without the KV store (some in-memory data structure is enough), since there is no mandatory need to replay it after an OSD failure. Data at the store is always consistent, and a failure can lead only to some local space inefficiency. That's a rare case, though.

What do you think about this approach?

Thanks,
Igor

On 09.05.2016 21:31, Sage Weil wrote:
> 1. In 7fb649a3800a5653f5f7ddf942c53503f88ad3f1 I added an extent_ref_map_t
> to the blob_t.  This lets us keep track, for each blob, of references to
> the logical blob extents (in addition to the raw num_refs that just counts
> how many lextent_t's point to us).  It will let us make decisions about
> deallocating unused portions of the blob that are no longer referenced
> (e.g., when we are uncompressed).  It will also let us sanely reason
> about whether we can write into the blob's allocated space that is not
> referenced (e.g., past end of object/file, but within a min_alloc_size
> chunk).
>
> The downside is that it's a bit more metadata to maintain.  OTOH, we need
> it in many cases, and it would be slow/tedious to create it on the fly.
>
> I think yes, though some minor changes to the current extent_ref_map_t are
> needed, since it currently has weird assumptions about empty meaning a ref
> count of 1.
>
> 2. Allow lextent_t's to be byte-granularity.
>
> For example, if we write 10 bytes into the object, we'd have a blob of
> min_alloc_size, and an lextent_t that indicates [0,10) points to that
> blob.
>
> The upside here is that truncate and zero are trivial updates to the
> lextent map and never need to do any IO--we just punch holes in our
> mapping.
>
> The downside is that we might get odd mappings like
>
> 0: 0~10->1
> 4000: 4000~96->1
>
> after a hole (10~3990) has been punched, and we may need to piece the
> mapping back together.
> I think we will need most of this complexity
> (e.g., merging adjacent lextents that map to adjacent regions of the same
> blob) anyway.
>
> Hmm, there is probably some other downside but now I can't think of a good
> reason not to do this.  It'll basically put all of the onus on the write
> code to do the right thing... which is probably a good thing.
>
> Yes?
>
>
> Also, one note on the WAL changes: we'll need to have any read portion of
> a wal event include the raw pextents *and* the associated checksum(s).
> This is because the events need to be idempotent and may overwrite the
> read region, or interact with wal ops that come before/after.
>
> sage
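The byte-granularity hole punch from the quoted example (punching 10~3990 out of a 4096-byte mapping, leaving 0: 0~10 and 4000: 4000~96) can be illustrated as a pure metadata update over a toy lextent map. Types and names here are simplified stand-ins, not the real lextent_t; the point is that no data IO is needed:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Toy byte-granularity lextent map, keyed by logical offset; values are
// (blob offset, length) into a single implicit blob.
struct Lextent { uint64_t blob_offset, length; };
using LextentMap = std::map<uint64_t, Lextent>;

// Punch [off, off+len) out of the map: lextent-map surgery only, no IO,
// matching the "truncate and zero are trivial updates" property.
void punch_hole(LextentMap& lmap, uint64_t off, uint64_t len) {
    uint64_t end = off + len;
    auto it = lmap.lower_bound(off);
    if (it != lmap.begin()) {
        auto prev = std::prev(it);
        uint64_t pend = prev->first + prev->second.length;
        if (pend > off) {
            // Previous lextent overlaps the hole: trim its tail...
            if (pend > end) {
                // ...and if the hole is strictly inside it, keep the piece
                // past the hole as a new lextent.
                lmap[end] = {prev->second.blob_offset + (end - prev->first),
                             pend - end};
            }
            prev->second.length = off - prev->first;
        }
    }
    while (it != lmap.end() && it->first < end) {
        uint64_t lend = it->first + it->second.length;
        if (lend <= end) {
            it = lmap.erase(it); // fully inside the hole
        } else {
            // Overlaps the hole's end: re-key its surviving head at `end`.
            Lextent tail = {it->second.blob_offset + (end - it->first),
                            lend - end};
            lmap.erase(it);
            lmap[end] = tail;
            break;
        }
    }
}
```

Stitching such fragments back together is then exactly the merge-adjacent-lextents work discussed above, done lazily by the optimizer rather than on the write path.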