From: Igor Fedotov <ifedotov@mirantis.com>
To: Sage Weil <sweil@redhat.com>, allen.samuels@sandisk.com
Cc: ceph-devel@vger.kernel.org
Subject: Re: 2 related bluestore questions
Date: Tue, 10 May 2016 15:17:58 +0300
Message-ID: <6168022b-e3c0-b8f2-e8c7-3b4b82f9dc6e@mirantis.com>
In-Reply-To: <alpine.DEB.2.11.1605091417590.336@cpach.fuggernut.com>
Hi Sage,
Please find my comments below.

WRT 1: there is an alternative approach that doesn't need a persistent
refmap. It works only for a non-shared bnode, though. In fact, one can
build such a refmap from the onode's lextent map quite easily. It looks
like any procedure that requires such a refmap takes a logical offset as
input. That offset identifies the lextent referring to the blob we need
the refmap for. To build the blob's refmap we then enumerate the lextents
within +-max_blob_size of the original loffset. I expect we will avoid
small lextent entries most of the time by merging them, so such an
enumeration should be short. Most probably the refmap is needed only by
the background WAL procedure (or its replacement, see below), so building
it wouldn't affect primary write path performance. And that procedure
will have to enumerate neighboring lextents anyway to detect lextents to
merge.
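
To make the enumeration concrete, here is a minimal Python sketch of
building a refmap on the fly. It is an illustration only, not BlueStore
code: the LExtent tuple, build_blob_refmap() and the flat list used as the
lextent map are hypothetical names, and a non-shared bnode is assumed.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for an lextent: logical offset, length,
# id of the blob it points to, and the offset inside that blob.
LExtent = namedtuple("LExtent", "loffset length blob_id blob_offset")


def build_blob_refmap(lextents, loffset, max_blob_size):
    """Return {(blob_offset, length): refcount} for the blob mapped at loffset.

    Only lextents within +-max_blob_size of loffset are considered, which is
    enough when no other object can reference the blob (assumption).
    """
    # Find the lextent that covers the given logical offset.
    target = next((e for e in lextents
                   if e.loffset <= loffset < e.loffset + e.length), None)
    if target is None:
        return {}  # loffset falls into a hole: no blob, no refmap

    lo, hi = loffset - max_blob_size, loffset + max_blob_size
    refmap = {}
    for e in lextents:
        # Skip lextents entirely outside the search window.
        if e.loffset + e.length <= lo or e.loffset >= hi:
            continue
        # Skip lextents that point to some other blob.
        if e.blob_id != target.blob_id:
            continue
        key = (e.blob_offset, e.length)
        refmap[key] = refmap.get(key, 0) + 1
    return refmap
```

A real lextent map would be an ordered map, so the window could be found
with a range lookup instead of the linear scan used here.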

Actually I don't have a strong opinion on which approach is better. Just
a minor point: tracking a persistent refmap is a bit more complex and
space-consuming.

WRT 2: IMO single-byte granularity is OK. Initial write request handling
can create lextents of any size depending on the input data blocks. But
we will try to eliminate the small ones during WAL processing to get
larger extents and better space usage.
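
As an illustration of the simplest kind of merge, the Python sketch below
coalesces lextents that are logically adjacent and also map to adjacent
regions of the same blob (the case mentioned in the quoted message). It is
metadata-only and does not cover merges that have to rewrite data. The
tuple layout and merge_adjacent() are hypothetical, not BlueStore types.

```python
def merge_adjacent(lextents):
    """Coalesce neighbouring lextents that continue each other.

    lextents: iterable of (loffset, length, blob_id, blob_offset) tuples.
    Two entries are merged only if they are adjacent both in the object's
    logical space and inside the same blob.
    """
    merged = []
    for loff, length, blob, boff in sorted(lextents):
        if merged:
            p_loff, p_len, p_blob, p_boff = merged[-1]
            if (p_blob == blob
                    and p_loff + p_len == loff     # logically adjacent
                    and p_boff + p_len == boff):   # adjacent within the blob
                merged[-1] = (p_loff, p_len + length, blob, p_boff)
                continue
        merged.append((loff, length, blob, boff))
    return merged
```

Lextents separated by a hole, or pointing to different blobs, are left
untouched; bringing those together needs the read/modify/write step
described below.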

WRT WAL changes: my idea is to replace the WAL with a somewhat different
extent merge process (defragmenter, garbage collector, space optimizer -
whatever name you prefer). The main difference: the current WAL
implementation tracks some user data and thus is part of the consistency
model (i.e. one has to check whether a data block is in the WAL). In my
approach data is always consistent without such a service. At the first
write handling step we always write data to the store by allocating a new
blob and modifying the lextent map, and apply the corresponding checksum
by regular means if needed. Thus we always have consistent data in the
lextent/blob structures, and the defragmenter is just a
cleanup/optimization thread that merges sparse lextents to improve space
utilization.

To avoid full lextent map enumeration during defragmentation,
ExtentManager (or whatever entity handles writes) may return 'hints'
about where space optimization should be applied. This is done during
initial write processing. Such a hint is most probably just a logical
offset or an interval within the object's logical space. The write
handler provides a hint if it detects (by inspecting the lextent map)
that optimization is required, e.g. in case of a partial lextent
overwrite, a big hole punch, sparse small lextents etc. Pending
optimization tasks (a list of hints) are maintained by BlueStore and
passed to EM (or another corresponding entity) for processing in the
context of a specific thread. Based on such hints the defragmenter
locates the lextents to merge and does the job: read/modify/write of
multiple lextents and/or blobs. Optionally this can be done with some
delay to absorb a write burst within a specific object region.

Another point is that the hint list can potentially be tracked without
the KV store (an in-memory data structure is enough), as there is no
mandatory need to replay it after an OSD failure: data is always
consistent in the store, and a failure can only lead to some local space
inefficiency. That's a rare case though.
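
A rough Python sketch of the hint list, to show how little state it needs.
Everything here is an assumption made for illustration: HintQueue,
needs_hint() and the (object_id, loffset, length) hint layout are
hypothetical, and only the partial-overwrite trigger is shown.

```python
import heapq
import time


class HintQueue:
    """In-memory list of pending optimization hints (never persisted)."""

    def __init__(self, delay=0.0):
        self.delay = delay  # seconds to wait before a hint becomes due
        self._heap = []     # entries: (due_time, object_id, loffset, length)

    def add(self, object_id, loffset, length, now=None):
        """Called by the write handler when it spots a region to optimize."""
        now = time.monotonic() if now is None else now
        heapq.heappush(self._heap,
                       (now + self.delay, object_id, loffset, length))

    def pop_due(self, now=None):
        """Called by the defragmenter thread; returns the hints that are due."""
        now = time.monotonic() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, object_id, loffset, length = heapq.heappop(self._heap)
            due.append((object_id, loffset, length))
        return due


def needs_hint(old_extent, write_off, write_len):
    """True if a write only partially overwrites an existing lextent.

    old_extent is an (loffset, length) pair taken from the lextent map.
    """
    e_off, e_len = old_extent
    w_end, e_end = write_off + write_len, e_off + e_len
    overlaps = write_off < e_end and e_off < w_end
    covers = write_off <= e_off and e_end <= w_end
    return overlaps and not covers
```

The delay gives a write burst on the same region time to settle before the
defragmenter touches it. If the OSD restarts, the queue is simply lost,
which costs some space efficiency but not correctness.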
What do you think about this approach?
Thanks,
Igor
On 09.05.2016 21:31, Sage Weil wrote:
> 1. In 7fb649a3800a5653f5f7ddf942c53503f88ad3f1 I added an extent_ref_map_t
> to the blob_t. This lets us keep track, for each blob, of references to
> the logical blob extents (in addition to the raw num_refs that just counts
> how many lextent_t's point to us). It will let us make decisions about
> deallocating unused portions of the blob that are no longer referenced
> (e.g., when we are uncompressed). It will also let us sanely reason
> about whether we can write into the blob's allocated space that is not
> referenced (e.g., past end of object/file, but within a min_alloc_size
> chunk).
>
> The downside is that it's a bit more metadata to maintain. OTOH, we need
> it in many cases, and it would be slow/tedious to create it on the fly.
>
> I think yes, though some minor changes to the current extent_ref_map_t are
> needed, since it currently has weird assumptions about empty meaning a ref
> count of 1.
>
> 2. Allow lextent_t's to be byte-granularity.
>
> For example, if we write 10 bytes into the object, we'd have a blob of
> min_alloc_size, and an lextent_t that indicates [0,10) points to that
> blob.
>
> The upside here is that truncate and zero are trivial updates to the
> lextent map and never need to do any IO--we just punch holes in our
> mapping.
>
> The downside is that we might get odd mappings like
>
> 0: 0~10->1
> 4000: 4000~96->1
>
> after a hole (10~3990) has been punched, and we may need to piece the
> mapping back together. I think we will need most of this complexity
> (e.g., merging adjacent lextents that map to adjacent regions of the same
> blob) anyway.
>
> Hmm, there is probably some other downside but now I can't think of a good
> reason not to do this. It'll basically put all of the onus on the write
> code to do the right thing... which is probably a good thing.
>
> Yes?
>
>
> Also, one note on the WAL changes: we'll need to have any read portion of
> a wal event include the raw pextents *and* the associated checksum(s).
> This is because the events need to be idempotent and may overwrite the
> read region, or interact with wal ops that come before/after.
>
> sage