On Dec 1, 2019, at 6:45 PM, Daniel Phillips <daniel@phunq.net> wrote:
> 
> On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
>> (3) It's not particularly well documented...
> 
> We regard that as an issue needing attention. Here is a pretty picture
> to get started:
> 
>   https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

The shardmap diagram is good conceptually, but it would be useful
to add a legend on the empty side of the diagram that shows the on-disk
structures.

> 
> This needs some explaining. The bottom part of the directory file is
> a simple linear range of directory blocks, with a freespace map block
> appearing once every 4K blocks or so. This freespace mapping needs a
> post of its own, it is somewhat subtle. This will be a couple of posts
> in the future.
> 
> The Shardmap index appears at a higher logical address, sufficiently
> far above the directory base to accommodate a reasonable number of
> record entry blocks below it. We try not to place the index at so high
> an address that the radix tree gets extra levels, slowing everything
> down.
> 
> When the index needs to be expanded, either because some shard exceeded
> a threshold number of entries, or the record entry blocks ran into the
> the bottom of the index, then a new index tier with more shards is
> created at a higher logical address. The lower index tier is not copied
> immediately to the upper tier, but rather, each shard is incrementally
> split when it hits the threshold because of an insert. This bounds the
> latency of any given insert to the time needed to split one shard, which
> we target nominally at less than one millisecond. Thus, Shardmap takes a
> modest step in the direction of real time response.
> 
> Each index tier is just a simple array of shards, each of which fills
> up with 8 byte entries from bottom to top. The count of entries in each
> shard is stored separately in a table just below the shard array. So at
> shard load time, we can determine rapidly from the count table which
> tier a given shard belongs to. There are other advantages to breaking
> the shard counts out separately having to do with the persistent memory
> version of Shardmap, interesting details that I will leave for later.
> 
> When all lower tier shards have been deleted, the lower tier may be
> overwritten by the expanding record entry block region. In practice,
> a Shardmap file normally has just one tier most of the time, the other
> tier existing only long enough to complete the incremental expansion
> of the shard table, insert by insert.
> 
> There is a small header in the lowest record entry block, giving the
> positions of the one or two index tiers, count of entry blocks, and
> various tuning parameters such as maximum shard size and average depth
> of cache hash collision lists.
> 
> That is it for media format. Very simple, is it not? My next post
> will explain the Shardmap directory block format, with a focus on
> deficiencies of the traditional Ext2 format that were addressed.
> 
> Regards,
> 
> Daniel


Cheers, Andreas