On Dec 1, 2019, at 6:45 PM, Daniel Phillips wrote: > > On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote: >> (3) It's not particularly well documented... > > We regard that as an issue needing attention. Here is a pretty picture > to get started: > > https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format The shardmap diagram is good conceptually, but it would be useful to add a legend on the empty side of the diagram that shows the on-disk structures. > > This needs some explaining. The bottom part of the directory file is > a simple linear range of directory blocks, with a freespace map block > appearing once every 4K blocks or so. This freespace mapping needs a > post of its own, it is somewhat subtle. This will be a couple of posts > in the future. > > The Shardmap index appears at a higher logical address, sufficiently > far above the directory base to accommodate a reasonable number of > record entry blocks below it. We try not to place the index at so high > an address that the radix tree gets extra levels, slowing everything > down. > > When the index needs to be expanded, either because some shard exceeded > a threshold number of entries, or the record entry blocks ran into the > the bottom of the index, then a new index tier with more shards is > created at a higher logical address. The lower index tier is not copied > immediately to the upper tier, but rather, each shard is incrementally > split when it hits the threshold because of an insert. This bounds the > latency of any given insert to the time needed to split one shard, which > we target nominally at less than one millisecond. Thus, Shardmap takes a > modest step in the direction of real time response. > > Each index tier is just a simple array of shards, each of which fills > up with 8 byte entries from bottom to top. The count of entries in each > shard is stored separately in a table just below the shard array. So at > shard load time, we can determine rapidly from the count table which > tier a given shard belongs to. There are other advantages to breaking > the shard counts out separately having to do with the persistent memory > version of Shardmap, interesting details that I will leave for later. > > When all lower tier shards have been deleted, the lower tier may be > overwritten by the expanding record entry block region. In practice, > a Shardmap file normally has just one tier most of the time, the other > tier existing only long enough to complete the incremental expansion > of the shard table, insert by insert. > > There is a small header in the lowest record entry block, giving the > positions of the one or two index tiers, count of entry blocks, and > various tuning parameters such as maximum shard size and average depth > of cache hash collision lists. > > That is it for media format. Very simple, is it not? My next post > will explain the Shardmap directory block format, with a focus on > deficiencies of the traditional Ext2 format that were addressed. > > Regards, > > Daniel Cheers, Andreas