Linux-ext4 Archive on
 help / color / Atom feed
From: Daniel Phillips <>
To: "Theodore Y. Ts'o" <>
	OGAWA Hirofumi <>
Subject: Re: [RFC] Thing 1: Shardmap fox Ext4
Date: Sun, 1 Dec 2019 17:45:05 -0800
Message-ID: <> (raw)
In-Reply-To: <>

On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> (3) It's not particularly well documented...

We regard that as an issue needing attention. Here is a pretty picture
to get started:

This needs some explaining. The bottom part of the directory file is
a simple linear range of directory blocks, with a freespace map block
appearing once every 4K blocks or so. This freespace mapping needs a
post of its own, it is somewhat subtle. This will be a couple of posts
in the future.

The Shardmap index appears at a higher logical address, sufficiently
far above the directory base to accommodate a reasonable number of
record entry blocks below it. We try not to place the index at so high
an address that the radix tree gets extra levels, slowing everything

When the index needs to be expanded, either because some shard exceeded
a threshold number of entries, or the record entry blocks ran into the
the bottom of the index, then a new index tier with more shards is
created at a higher logical address. The lower index tier is not copied
immediately to the upper tier, but rather, each shard is incrementally
split when it hits the threshold because of an insert. This bounds the
latency of any given insert to the time needed to split one shard, which
we target nominally at less than one millisecond. Thus, Shardmap takes a
modest step in the direction of real time response.

Each index tier is just a simple array of shards, each of which fills
up with 8 byte entries from bottom to top. The count of entries in each
shard is stored separately in a table just below the shard array. So at
shard load time, we can determine rapidly from the count table which
tier a given shard belongs to. There are other advantages to breaking
the shard counts out separately having to do with the persistent memory
version of Shardmap, interesting details that I will leave for later.

When all lower tier shards have been deleted, the lower tier may be
overwritten by the expanding record entry block region. In practice,
a Shardmap file normally has just one tier most of the time, the other
tier existing only long enough to complete the incremental expansion
of the shard table, insert by insert.

There is a small header in the lowest record entry block, giving the
positions of the one or two index tiers, count of entry blocks, and
various tuning parameters such as maximum shard size and average depth
of cache hash collision lists.

That is it for media format. Very simple, is it not? My next post
will explain the Shardmap directory block format, with a focus on
deficiencies of the traditional Ext2 format that were addressed.



  parent reply index

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-27  1:47 Daniel Phillips
2019-11-27  7:40 ` Vyacheslav Dubeyko
2019-11-27  8:28   ` Daniel Phillips
2019-11-27 19:35     ` Viacheslav Dubeyko
2019-11-28  2:54       ` Daniel Phillips
2019-11-28  9:15         ` Andreas Dilger
2019-11-28 10:03           ` Daniel Phillips
2019-11-27 14:25 ` Theodore Y. Ts'o
2019-11-27 22:27   ` Daniel Phillips
2019-11-28  2:28     ` Theodore Y. Ts'o
2019-11-28  4:27       ` Daniel Phillips
2019-11-30 17:50         ` Theodore Y. Ts'o
2019-12-01  8:21           ` Daniel Phillips
2019-12-04 18:31             ` Andreas Dilger
2019-12-04 21:44               ` Daniel Phillips
2019-12-05  0:36                 ` Andreas Dilger
2019-12-05  2:27                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
2019-12-06  1:16                 ` Dave Chinner
2019-12-06  5:09                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-08 22:42                     ` Dave Chinner
2019-11-28 21:17       ` [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
2019-12-08 10:25       ` Daniel Phillips
2019-12-02  1:45   ` Daniel Phillips [this message]
2019-12-04 15:55     ` Vyacheslav Dubeyko
2019-12-05  9:46       ` Daniel Phillips
2019-12-06 11:47         ` Vyacheslav Dubeyko
2019-12-07  0:46           ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
2019-12-04 20:47       ` Daniel Phillips
2019-12-04 20:53         ` Daniel Phillips
2019-12-05  5:59           ` Daniel Phillips

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ext4 Archive on

Archives are clonable:
	git clone --mirror linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ \
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone