All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] reftable: fix writing multi-level indices
@ 2024-01-26 10:31 Patrick Steinhardt
  2024-01-26 10:31 ` [PATCH 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-01-26 10:31 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1285 bytes --]

Hi,

yesterday I noticed that writing reftables with more than 250,000 refs
led to the last block of refs in the table not being seekable anymore.
As it turns out, 250k refs was the boundary at which we start to write
multi-level indices for refs.

The root cause is a bug with how we write multi-level indices: we did
not flush out the last block of the previous index level, and thus we
didn't index it in the higher-level index. This patch series fixes this
issue and also polishes surrounding code a bit.

The topic depends on Junio's jc/reftable-core-fsync at 1df18a1c9a
(reftable: honor core.fsync, 2024-01-23) due to a semantic merge
conflict in the newly added test.

Patrick

Patrick Steinhardt (5):
  reftable/reader: be more careful about errors in indexed seeks
  reftable/writer: use correct type to iterate through index entries
  reftable/writer: simplify writing index records
  reftable/writer: fix writing multi-level indices
  reftable: document reading and writing indices

 reftable/reader.c         | 30 ++++++++++++++++++
 reftable/readwrite_test.c | 56 ++++++++++++++++++++++++++++++++++
 reftable/writer.c         | 64 ++++++++++++++++++++++-----------------
 3 files changed, 123 insertions(+), 27 deletions(-)

-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] reftable/reader: be more careful about errors in indexed seeks
  2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
@ 2024-01-26 10:31 ` Patrick Steinhardt
  2024-01-26 10:31 ` [PATCH 2/5] reftable/writer: use correct type to iterate through index entries Patrick Steinhardt
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-01-26 10:31 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]

When doing an indexed seek we first need to do a linear seek in order to
find the index block for our wanted key. We do not check the returned
error of the linear seek though. This is likely not an issue because the
next call to `table_iter_next()` would return error, too. But it very
much is a code smell when an error variable is being assigned to without
actually checking it.

Safeguard the code by checking for errors.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/reader.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/reftable/reader.c b/reftable/reader.c
index 64dc366fb1..278f727a3d 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -509,6 +509,9 @@ static int reader_seek_indexed(struct reftable_reader *r,
 		goto done;
 
 	err = reader_seek_linear(&index_iter, &want_index);
+	if (err < 0)
+		goto done;
+
 	while (1) {
 		err = table_iter_next(&index_iter, &index_result);
 		table_iter_block_done(&index_iter);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] reftable/writer: use correct type to iterate through index entries
  2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
  2024-01-26 10:31 ` [PATCH 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
@ 2024-01-26 10:31 ` Patrick Steinhardt
  2024-01-26 10:31 ` [PATCH 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-01-26 10:31 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1718 bytes --]

The reftable writer is tracking the number of blocks it has to index via
the `index_len` variable. But while this variable is of type `size_t`,
some sites use an `int` to loop through the index entries.

Convert the code to consistently use `size_t`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/writer.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/reftable/writer.c b/reftable/writer.c
index 92935baa70..5a0b87b406 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -379,20 +379,21 @@ int reftable_writer_add_logs(struct reftable_writer *w,
 
 static int writer_finish_section(struct reftable_writer *w)
 {
+	struct reftable_block_stats *bstats = NULL;
 	uint8_t typ = block_writer_type(w->block_writer);
 	uint64_t index_start = 0;
 	int max_level = 0;
-	int threshold = w->opts.unpadded ? 1 : 3;
+	size_t threshold = w->opts.unpadded ? 1 : 3;
 	int before_blocks = w->stats.idx_stats.blocks;
-	int err = writer_flush_block(w);
-	int i = 0;
-	struct reftable_block_stats *bstats = NULL;
+	int err;
+
+	err = writer_flush_block(w);
 	if (err < 0)
 		return err;
 
 	while (w->index_len > threshold) {
 		struct reftable_index_record *idx = NULL;
-		int idx_len = 0;
+		size_t i, idx_len;
 
 		max_level++;
 		index_start = w->next;
@@ -630,11 +631,8 @@ int reftable_writer_close(struct reftable_writer *w)
 
 static void writer_clear_index(struct reftable_writer *w)
 {
-	int i = 0;
-	for (i = 0; i < w->index_len; i++) {
+	for (size_t i = 0; i < w->index_len; i++)
 		strbuf_release(&w->index[i].last_key);
-	}
-
 	FREE_AND_NULL(w->index);
 	w->index_len = 0;
 	w->index_cap = 0;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] reftable/writer: simplify writing index records
  2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
  2024-01-26 10:31 ` [PATCH 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
  2024-01-26 10:31 ` [PATCH 2/5] reftable/writer: use correct type to iterate through index entries Patrick Steinhardt
@ 2024-01-26 10:31 ` Patrick Steinhardt
  2024-01-31 13:44   ` Toon Claes
  2024-01-31 15:55   ` Kristoffer Haugsbakk
  2024-01-26 10:31 ` [PATCH 4/5] reftable/writer: fix writing multi-level indices Patrick Steinhardt
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-01-26 10:31 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1556 bytes --]

When finishing the current section we may end up writing index records
for the section to the table. The logic to do so essentially copies what
we already have in `writer_add_record()`, making this more complicated
than it really has to be.

Simplify the code by using `writer_add_record()` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/writer.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/reftable/writer.c b/reftable/writer.c
index 5a0b87b406..2525f236b9 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -405,6 +405,7 @@ static int writer_finish_section(struct reftable_writer *w)
 		w->index = NULL;
 		w->index_len = 0;
 		w->index_cap = 0;
+
 		for (i = 0; i < idx_len; i++) {
 			struct reftable_record rec = {
 				.type = BLOCK_TYPE_INDEX,
@@ -412,26 +413,14 @@ static int writer_finish_section(struct reftable_writer *w)
 					.idx = idx[i],
 				},
 			};
-			if (block_writer_add(w->block_writer, &rec) == 0) {
-				continue;
-			}
 
-			err = writer_flush_block(w);
+			err = writer_add_record(w, &rec);
 			if (err < 0)
 				return err;
-
-			writer_reinit_block_writer(w, BLOCK_TYPE_INDEX);
-
-			err = block_writer_add(w->block_writer, &rec);
-			if (err != 0) {
-				/* write into fresh block should always succeed
-				 */
-				abort();
-			}
 		}
-		for (i = 0; i < idx_len; i++) {
+
+		for (i = 0; i < idx_len; i++)
 			strbuf_release(&idx[i].last_key);
-		}
 		reftable_free(idx);
 	}
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] reftable/writer: fix writing multi-level indices
  2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2024-01-26 10:31 ` [PATCH 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
@ 2024-01-26 10:31 ` Patrick Steinhardt
  2024-01-26 10:31 ` [PATCH 5/5] reftable: document reading and writing indices Patrick Steinhardt
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-01-26 10:31 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4607 bytes --]

When finishing a section we will potentially write an index that makes
it more efficient to look up relevant blocks. The index records written
will encode, for each block of the indexed section, what the offset of
that block is as well as the last key of that block. Thus, the reader
would iterate through the index records to find the first key larger or
equal to the wanted key and then use the encoded offset to look up the
desired block.

When there are a lot of blocks to index though we may end up writing
multiple index blocks, too. To not require a linear search across all
index blocks we instead end up writing a multi-level index. Instead of
referring to the block we are after, an index record may point to
another index block. The reader will then access the highest-level index
and follow down the chain of index blocks until it hits the sought-after
block.

It has been observed though that it is impossible to seek ref records of
the last ref block when using a multi-level index. While the multi-level
index exists and looks fine for most of the part, the highest-level
index was missing an index record pointing to the last block of the next
index. Thus, every additional level made more refs become unseekable at
the end of the ref section.

The root cause is that we are not flushing the last block of the current
level once done writing the level. Consequently, it wasn't recorded in
the blocks that need to be indexed by the next-higher level and thus we
forgot about it.

Fix this bug by flushing blocks after we have written all index records.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/readwrite_test.c | 56 +++++++++++++++++++++++++++++++++++++++
 reftable/writer.c         |  8 +++---
 2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index 6b99daeaf2..853923397e 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -866,6 +866,61 @@ static void test_write_multiple_indices(void)
 	strbuf_release(&buf);
 }
 
+static void test_write_multi_level_index(void)
+{
+	struct reftable_write_options opts = {
+		.block_size = 100,
+	};
+	struct strbuf writer_buf = STRBUF_INIT, buf = STRBUF_INIT;
+	struct reftable_block_source source = { 0 };
+	struct reftable_iterator it = { 0 };
+	const struct reftable_stats *stats;
+	struct reftable_writer *writer;
+	struct reftable_reader *reader;
+	int err;
+
+	writer = reftable_new_writer(&strbuf_add_void, &noop_flush, &writer_buf, &opts);
+	reftable_writer_set_limits(writer, 1, 1);
+	for (size_t i = 0; i < 200; i++) {
+		struct reftable_ref_record ref = {
+			.update_index = 1,
+			.value_type = REFTABLE_REF_VAL1,
+			.value.val1 = {i},
+		};
+
+		strbuf_reset(&buf);
+		strbuf_addf(&buf, "refs/heads/%03" PRIuMAX, (uintmax_t)i);
+		ref.refname = buf.buf,
+
+		err = reftable_writer_add_ref(writer, &ref);
+		EXPECT_ERR(err);
+	}
+	reftable_writer_close(writer);
+
+	/*
+	 * The written refs should be sufficiently large to result in a
+	 * multi-level index.
+	 */
+	stats = reftable_writer_stats(writer);
+	EXPECT(stats->ref_stats.max_index_level == 2);
+
+	block_source_from_strbuf(&source, &writer_buf);
+	err = reftable_new_reader(&reader, &source, "filename");
+	EXPECT_ERR(err);
+
+	/*
+	 * Seeking the last ref should work as expected.
+	 */
+	err = reftable_reader_seek_ref(reader, &it, "refs/heads/199");
+	EXPECT_ERR(err);
+
+	reftable_iterator_destroy(&it);
+	reftable_writer_free(writer);
+	reftable_reader_free(reader);
+	strbuf_release(&writer_buf);
+	strbuf_release(&buf);
+}
+
 static void test_corrupt_table_empty(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -916,5 +971,6 @@ int readwrite_test_main(int argc, const char *argv[])
 	RUN_TEST(test_write_object_id_length);
 	RUN_TEST(test_write_object_id_min_length);
 	RUN_TEST(test_write_multiple_indices);
+	RUN_TEST(test_write_multi_level_index);
 	return 0;
 }
diff --git a/reftable/writer.c b/reftable/writer.c
index 2525f236b9..24fce5630a 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -419,15 +419,15 @@ static int writer_finish_section(struct reftable_writer *w)
 				return err;
 		}
 
+		err = writer_flush_block(w);
+		if (err < 0)
+			return err;
+
 		for (i = 0; i < idx_len; i++)
 			strbuf_release(&idx[i].last_key);
 		reftable_free(idx);
 	}
 
-	err = writer_flush_block(w);
-	if (err < 0)
-		return err;
-
 	writer_clear_index(w);
 
 	bstats = writer_reftable_block_stats(w, typ);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] reftable: document reading and writing indices
  2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2024-01-26 10:31 ` [PATCH 4/5] reftable/writer: fix writing multi-level indices Patrick Steinhardt
@ 2024-01-26 10:31 ` Patrick Steinhardt
  2024-01-26 16:26 ` [PATCH 0/5] reftable: fix writing multi-level indices Junio C Hamano
  2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
  6 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-01-26 10:31 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 3863 bytes --]

The way the index gets written and read is not trivial at all and
requires the reader to piece together a bunch of parts to figure out how
it works. Add some documentation to hopefully make this easier to
understand for the next reader.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/reader.c | 27 +++++++++++++++++++++++++++
 reftable/writer.c | 23 +++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/reftable/reader.c b/reftable/reader.c
index 278f727a3d..6011d0aa04 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -508,11 +508,38 @@ static int reader_seek_indexed(struct reftable_reader *r,
 	if (err < 0)
 		goto done;
 
+	/*
+	 * The index may consist of multiple levels, where each level may have
+	 * multiple index blocks. We start by doing a linear search in the
+	 * highest layer that identifies the relevant index block as well as
+	 * the record inside that block that corresponds to our wanted key.
+	 */
 	err = reader_seek_linear(&index_iter, &want_index);
 	if (err < 0)
 		goto done;
 
+	/*
+	 * Traverse down the levels until we find a non-index entry.
+	 */
 	while (1) {
+		/*
+		 * In case we seek a record that does not exist the index iter
+		 * will tell us that the iterator is over. This works because
+		 * the last index entry of the current level will contain the
+		 * last key it knows about. So in case our seeked key is larger
+		 * than the last indexed key we know that it won't exist.
+		 *
+		 * There is one subtlety in the layout of the index section
+		 * that makes this work as expected: the highest-level index is
+		 * at end of the section and will point backwards and thus we
+		 * start reading from the end of the index section, not the
+		 * beginning.
+		 *
+		 * If that wasn't the case and the order was reversed then the
+		 * linear seek would seek into the lower levels and traverse
+		 * all levels of the index only to find out that the key does
+		 * not exist.
+		 */
 		err = table_iter_next(&index_iter, &index_result);
 		table_iter_block_done(&index_iter);
 		if (err != 0)
diff --git a/reftable/writer.c b/reftable/writer.c
index 24fce5630a..6028789dee 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -391,6 +391,24 @@ static int writer_finish_section(struct reftable_writer *w)
 	if (err < 0)
 		return err;
 
+	/*
+	 * When the section we are about to index has a lot of blocks then the
+	 * index itself may span across multiple blocks, as well. This would
+	 * require a linear scan over index blocks only to find the desired
+	 * indexed block, which is inefficient. Instead, we write a multi-level
+	 * index where index records of level N+1 will refer to index blocks of
+	 * level N. This isn't constant time, either, but at least logarithmic.
+	 *
+	 * This loop handles writing this multi-level index. Note that we write
+	 * the lowest-level index pointing to the indexed blocks first. We then
+	 * continue writing additional index levels until the current level has
+	 * less blocks than the threshold so that the highest level will be at
+	 * the end of the index section.
+	 *
+	 * Readers are thus required to start reading the index section from
+	 * its end, which is why we set `index_start` to the beginning of the
+	 * last index section.
+	 */
 	while (w->index_len > threshold) {
 		struct reftable_index_record *idx = NULL;
 		size_t i, idx_len;
@@ -428,6 +446,11 @@ static int writer_finish_section(struct reftable_writer *w)
 		reftable_free(idx);
 	}
 
+	/*
+	 * The index may still contain a number of index blocks lower than the
+	 * threshold. Clear it so that these entries don't leak into the next
+	 * index section.
+	 */
 	writer_clear_index(w);
 
 	bstats = writer_reftable_block_stats(w, typ);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] reftable: fix writing multi-level indices
  2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
                   ` (4 preceding siblings ...)
  2024-01-26 10:31 ` [PATCH 5/5] reftable: document reading and writing indices Patrick Steinhardt
@ 2024-01-26 16:26 ` Junio C Hamano
  2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
  6 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2024-01-26 16:26 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> yesterday I noticed that writing reftables with more than 250,000 refs
> led to the last block of refs in the table not being seekable anymore.
> As it turns out, 250k refs was the boundary at which we start to write
> multi-level indices for refs.

Obviously one of the less exercised corners of the code ;-)

> The topic depends on Junio's jc/reftable-core-fsync at 1df18a1c9a
> (reftable: honor core.fsync, 2024-01-23) due to a semantic merge
> conflict in the newly added test.

Thanks for the note for the base.  John Cai's work is well done and
has been reviewed, so let's merge it down to 'next' soonish.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] reftable/writer: simplify writing index records
  2024-01-26 10:31 ` [PATCH 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
@ 2024-01-31 13:44   ` Toon Claes
  2024-02-01  8:39     ` Patrick Steinhardt
  2024-01-31 15:55   ` Kristoffer Haugsbakk
  1 sibling, 1 reply; 21+ messages in thread
From: Toon Claes @ 2024-01-31 13:44 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git


Patrick Steinhardt <ps@pks.im> writes:

> When finishing the current section we may end up writing index records
> for the section to the table. The logic to do so essentially copies what
> we already have in `writer_add_record()`, making this more complicated
> than it really has to be.

I didn't feel like this commit message made it easier for me to
understand, because I interpreted words differently than you intended.
Using "may end up" makes it sound like it's unexpected behavior. Also
the use of "copies" implies to me it's doing a copy operation. I would
rephrase it to something like:

  When finishing the current section some index records might be written
  for the section to the table. The logic to do so is essentially
  duplicated from what we already have in `writer_add_record()`, making
  this more complicated than it really has to be.

Other than that, I don't have any comments about this patch series.

--
Toon

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] reftable/writer: simplify writing index records
  2024-01-26 10:31 ` [PATCH 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
  2024-01-31 13:44   ` Toon Claes
@ 2024-01-31 15:55   ` Kristoffer Haugsbakk
  2024-02-01  8:39     ` Patrick Steinhardt
  1 sibling, 1 reply; 21+ messages in thread
From: Kristoffer Haugsbakk @ 2024-01-31 15:55 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Hi

It looks like this isn’t in `next` yet so I’ll leave a comment.

> @@ -405,6 +405,7 @@ static int writer_finish_section(struct reftable_writer *w)
>  		w->index = NULL;
>  		w->index_len = 0;
>  		w->index_cap = 0;
> +

This part and the next quoted one seem to be an unrelated edit by
`clang-format`. They could perhaps be grouped in their own patch.

> -				abort();
> -			}
>  		}
> -		for (i = 0; i < idx_len; i++) {
> +
> +		for (i = 0; i < idx_len; i++)
>  			strbuf_release(&idx[i].last_key);
> -		}
>  		reftable_free(idx);
>  	}

-- 
Kristoffer Haugsbakk

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 0/5] reftable: fix writing multi-level indices
  2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
                   ` (5 preceding siblings ...)
  2024-01-26 16:26 ` [PATCH 0/5] reftable: fix writing multi-level indices Junio C Hamano
@ 2024-02-01  7:51 ` Patrick Steinhardt
  2024-02-01  7:51   ` [PATCH v2 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
                     ` (4 more replies)
  6 siblings, 5 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  7:51 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 2874 bytes --]

Hi,

this is the second version of my patch series that fixes writing of
multi-level indices. There are two minor changes compared to v1:

  - Slightly rephrased a commit message.

  - Dropped an added newline that resulted in a new hunk.

The patch series continues to build on top of jc/reftable-core-fsync.

Thanks!

Patrick

Patrick Steinhardt (5):
  reftable/reader: be more careful about errors in indexed seeks
  reftable/writer: use correct type to iterate through index entries
  reftable/writer: simplify writing index records
  reftable/writer: fix writing multi-level indices
  reftable: document reading and writing indices

 reftable/reader.c         | 30 +++++++++++++++++++
 reftable/readwrite_test.c | 56 ++++++++++++++++++++++++++++++++++
 reftable/writer.c         | 63 ++++++++++++++++++++++-----------------
 3 files changed, 122 insertions(+), 27 deletions(-)

Range-diff against v1:
1:  ecf834a299 = 1:  ecf834a299 reftable/reader: be more careful about errors in indexed seeks
2:  88541d03be = 2:  88541d03be reftable/writer: use correct type to iterate through index entries
3:  b0982baacf ! 3:  b3de0b7f3b reftable/writer: simplify writing index records
    @@ Metadata
      ## Commit message ##
         reftable/writer: simplify writing index records
     
    -    When finishing the current section we may end up writing index records
    -    for the section to the table. The logic to do so essentially copies what
    -    we already have in `writer_add_record()`, making this more complicated
    -    than it really has to be.
    +    When finishing the current section some index records might be written
    +    for the section to the table. The logic that adds these records to the
    +    writer duplicates what we already have in `writer_add_record()`, making
    +    this more complicated than it really has to be.
     
    -    Simplify the code by using `writer_add_record()` instead.
    +    Simplify the code by using `writer_add_record()` instead. While at it,
    +    drop the unneeded braces around a loop to make the code conform to our
    +    code style better.
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## reftable/writer.c ##
    -@@ reftable/writer.c: static int writer_finish_section(struct reftable_writer *w)
    - 		w->index = NULL;
    - 		w->index_len = 0;
    - 		w->index_cap = 0;
    -+
    - 		for (i = 0; i < idx_len; i++) {
    - 			struct reftable_record rec = {
    - 				.type = BLOCK_TYPE_INDEX,
     @@ reftable/writer.c: static int writer_finish_section(struct reftable_writer *w)
      					.idx = idx[i],
      				},
4:  9c6622c409 = 4:  89a88cf83e reftable/writer: fix writing multi-level indices
5:  7850e65878 = 5:  c3492bbd42 reftable: document reading and writing indices
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 1/5] reftable/reader: be more careful about errors in indexed seeks
  2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
@ 2024-02-01  7:51   ` Patrick Steinhardt
  2024-02-01  7:52   ` [PATCH v2 2/5] reftable/writer: use correct type to iterate through index entries Patrick Steinhardt
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  7:51 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]

When doing an indexed seek we first need to do a linear seek in order to
find the index block for our wanted key. We do not check the returned
error of the linear seek though. This is likely not an issue because the
next call to `table_iter_next()` would return error, too. But it very
much is a code smell when an error variable is being assigned to without
actually checking it.

Safeguard the code by checking for errors.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/reader.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/reftable/reader.c b/reftable/reader.c
index 64dc366fb1..278f727a3d 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -509,6 +509,9 @@ static int reader_seek_indexed(struct reftable_reader *r,
 		goto done;
 
 	err = reader_seek_linear(&index_iter, &want_index);
+	if (err < 0)
+		goto done;
+
 	while (1) {
 		err = table_iter_next(&index_iter, &index_result);
 		table_iter_block_done(&index_iter);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 2/5] reftable/writer: use correct type to iterate through index entries
  2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
  2024-02-01  7:51   ` [PATCH v2 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
@ 2024-02-01  7:52   ` Patrick Steinhardt
  2024-02-01  7:52   ` [PATCH v2 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  7:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 1718 bytes --]

The reftable writer is tracking the number of blocks it has to index via
the `index_len` variable. But while this variable is of type `size_t`,
some sites use an `int` to loop through the index entries.

Convert the code to consistently use `size_t`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/writer.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/reftable/writer.c b/reftable/writer.c
index 92935baa70..5a0b87b406 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -379,20 +379,21 @@ int reftable_writer_add_logs(struct reftable_writer *w,
 
 static int writer_finish_section(struct reftable_writer *w)
 {
+	struct reftable_block_stats *bstats = NULL;
 	uint8_t typ = block_writer_type(w->block_writer);
 	uint64_t index_start = 0;
 	int max_level = 0;
-	int threshold = w->opts.unpadded ? 1 : 3;
+	size_t threshold = w->opts.unpadded ? 1 : 3;
 	int before_blocks = w->stats.idx_stats.blocks;
-	int err = writer_flush_block(w);
-	int i = 0;
-	struct reftable_block_stats *bstats = NULL;
+	int err;
+
+	err = writer_flush_block(w);
 	if (err < 0)
 		return err;
 
 	while (w->index_len > threshold) {
 		struct reftable_index_record *idx = NULL;
-		int idx_len = 0;
+		size_t i, idx_len;
 
 		max_level++;
 		index_start = w->next;
@@ -630,11 +631,8 @@ int reftable_writer_close(struct reftable_writer *w)
 
 static void writer_clear_index(struct reftable_writer *w)
 {
-	int i = 0;
-	for (i = 0; i < w->index_len; i++) {
+	for (size_t i = 0; i < w->index_len; i++)
 		strbuf_release(&w->index[i].last_key);
-	}
-
 	FREE_AND_NULL(w->index);
 	w->index_len = 0;
 	w->index_cap = 0;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 3/5] reftable/writer: simplify writing index records
  2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
  2024-02-01  7:51   ` [PATCH v2 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
  2024-02-01  7:52   ` [PATCH v2 2/5] reftable/writer: use correct type to iterate through index entries Patrick Steinhardt
@ 2024-02-01  7:52   ` Patrick Steinhardt
  2024-02-01  7:52   ` [PATCH v2 4/5] reftable/writer: fix writing multi-level indices Patrick Steinhardt
  2024-02-01  7:52   ` [PATCH v2 5/5] reftable: document reading and writing indices Patrick Steinhardt
  4 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  7:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 1428 bytes --]

When finishing the current section some index records might be written
for the section to the table. The logic that adds these records to the
writer duplicates what we already have in `writer_add_record()`, making
this more complicated than it really has to be.

Simplify the code by using `writer_add_record()` instead. While at it,
drop the unneeded braces around a loop to make the code conform to our
code style better.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/writer.c | 18 +++---------------
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/reftable/writer.c b/reftable/writer.c
index 5a0b87b406..b5bcce0292 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -412,26 +412,14 @@ static int writer_finish_section(struct reftable_writer *w)
 					.idx = idx[i],
 				},
 			};
-			if (block_writer_add(w->block_writer, &rec) == 0) {
-				continue;
-			}
 
-			err = writer_flush_block(w);
+			err = writer_add_record(w, &rec);
 			if (err < 0)
 				return err;
-
-			writer_reinit_block_writer(w, BLOCK_TYPE_INDEX);
-
-			err = block_writer_add(w->block_writer, &rec);
-			if (err != 0) {
-				/* write into fresh block should always succeed
-				 */
-				abort();
-			}
 		}
-		for (i = 0; i < idx_len; i++) {
+
+		for (i = 0; i < idx_len; i++)
 			strbuf_release(&idx[i].last_key);
-		}
 		reftable_free(idx);
 	}
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 4/5] reftable/writer: fix writing multi-level indices
  2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2024-02-01  7:52   ` [PATCH v2 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
@ 2024-02-01  7:52   ` Patrick Steinhardt
  2024-02-05 23:56     ` jltobler
  2024-02-01  7:52   ` [PATCH v2 5/5] reftable: document reading and writing indices Patrick Steinhardt
  4 siblings, 1 reply; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  7:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 4607 bytes --]

When finishing a section we will potentially write an index that makes
it more efficient to look up relevant blocks. The index records written
will encode, for each block of the indexed section, what the offset of
that block is as well as the last key of that block. Thus, the reader
would iterate through the index records to find the first key larger or
equal to the wanted key and then use the encoded offset to look up the
desired block.

When there are a lot of blocks to index though we may end up writing
multiple index blocks, too. To not require a linear search across all
index blocks we instead end up writing a multi-level index. Instead of
referring to the block we are after, an index record may point to
another index block. The reader will then access the highest-level index
and follow down the chain of index blocks until it hits the sought-after
block.

It has been observed though that it is impossible to seek ref records of
the last ref block when using a multi-level index. While the multi-level
index exists and looks fine for most of the part, the highest-level
index was missing an index record pointing to the last block of the next
index. Thus, every additional level made more refs become unseekable at
the end of the ref section.

The root cause is that we are not flushing the last block of the current
level once done writing the level. Consequently, it wasn't recorded in
the blocks that need to be indexed by the next-higher level and thus we
forgot about it.

Fix this bug by flushing blocks after we have written all index records.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/readwrite_test.c | 56 +++++++++++++++++++++++++++++++++++++++
 reftable/writer.c         |  8 +++---
 2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index 6b99daeaf2..853923397e 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -866,6 +866,61 @@ static void test_write_multiple_indices(void)
 	strbuf_release(&buf);
 }
 
+static void test_write_multi_level_index(void)
+{
+	struct reftable_write_options opts = {
+		.block_size = 100,
+	};
+	struct strbuf writer_buf = STRBUF_INIT, buf = STRBUF_INIT;
+	struct reftable_block_source source = { 0 };
+	struct reftable_iterator it = { 0 };
+	const struct reftable_stats *stats;
+	struct reftable_writer *writer;
+	struct reftable_reader *reader;
+	int err;
+
+	writer = reftable_new_writer(&strbuf_add_void, &noop_flush, &writer_buf, &opts);
+	reftable_writer_set_limits(writer, 1, 1);
+	for (size_t i = 0; i < 200; i++) {
+		struct reftable_ref_record ref = {
+			.update_index = 1,
+			.value_type = REFTABLE_REF_VAL1,
+			.value.val1 = {i},
+		};
+
+		strbuf_reset(&buf);
+		strbuf_addf(&buf, "refs/heads/%03" PRIuMAX, (uintmax_t)i);
+		ref.refname = buf.buf,
+
+		err = reftable_writer_add_ref(writer, &ref);
+		EXPECT_ERR(err);
+	}
+	reftable_writer_close(writer);
+
+	/*
+	 * The written refs should be sufficiently large to result in a
+	 * multi-level index.
+	 */
+	stats = reftable_writer_stats(writer);
+	EXPECT(stats->ref_stats.max_index_level == 2);
+
+	block_source_from_strbuf(&source, &writer_buf);
+	err = reftable_new_reader(&reader, &source, "filename");
+	EXPECT_ERR(err);
+
+	/*
+	 * Seeking the last ref should work as expected.
+	 */
+	err = reftable_reader_seek_ref(reader, &it, "refs/heads/199");
+	EXPECT_ERR(err);
+
+	reftable_iterator_destroy(&it);
+	reftable_writer_free(writer);
+	reftable_reader_free(reader);
+	strbuf_release(&writer_buf);
+	strbuf_release(&buf);
+}
+
 static void test_corrupt_table_empty(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -916,5 +971,6 @@ int readwrite_test_main(int argc, const char *argv[])
 	RUN_TEST(test_write_object_id_length);
 	RUN_TEST(test_write_object_id_min_length);
 	RUN_TEST(test_write_multiple_indices);
+	RUN_TEST(test_write_multi_level_index);
 	return 0;
 }
diff --git a/reftable/writer.c b/reftable/writer.c
index b5bcce0292..0349094d29 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -418,15 +418,15 @@ static int writer_finish_section(struct reftable_writer *w)
 				return err;
 		}
 
+		err = writer_flush_block(w);
+		if (err < 0)
+			return err;
+
 		for (i = 0; i < idx_len; i++)
 			strbuf_release(&idx[i].last_key);
 		reftable_free(idx);
 	}
 
-	err = writer_flush_block(w);
-	if (err < 0)
-		return err;
-
 	writer_clear_index(w);
 
 	bstats = writer_reftable_block_stats(w, typ);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 5/5] reftable: document reading and writing indices
  2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2024-02-01  7:52   ` [PATCH v2 4/5] reftable/writer: fix writing multi-level indices Patrick Steinhardt
@ 2024-02-01  7:52   ` Patrick Steinhardt
  2024-02-06  1:43     ` jltobler
  4 siblings, 1 reply; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  7:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 3863 bytes --]

The way the index gets written and read is not trivial at all and
requires the reader to piece together a bunch of parts to figure out how
it works. Add some documentation to hopefully make this easier to
understand for the next reader.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/reader.c | 27 +++++++++++++++++++++++++++
 reftable/writer.c | 23 +++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/reftable/reader.c b/reftable/reader.c
index 278f727a3d..6011d0aa04 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -508,11 +508,38 @@ static int reader_seek_indexed(struct reftable_reader *r,
 	if (err < 0)
 		goto done;
 
+	/*
+	 * The index may consist of multiple levels, where each level may have
+	 * multiple index blocks. We start by doing a linear search in the
+	 * highest layer that identifies the relevant index block as well as
+	 * the record inside that block that corresponds to our wanted key.
+	 */
 	err = reader_seek_linear(&index_iter, &want_index);
 	if (err < 0)
 		goto done;
 
+	/*
+	 * Traverse down the levels until we find a non-index entry.
+	 */
 	while (1) {
+		/*
+		 * In case we seek a record that does not exist the index iter
+		 * will tell us that the iterator is over. This works because
+		 * the last index entry of the current level will contain the
+		 * last key it knows about. So in case our seeked key is larger
+		 * than the last indexed key we know that it won't exist.
+		 *
+		 * There is one subtlety in the layout of the index section
+		 * that makes this work as expected: the highest-level index is
+		 * at end of the section and will point backwards and thus we
+		 * start reading from the end of the index section, not the
+		 * beginning.
+		 *
+		 * If that wasn't the case and the order was reversed then the
+		 * linear seek would seek into the lower levels and traverse
+		 * all levels of the index only to find out that the key does
+		 * not exist.
+		 */
 		err = table_iter_next(&index_iter, &index_result);
 		table_iter_block_done(&index_iter);
 		if (err != 0)
diff --git a/reftable/writer.c b/reftable/writer.c
index 0349094d29..e23953e498 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -391,6 +391,24 @@ static int writer_finish_section(struct reftable_writer *w)
 	if (err < 0)
 		return err;
 
+	/*
+	 * When the section we are about to index has a lot of blocks then the
+	 * index itself may span across multiple blocks, as well. This would
+	 * require a linear scan over index blocks only to find the desired
+	 * indexed block, which is inefficient. Instead, we write a multi-level
+	 * index where index records of level N+1 will refer to index blocks of
+	 * level N. This isn't constant time, either, but at least logarithmic.
+	 *
+	 * This loop handles writing this multi-level index. Note that we write
+	 * the lowest-level index pointing to the indexed blocks first. We then
+	 * continue writing additional index levels until the current level has
+	 * less blocks than the threshold so that the highest level will be at
+	 * the end of the index section.
+	 *
+	 * Readers are thus required to start reading the index section from
+	 * its end, which is why we set `index_start` to the beginning of the
+	 * last index section.
+	 */
 	while (w->index_len > threshold) {
 		struct reftable_index_record *idx = NULL;
 		size_t i, idx_len;
@@ -427,6 +445,11 @@ static int writer_finish_section(struct reftable_writer *w)
 		reftable_free(idx);
 	}
 
+	/*
+	 * The index may still contain a number of index blocks lower than the
+	 * threshold. Clear it so that these entries don't leak into the next
+	 * index section.
+	 */
 	writer_clear_index(w);
 
 	bstats = writer_reftable_block_stats(w, typ);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] reftable/writer: simplify writing index records
  2024-01-31 15:55   ` Kristoffer Haugsbakk
@ 2024-02-01  8:39     ` Patrick Steinhardt
  0 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  8:39 UTC (permalink / raw)
  To: Kristoffer Haugsbakk; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

On Wed, Jan 31, 2024 at 04:55:33PM +0100, Kristoffer Haugsbakk wrote:
> Hi
> 
> It looks like this isn’t in `next` yet so I’ll leave a comment.
> 
> > @@ -405,6 +405,7 @@ static int writer_finish_section(struct reftable_writer *w)
> >  		w->index = NULL;
> >  		w->index_len = 0;
> >  		w->index_cap = 0;
> > +
> 
> This part and the next quoted one seem to be an unrelated edit by
> `clang-format`. They could perhaps be grouped in their own patch.

I'll just drop this newline change here, it indeed does make the reader
wonder what's going on. I'll keep the other one though -- it does not
quite feel sensible to move it into its own patch.

Thanks!

Patrick

> > -				abort();
> > -			}
> >  		}
> > -		for (i = 0; i < idx_len; i++) {
> > +
> > +		for (i = 0; i < idx_len; i++)
> >  			strbuf_release(&idx[i].last_key);
> > -		}
> >  		reftable_free(idx);
> >  	}
> 
> -- 
> Kristoffer Haugsbakk

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] reftable/writer: simplify writing index records
  2024-01-31 13:44   ` Toon Claes
@ 2024-02-01  8:39     ` Patrick Steinhardt
  0 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-01  8:39 UTC (permalink / raw)
  To: Toon Claes; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1117 bytes --]

On Wed, Jan 31, 2024 at 02:44:38PM +0100, Toon Claes wrote:
> 
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > When finishing the current section we may end up writing index records
> > for the section to the table. The logic to do so essentially copies what
> > we already have in `writer_add_record()`, making this more complicated
> > than it really has to be.
> 
> I didn't feel like this commit message made it easier for me to
> understand, because I interpreted words differently than you intended.
> Using "may end up" makes it sound like it's unexpected behavior. Also
> the use of "copies" implies to me it's doing a copy operation. I would
> rephrase it to something like:
> 
>   When finishing the current section some index records might be written
>   for the section to the table. The logic to do so is essentially
>   duplicated from what we already have in `writer_add_record()`, making
>   this more complicated than it really has to be.
> 
> Other than that, I don't have any comments about this patch series.

Thanks, I'll use a slightly adapted version of this.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/5] reftable/writer: fix writing multi-level indices
  2024-02-01  7:52   ` [PATCH v2 4/5] reftable/writer: fix writing multi-level indices Patrick Steinhardt
@ 2024-02-05 23:56     ` jltobler
  2024-02-06  7:01       ` Patrick Steinhardt
  0 siblings, 1 reply; 21+ messages in thread
From: jltobler @ 2024-02-05 23:56 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

On 24/02/01 08:52AM, Patrick Steinhardt wrote:
> When finishing a section we will potentially write an index that makes
> it more efficient to look up relevant blocks. The index records written
> will encode, for each block of the indexed section, what the offset of
> that block is as well as the last key of that block. Thus, the reader
> would iterate through the index records to find the first key larger or
> equal to the wanted key and then use the encoded offset to look up the
> desired block.
> 
> When there are a lot of blocks to index though we may end up writing
> multiple index blocks, too. To not require a linear search across all
> index blocks we instead end up writing a multi-level index. Instead of
> referring to the block we are after, an index record may point to
> another index block. The reader will then access the highest-level index
> and follow down the chain of index blocks until it hits the sought-after
> block.
> 
> It has been observed though that it is impossible to seek ref records of
> the last ref block when using a multi-level index. While the multi-level
> index exists and looks fine for most of the part, the highest-level
> index was missing an index record pointing to the last block of the next
> index. Thus, every additional level made more refs become unseekable at
> the end of the ref section.

Just to clarify, is only the highest-level index not recording the last
block when multi-level indexes are being used? Or are the indexes at all
levels leaving the last block unreachable?

> 
> The root cause is that we are not flushing the last block of the current
> level once done writing the level. Consequently, it wasn't recorded in
> the blocks that need to be indexed by the next-higher level and thus we
> forgot about it.
> 
> Fix this bug by flushing blocks after we have written all index records.

 -Justin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/5] reftable: document reading and writing indices
  2024-02-01  7:52   ` [PATCH v2 5/5] reftable: document reading and writing indices Patrick Steinhardt
@ 2024-02-06  1:43     ` jltobler
  2024-02-06  7:04       ` Patrick Steinhardt
  0 siblings, 1 reply; 21+ messages in thread
From: jltobler @ 2024-02-06  1:43 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

On 24/02/01 08:52AM, Patrick Steinhardt wrote:
> The way the index gets written and read is not trivial at all and
> requires the reader to piece together a bunch of parts to figure out how
> it works. Add some documentation to hopefully make this easier to
> understand for the next reader.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/reader.c | 27 +++++++++++++++++++++++++++
>  reftable/writer.c | 23 +++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/reftable/reader.c b/reftable/reader.c
> index 278f727a3d..6011d0aa04 100644
> --- a/reftable/reader.c
> +++ b/reftable/reader.c
> @@ -508,11 +508,38 @@ static int reader_seek_indexed(struct reftable_reader *r,
>  	if (err < 0)
>  		goto done;
>  
> +	/*
> +	 * The index may consist of multiple levels, where each level may have
> +	 * multiple index blocks. We start by doing a linear search in the
> +	 * highest layer that identifies the relevant index block as well as
> +	 * the record inside that block that corresponds to our wanted key.
> +	 */
>  	err = reader_seek_linear(&index_iter, &want_index);
>  	if (err < 0)
>  		goto done;
>  
> +	/*
> +	 * Traverse down the levels until we find a non-index entry.
> +	 */
>  	while (1) {
> +		/*
> +		 * In case we seek a record that does not exist the index iter
> +		 * will tell us that the iterator is over. This works because
> +		 * the last index entry of the current level will contain the
> +		 * last key it knows about. So in case our seeked key is larger
> +		 * than the last indexed key we know that it won't exist.

The last block in the highest-level index section should end with the
record key of greatest value. Doesn't that mean the initial linear seek
should be sufficient to stop the iterator from continuing if the wanted
record key is of a greater value?

> +		 *
> +		 * There is one subtlety in the layout of the index section
> +		 * that makes this work as expected: the highest-level index is
> +		 * at end of the section and will point backwards and thus we

s/end/the end

> +		 * start reading from the end of the index section, not the
> +		 * beginning.
> +		 *
> +		 * If that wasn't the case and the order was reversed then the
> +		 * linear seek would seek into the lower levels and traverse
> +		 * all levels of the index only to find out that the key does
> +		 * not exist.
> +		 */
>  		err = table_iter_next(&index_iter, &index_result);
>  		table_iter_block_done(&index_iter);
>  		if (err != 0)

-Justin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/5] reftable/writer: fix writing multi-level indices
  2024-02-05 23:56     ` jltobler
@ 2024-02-06  7:01       ` Patrick Steinhardt
  0 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-06  7:01 UTC (permalink / raw)
  To: git, Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 2120 bytes --]

On Mon, Feb 05, 2024 at 05:56:11PM -0600, jltobler wrote:
> On 24/02/01 08:52AM, Patrick Steinhardt wrote:
> > When finishing a section we will potentially write an index that makes
> > it more efficient to look up relevant blocks. The index records written
> > will encode, for each block of the indexed section, what the offset of
> > that block is as well as the last key of that block. Thus, the reader
> > would iterate through the index records to find the first key larger or
> > equal to the wanted key and then use the encoded offset to look up the
> > desired block.
> > 
> > When there are a lot of blocks to index though we may end up writing
> > multiple index blocks, too. To not require a linear search across all
> > index blocks we instead end up writing a multi-level index. Instead of
> > referring to the block we are after, an index record may point to
> > another index block. The reader will then access the highest-level index
> > and follow down the chain of index blocks until it hits the sought-after
> > block.
> > 
> > It has been observed though that it is impossible to seek ref records of
> > the last ref block when using a multi-level index. While the multi-level
> > index exists and looks fine for most of the part, the highest-level
> > index was missing an index record pointing to the last block of the next
> > index. Thus, every additional level made more refs become unseekable at
> > the end of the ref section.
> 
> Just to clarify, is only the highest-level index not recording the last
> block when multi-level indexes are being used? Or are the indexes at all
> levels leaving the last block unreachable?

Every level N+1 looses the last block of level N. So the latter.

Patrick

> > 
> > The root cause is that we are not flushing the last block of the current
> > level once done writing the level. Consequently, it wasn't recorded in
> > the blocks that need to be indexed by the next-higher level and thus we
> > forgot about it.
> > 
> > Fix this bug by flushing blocks after we have written all index records.
> 
>  -Justin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/5] reftable: document reading and writing indices
  2024-02-06  1:43     ` jltobler
@ 2024-02-06  7:04       ` Patrick Steinhardt
  0 siblings, 0 replies; 21+ messages in thread
From: Patrick Steinhardt @ 2024-02-06  7:04 UTC (permalink / raw)
  To: git, Junio C Hamano, Toon Claes, Kristoffer Haugsbakk

[-- Attachment #1: Type: text/plain, Size: 2275 bytes --]

On Mon, Feb 05, 2024 at 07:43:07PM -0600, jltobler wrote:
> On 24/02/01 08:52AM, Patrick Steinhardt wrote:
> > The way the index gets written and read is not trivial at all and
> > requires the reader to piece together a bunch of parts to figure out how
> > it works. Add some documentation to hopefully make this easier to
> > understand for the next reader.
> > 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  reftable/reader.c | 27 +++++++++++++++++++++++++++
> >  reftable/writer.c | 23 +++++++++++++++++++++++
> >  2 files changed, 50 insertions(+)
> > 
> > diff --git a/reftable/reader.c b/reftable/reader.c
> > index 278f727a3d..6011d0aa04 100644
> > --- a/reftable/reader.c
> > +++ b/reftable/reader.c
> > @@ -508,11 +508,38 @@ static int reader_seek_indexed(struct reftable_reader *r,
> >  	if (err < 0)
> >  		goto done;
> >  
> > +	/*
> > +	 * The index may consist of multiple levels, where each level may have
> > +	 * multiple index blocks. We start by doing a linear search in the
> > +	 * highest layer that identifies the relevant index block as well as
> > +	 * the record inside that block that corresponds to our wanted key.
> > +	 */
> >  	err = reader_seek_linear(&index_iter, &want_index);
> >  	if (err < 0)
> >  		goto done;
> >  
> > +	/*
> > +	 * Traverse down the levels until we find a non-index entry.
> > +	 */
> >  	while (1) {
> > +		/*
> > +		 * In case we seek a record that does not exist the index iter
> > +		 * will tell us that the iterator is over. This works because
> > +		 * the last index entry of the current level will contain the
> > +		 * last key it knows about. So in case our seeked key is larger
> > +		 * than the last indexed key we know that it won't exist.
> 
> The last block in the highest-level index section should end with the
> record key of greatest value. Doesn't that mean the initial linear seek
> should be sufficient to stop the iterator from continuing if the wanted
> record key is of a greater value?

Yes. But we only notice it here when calling `table_iter_next()`. The
call to `reader_seek_linear()` will not return an end-of-iterator
indication.

Is there any way you think this comment can be improved to clarify this?

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-02-06  7:04 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-26 10:31 [PATCH 0/5] reftable: fix writing multi-level indices Patrick Steinhardt
2024-01-26 10:31 ` [PATCH 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
2024-01-26 10:31 ` [PATCH 2/5] reftable/writer: use correct type to iterate through index entries Patrick Steinhardt
2024-01-26 10:31 ` [PATCH 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
2024-01-31 13:44   ` Toon Claes
2024-02-01  8:39     ` Patrick Steinhardt
2024-01-31 15:55   ` Kristoffer Haugsbakk
2024-02-01  8:39     ` Patrick Steinhardt
2024-01-26 10:31 ` [PATCH 4/5] reftable/writer: fix writing multi-level indices Patrick Steinhardt
2024-01-26 10:31 ` [PATCH 5/5] reftable: document reading and writing indices Patrick Steinhardt
2024-01-26 16:26 ` [PATCH 0/5] reftable: fix writing multi-level indices Junio C Hamano
2024-02-01  7:51 ` [PATCH v2 " Patrick Steinhardt
2024-02-01  7:51   ` [PATCH v2 1/5] reftable/reader: be more careful about errors in indexed seeks Patrick Steinhardt
2024-02-01  7:52   ` [PATCH v2 2/5] reftable/writer: use correct type to iterate through index entries Patrick Steinhardt
2024-02-01  7:52   ` [PATCH v2 3/5] reftable/writer: simplify writing index records Patrick Steinhardt
2024-02-01  7:52   ` [PATCH v2 4/5] reftable/writer: fix writing multi-level indices Patrick Steinhardt
2024-02-05 23:56     ` jltobler
2024-02-06  7:01       ` Patrick Steinhardt
2024-02-01  7:52   ` [PATCH v2 5/5] reftable: document reading and writing indices Patrick Steinhardt
2024-02-06  1:43     ` jltobler
2024-02-06  7:04       ` Patrick Steinhardt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.