Date: Tue, 1 Mar 2022 19:57:57 -0500
From: Taylor Blau
To: git@vger.kernel.org
Cc: tytso@mit.edu, derrickstolee@github.com, gitster@pobox.com, larsxschneider@gmail.com
Subject: [PATCH v2 00/17] cruft packs

Here is a reroll of my series to implement "cruft packs", a pack which
stores accumulated unreachable objects, along with a new ".mtimes" file
which tracks each object's last known modification time.

This was on the list towards the end of 2021[1], and I have been
accumulating small changes to it locally for a couple of months now.
Major changes since last time include:

  - Clearer documentation and commit message(s) to better illustrate how
    the feature works and is supposed to be used.

  - Some minor documentation updates to pack-format.txt, which make some
    ambiguous details more explicit.

  - Minor code movement / tweaks to make things easier to read, ensure
    that functions aren't introduced in patches before they are used /
    etc.

  - Moved the new test script to t5328 (instead of t5327, which happens
    to be taken up by a new MIDX bitmap-related test), and purged it of
    all "rm -fr .git/logs" (replacing them with
    "git reflog expire --all --expire=all" instead).

  - A new test which fixes a bug where loose objects which have copies
    that appear in a cruft pack would not get accumulated when doing a
    `--geometric` repack.

For convenience, a range-diff is below.

Thanks in advance for taking another look!

[1]: https://lore.kernel.org/git/cover.1638224692.git.me@ttaylorr.com/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  97 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 183 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 129 +++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5328-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1810 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5328-pack-objects-cruft.sh

Range-diff against v1:
 1:  a9f7c738e0 !  1:  784ee7e0ee Documentation/technical: add cruft-packs.txt
    @@ Documentation/technical/cruft-packs.txt (new)
     @@
     += Cruft packs
     +
    -+Cruft packs offer an alternative to Git's traditional mechanism of removing
    -+unreachable objects. This document provides an overview of Git's pruning
    -+mechanism, and how cruft packs can be used instead to accomplish the same.
    ++The cruft packs feature offer an alternative to Git's traditional mechanism of
    ++removing unreachable objects. This document provides an overview of Git's
    ++pruning mechanism, and how a cruft pack can be used instead to accomplish the
    ++same.
     +
     +== Background
     +
    @@ Documentation/technical/cruft-packs.txt (new)
     +
     +== Cruft packs
     +
    -+Cruft packs are designed to eliminate the need for storing unreachable objects
    -+in a loose state by including the per-object mtimes in a separate file alongside
    -+a single pack containing all loose objects.
    ++A cruft pack eliminates the need for storing unreachable objects in a loose
    ++state by including the per-object mtimes in a separate file alongside a single
    ++pack containing all loose objects.
     +
     +A cruft pack is written by `git repack --cruft` when generating a new pack.
     +linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
    @@ Documentation/technical/cruft-packs.txt (new)
     +Notable alternatives to this design include:
     +
     +  - The location of the per-object mtime data, and
    -+  - Whether cruft packs should be incremental or not.
    ++  - Storing unreachable objects in multiple cruft packs.
     +
     +On the location of mtime data, a new auxiliary file tied to the pack was chosen
     +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
     +support for optional chunks of data, it may make sense to consolidate the
     +`.mtimes` format into the `.idx` itself.
     +
    -+Incremental cruft packs (i.e., where each time a repository is repacked a new
    -+cruft pack is generated containing only the unreachable objects introduced since
    -+the last time a cruft pack was written) are significantly more complicated to
    -+construct, and so aren't pursued here. The obvious drawback to the current
    -+implementation is that the entire cruft pack must be re-written from scratch.
    ++Storing unreachable objects among multiple cruft packs (e.g., creating a new
    ++cruft pack during each repacking operation including only unreachable objects
    ++which aren't already stored in an earlier cruft pack) is significantly more
    ++complicated to construct, and so aren't pursued here. The obvious drawback to
    ++the current implementation is that the entire cruft pack must be re-written from
    ++scratch.
 2:  7d4ae7bd3e !  2:  101b34660c pack-mtimes: support reading .mtimes files
    @@ Documentation/technical/pack-format.txt: Pack file entry: <+
     +
     +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
     +
    -+  - A table of mtimes (one per packed object, num_objects in total, each
    -+    a 4-byte unsigned integer in network order), in the same order as
    -+    objects appear in the index file (e.g., the first entry in the mtime
    -+    table corresponds to the object with the lowest lexically-sorted
    -+    oid). The mtimes count standard epoch seconds.
    ++  - A table of 4-byte unsigned integers in network order. The ith
    ++    value is the modification time (mtime) of the ith object in the
    ++    corresponding pack by lexicographic (index) order. The mtimes
    ++    count standard epoch seconds.
     +
    -+  - A trailer, containing a:
    -+
    -+    checksum of the corresponding packfile, and
    -+
    -+    a checksum of all of the above.
    ++  - A trailer, containing a checksum of the corresponding packfile,
    ++    and a checksum of all of the above (each having length according
    ++    to the specified hash function).
     +
     +All 4-byte numbers are in network order.
     +
    @@ pack-mtimes.c (new)
     +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
     +}
     +
    -+int pack_has_mtimes(struct packed_git *p)
    -+{
    -+	struct stat st;
    -+	char *fname = pack_mtimes_filename(p);
    -+
    -+	if (stat(fname, &st) < 0) {
    -+		if (errno == ENOENT)
    -+			return 0;
    -+		die_errno(_("could not stat %s"), fname);
    -+	}
    -+
    -+	free(fname);
    -+	return 1;
    -+}
    -+
     +#define MTIMES_HEADER_SIZE (12)
     +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
     +
    @@ pack-mtimes.c (new)
     +	struct stat st;
     +	void *data = NULL;
     +	size_t mtimes_size;
    ++	struct mtimes_header header;
     +	uint32_t *hdr;
     +
     +	fd = git_open(mtimes_file);
    @@ pack-mtimes.c (new)
     +
     +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
     +
    -+	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
    ++	header.signature = ntohl(hdr[0]);
    ++	header.version = ntohl(hdr[1]);
    ++	header.hash_id = ntohl(hdr[2]);
    ++
    ++	if (header.signature != MTIMES_SIGNATURE) {
     +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
     +		goto cleanup;
     +	}
     +
    -+	if (ntohl(*++hdr) != 1) {
    ++	if (header.version != 1) {
     +		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
    -+			    mtimes_file, ntohl(*hdr));
    ++			    mtimes_file, header.version);
     +		goto cleanup;
     +	}
    -+	hdr++;
    -+	if (!(ntohl(*hdr) == 1 || ntohl(*hdr) == 2)) {
    ++
    ++	if (!(header.hash_id == 1 || header.hash_id == 2)) {
     +		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
    -+			    mtimes_file, ntohl(*hdr));
    ++			    mtimes_file, header.hash_id);
     +		goto cleanup;
     +	}
     +
    @@ pack-mtimes.h (new)
     +
     +struct packed_git;
     +
    -+int pack_has_mtimes(struct packed_git *p);
     +int load_pack_mtimes(struct packed_git *p);
     +
     +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
    @@ pack-mtimes.h (new)
     +#endif
     
      ## packfile.c ##
    -@@ packfile.c: void close_pack_revindex(struct packed_git *p) {
    +@@ packfile.c: static void close_pack_revindex(struct packed_git *p)
      	p->revindex_data = NULL;
      }
     
    -+void close_pack_mtimes(struct packed_git *p) {
    ++static void close_pack_mtimes(struct packed_git *p)
    ++{
     +	if (!p->mtimes_map)
     +		return;
     +
    @@ packfile.c: static void prepare_pack(const char *full_name, size_t full_name_len
      		string_list_append(data->garbage, full_name);
      	else
      		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
    -
    - ## packfile.h ##
    -@@ packfile.h: uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
    - unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
    - void close_pack_windows(struct packed_git *);
    - void close_pack_revindex(struct packed_git *);
    -+void close_pack_mtimes(struct packed_git *p);
    - void close_pack(struct packed_git *);
    - void close_object_store(struct raw_object_store *o);
    - void unuse_pack(struct pack_window **);
 3:  7f4612e859 =  3:  a94d7dfeb3 pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 4:  ea245b7216 =  4:  1e0ed363ae chunk-format.h: extract oid_version()
 5:  deece9eb70 !  5:  5236490688 pack-mtimes: support writing pack .mtimes files
    @@ pack-objects.h: struct packing_data {
      	unsigned int *tree_depth;
      	unsigned char *layer;
     +
    -+	/* cruft packs */
    ++	/*
    ++	 * Used when writing cruft packs.
    ++	 *
    ++	 * Object mtimes are stored in pack order when writing, but
    ++	 * written out in lexicographic (index) order.
    ++	 */
     +	uint32_t *cruft_mtime;
      };
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	hashwrite_be32(f, oid_version(the_hash_algo));
     +}
     +
    ++/*
    ++ * Writes the object mtimes of "objects" for use in a .mtimes file.
    ++ * Note that objects must be in lexicographic (index) order, which is
    ++ * the expected ordering of these values in the .mtimes file.
    ++ */
     +static void write_mtimes_objects(struct hashfile *f,
     +				 struct packing_data *to_pack,
     +				 struct pack_idx_entry **objects,
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	write_mtimes_objects(f, to_pack, objects, nr_objects);
     +	write_mtimes_trailer(f, hash);
     +
    -+	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
    ++	if (adjust_shared_perm(mtimes_name) < 0)
     +		die(_("failed to make %s readable"), mtimes_name);
     +
     +	finalize_hashfile(f, NULL,
    @@ pack-write.c: void stage_tmp_packfiles(struct strbuf *name_buffer,
     +		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
     +						    nr_written,
     +						    hash);
    -+		if (adjust_shared_perm(mtimes_tmp_name))
    -+			die_errno("unable to make temporary mtimes file readable");
     +	}
     +
      	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 6:  e0a7b3b310 !  6:  78313bc441 t/helper: add 'pack-mtimes' test-tool
    @@ t/helper/test-pack-mtimes.c (new)
     +#include "packfile.h"
     +#include "pack-mtimes.h"
     +
    -+static int dump_mtimes(struct packed_git *p)
    ++static void dump_mtimes(struct packed_git *p)
     +{
     +	uint32_t i;
     +	if (load_pack_mtimes(p) < 0)
    @@ t/helper/test-pack-mtimes.c (new)
     +		printf("%s %"PRIu32"\n",
     +		       oid_to_hex(&oid), nth_packed_mtime(p, i));
     +	}
    -+
    -+	return 0;
     +}
     +
     +static const char *pack_mtimes_usage = "\n"
    @@ t/helper/test-pack-mtimes.c (new)
     +
     +	strbuf_release(&buf);
     +
    -+	return p ? dump_mtimes(p) : 1;
    ++	if (!p)
    ++		die("could not find pack '%s'", argv[1]);
    ++
    ++	dump_mtimes(p);
    ++
    ++	return 0;
     +}
     
      ## t/helper/test-tool.c ##
 7:  5710933127 =  7:  142098668d builtin/pack-objects.c: return from create_object_entry()
 8:  66165917a4 !  8:  2517a6be3d builtin/pack-objects.c: --cruft without expiration
    @@ Commit message
             which packs are about to be removed.
     
           - All packs which are going to be removed (we'll call these the
    -        redundant ones) are marked as kept in-core, as well as any packs
    -        that `pack-objects` found but the caller did not specify.
    +        redundant ones) are marked as kept in-core.
     
    -        These packs are presumed to have entered the repository between
    -        the caller collecting packs and invoking `pack-objects`. Since we
    -        do not want to include objects in these packs (because we don't know
    -        which of their objects are or aren't reachable), these are also
    -        marked as kept in-core.
    +        Any packs the caller did not mention (but are known to the
    +        `pack-objects` process) are also marked as kept in-core. Packs not
    +        mentioned by the caller are assumed to be unknown to them, i.e.,
    +        they entered the repository after the caller decided which packs
    +        should be kept and which should be discarded.
    +
    +        Since we do not want to include objects in these "unknown" packs
    +        (because we don't know which of their objects are or aren't
    +        reachable), these are also marked as kept in-core.
     
           - Then, we enumerate all objects in the repository, and add them to
             our packing list if they do not appear in an in-core kept pack.
    @@ Documentation/git-pack-objects.txt: SYNOPSIS
      	[--local] [--incremental] [--window=] [--depth=]
      	[--revs [--unpacked | --all]] [--keep-pack=]
     +	[--cruft] [--cruft-expiration=