* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
From: Вероника Кулешова @ 2022-09-17 23:56 UTC
To: jonathantanmy; +Cc: git

Sent from iPhone

* [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
From: Jonathan Tan @ 2017-07-11 19:48 UTC
To: git; +Cc: Jonathan Tan

These patches are part of a set of patches implementing partial clone,
as you can see here:

https://github.com/jonathantanmy/git/tree/partialclone

In that branch, clone with batch checkout works, as you can see in the
README. The code and tests are generally done, but some patches are
still missing documentation and commit messages.

These 3 patches implement the foundational concept - formerly known as
"missing blobs" in the "missing blob manifest", I decided to call them
"promised blobs". The repo knows their object names and sizes. It also
does not have the blobs themselves, but can be configured to know how to
fetch them.

An older version of these patches was sent as a single demonstration
patch in versions 1 to 3 of [1]. In there, Junio suggested that I have
only one file containing missing blob information. I have made that
suggested change in this version.

One thing remaining is to add a repository extension [2] so that older
versions of Git fail immediately instead of trying to read missing
blobs, but I thought I'd send these first in order to get some initial
feedback.

[1] https://public-inbox.org/git/cover.1497035376.git.jonathantanmy@google.com/
[2] Documentation/technical/repository-version.txt

Jonathan Tan (3):
  promised-blob, fsck: introduce promised blobs
  sha1-array: support appending unsigned char hash
  sha1_file: add promised blob hook support

 Documentation/config.txt               |   8 ++
 Documentation/gitrepository-layout.txt |   8 ++
 Makefile                               |   1 +
 builtin/cat-file.c                     |   9 ++
 builtin/fsck.c                         |  13 +++
 promised-blob.c                        | 170 +++++++++++++++++++++++++++++++++
 promised-blob.h                        |  27 ++++++
 sha1-array.c                           |   7 ++
 sha1-array.h                           |   1 +
 sha1_file.c                            |  44 ++++++---
 t/t3907-promised-blob.sh               |  65 +++++++++++++
 t/test-lib-functions.sh                |   6 ++
 12 files changed, 345 insertions(+), 14 deletions(-)
 create mode 100644 promised-blob.c
 create mode 100644 promised-blob.h
 create mode 100755 t/t3907-promised-blob.sh

-- 
2.13.2.932.g7449e964c-goog

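Editorial sketch: the cover letter describes a repository that knows the names and sizes of promised blobs, does not hold their contents, and can be configured to fetch them. The toy model below illustrates that idea only. It is not the code in promised-blob.c; the class name, the in-memory dictionaries, and the `fetch_hook` callable are all assumptions made for illustration, and the actual on-disk format and hook interface are defined by the patches themselves.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class PromisedBlobStore:
    """Toy object store (hypothetical): blobs are either present or only promised."""

    present: Dict[str, bytes] = field(default_factory=dict)   # oid -> content
    promised: Dict[str, int] = field(default_factory=dict)    # oid -> size only
    fetch_hook: Optional[Callable[[str], bytes]] = None       # hypothetical hook

    def size_of(self, oid: str) -> int:
        """The size is known even when the content is not available locally."""
        if oid in self.present:
            return len(self.present[oid])
        return self.promised[oid]  # raises KeyError if the oid is unknown

    def read(self, oid: str) -> bytes:
        """Return the content, fetching a promised blob on demand."""
        if oid in self.present:
            return self.present[oid]
        if oid not in self.promised:
            raise KeyError(f"object {oid} is missing and was never promised")
        if self.fetch_hook is None:
            raise RuntimeError(f"object {oid} is promised but no hook is configured")
        data = self.fetch_hook(oid)
        if len(data) != self.promised[oid]:
            raise ValueError(f"fetched size for {oid} does not match the promise")
        self.present[oid] = data
        del self.promised[oid]
        return data

    def fsck(self, referenced: Iterable[str]) -> List[str]:
        """Report only objects that are neither present nor promised."""
        return [
            oid for oid in referenced
            if oid not in self.present and oid not in self.promised
        ]
```

In this model a connectivity check treats a promised blob as accounted for, while an object that is neither present nor promised is still reported as missing.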
* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
From: Philip Oakley @ 2017-07-16 15:23 UTC
To: Jonathan Tan, git; +Cc: Jonathan Tan, Ben Peart

From: "Jonathan Tan" <jonathantanmy@google.com>
Sent: Tuesday, July 11, 2017 8:48 PM
> These patches are part of a set of patches implementing partial clone,
> as you can see here:
>
> https://github.com/jonathantanmy/git/tree/partialclone
>
> In that branch, clone with batch checkout works, as you can see in the
> README. The code and tests are generally done, but some patches are
> still missing documentation and commit messages.
>
> These 3 patches implement the foundational concept - formerly known as
> "missing blobs" in the "missing blob manifest", I decided to call them
> "promised blobs". The repo knows their object names and sizes. It also
> does not have the blobs themselves, but can be configured to know how to
> fetch them.
>

If I understand correctly, this method doesn't give any direct user
visibility of missing blobs in the file system. Is that correct?

I was hoping that eventually the various 'on demand' approaches would
still allow users to continue to work as they go off-line, such that they
can see directly (in the FS) where the missing blobs (and trees) are
located, so that they can continue to commit new work on existing files.

I had felt that some sort of 'gitlink' should be present (human readable)
as a placeholder for the missing blob/tree, e.g. 'gitblob: 1234abcd'
(showing the missing oid, just like sub-modules can do - it's no
different really).

I'm concerned that the various GVFS extensions haven't fully achieved a
separation of concerns surrounding the DVCS capability for
on-line/off-line conversion as comms drop in and out. The GVFS looks
great for a fully networked, always on, environment, but it would be
good to also have the separation for those who (will) have shallow/narrow
clones that may also need to work with a local upstream that is also
shallow/narrow.

--
Philip
I wanted to at least get my thoughts into the discussion before it all
passes by.

* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
From: Ben Peart @ 2017-07-17 17:43 UTC
To: Philip Oakley, Jonathan Tan, git

On 7/16/2017 11:23 AM, Philip Oakley wrote:
> From: "Jonathan Tan" <jonathantanmy@google.com>
> Sent: Tuesday, July 11, 2017 8:48 PM
>> These patches are part of a set of patches implementing partial clone,
>> as you can see here:
>>
>> https://github.com/jonathantanmy/git/tree/partialclone
>>
>> In that branch, clone with batch checkout works, as you can see in the
>> README. The code and tests are generally done, but some patches are
>> still missing documentation and commit messages.
>>
>> These 3 patches implement the foundational concept - formerly known as
>> "missing blobs" in the "missing blob manifest", I decided to call them
>> "promised blobs". The repo knows their object names and sizes. It also
>> does not have the blobs themselves, but can be configured to know how to
>> fetch them.
>>
> If I understand correctly, this method doesn't give any direct user
> visibility of missing blobs in the file system. Is that correct?

That is correct.

>
> I was hoping that eventually the various 'on demand' approaches would
> still allow users to continue to work as they go off-line such that they
> can see directly (in the FS) where the missing blobs (and trees) are
> located, so that they can continue to commit new work on existing files.
>

This is a challenge as git assumes all objects are always available (that
is a key design principle of a DVCS), so any missing object is considered
a corruption that typically results in a call to "die". The GVFS solution
gets around this by ensuring any missing object is retrieved on behalf of
git so that it never sees it as missing. The obvious tradeoff is that
this requires a network connection so the object can be retrieved.

> I had felt that some sort of 'gitlink' should be present (human readable)
> as a place holder for the missing blob/tree. e.g. 'gitblob: 1234abcd'
> (showing the missing oid, just like sub-modules can do - it's no
> different really).
>

We explored that option briefly, but when you have a large number of
files, even writing out some sort of placeholder can take a very long
time. In fact, since the typical source file is relatively small (a few
kilobytes), writing out a placeholder doesn't save much time vs just
writing out the actual file contents.

Another challenge is that even if there is a placeholder written to disk,
you still need a network connection to retrieve the actual contents
if/when they are needed.

> I'm concerned that the various GVFS extensions haven't fully achieved a
> separation of concerns surrounding the DVCS capability for
> on-line/off-line conversion as comms drop in and out. The GVFS looks
> great for a fully networked, always on, environment, but it would be
> good to also have the separation for those who (will) have shallow/narrow
> clones that may also need to work with a local upstream that is also
> shallow/narrow.
>

You are correct that this hasn't been tackled yet. It is a challenging
problem. I can envision something along the lines of what was done for
the shallow clone feature, where there are distinct ways to change the
set of objects that are available, but that would hopefully come in some
future patch series.

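Editorial sketch: Ben's point that a placeholder "doesn't save much time" for small files can be made concrete with a simple cost model in which each file written pays a fixed per-file overhead plus a transfer term proportional to its size. The model and every number in the usage note are illustrative assumptions, not measurements from GVFS or Git.

```python
def checkout_seconds(
    n_files: int,
    bytes_per_file: int,
    per_file_overhead_s: float,
    throughput_bytes_per_s: float,
) -> float:
    """Estimated time to write n_files files of equal size.

    Hypothetical model: a fixed cost per file (create, open, close,
    metadata update) plus the time to write the bytes themselves.
    """
    per_file = per_file_overhead_s + bytes_per_file / throughput_bytes_per_s
    return n_files * per_file


def placeholder_saving(
    file_bytes: int,
    stub_bytes: int,
    per_file_overhead_s: float,
    throughput_bytes_per_s: float,
) -> float:
    """Fraction of checkout time saved by writing stubs instead of contents."""
    full = checkout_seconds(1, file_bytes, per_file_overhead_s, throughput_bytes_per_s)
    stub = checkout_seconds(1, stub_bytes, per_file_overhead_s, throughput_bytes_per_s)
    return (full - stub) / full
```

With assumed values such as a 4 KiB file, a 20-byte stub, 1 ms of per-file overhead and 100 MB/s of write throughput, `placeholder_saving(4096, 20, 0.001, 100e6)` comes out at only a few percent, because the fixed per-file cost dominates. The saving grows only when files are large relative to that overhead.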
* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
From: Philip Oakley @ 2017-07-25 20:48 UTC
To: Jonathan Tan, git, Ben Peart

Sorry for the delay - been away...

From: "Ben Peart" <peartben@gmail.com>
Sent: Monday, July 17, 2017 6:43 PM
>
> On 7/16/2017 11:23 AM, Philip Oakley wrote:
>> From: "Jonathan Tan" <jonathantanmy@google.com>
>> Sent: Tuesday, July 11, 2017 8:48 PM
>>> These patches are part of a set of patches implementing partial clone,
>>> as you can see here:
>>>
>>> https://github.com/jonathantanmy/git/tree/partialclone
>>>
>>> In that branch, clone with batch checkout works, as you can see in the
>>> README. The code and tests are generally done, but some patches are
>>> still missing documentation and commit messages.
>>>
>>> These 3 patches implement the foundational concept - formerly known as
>>> "missing blobs" in the "missing blob manifest", I decided to call them
>>> "promised blobs". The repo knows their object names and sizes. It also
>>> does not have the blobs themselves, but can be configured to know how to
>>> fetch them.
>>>
>> If I understand correctly, this method doesn't give any direct user
>> visibility of missing blobs in the file system. Is that correct?
>
> That is correct
>
>>
>> I was hoping that eventually the various 'on demand' approaches would
>> still allow users to continue to work as they go off-line such that they
>> can see directly (in the FS) where the missing blobs (and trees) are
>> located, so that they can continue to commit new work on existing files.
>>
>
> This is a challenge as git assumes all objects are always available (that
> is a key design principle of a DVCS) so any missing object is considered a
> corruption that typically results in a call to "die."

My view/concept was more based on the fact that Git is happy to have
missing 'trees', as long as they are submodules ;-), so I was hoping to
massage that so git could carry on working as if the whole 'tree' (or
blob, when they were omitted) was still present as 'unchanged', so the
oids would stay as they were.

I see that you don't omit the trees, which would be more common in my
type of environment (defence/security). I expect in an idealised BigWin
repo the same would also be true - the user only gets /Office/Excel if
that's what they are working on ;-)

>
> The GVFS solution gets around this by ensuring any missing object is
> retrieved on behalf of git so that it never sees it as missing. The
> obvious tradeoff is that this requires a network connection so the object
> can be retrieved.

In my concept, the user would not have the opportunity to fetch the
tree/blob, but could replace it in its entirety (we'd still have the
metadata of the tree/blob name and its old oid, but couldn't do a diff).

>
>> I had felt that some sort of 'gitlink' should be present (human readable)
>> as a place holder for the missing blob/tree. e.g. 'gitblob: 1234abcd'
>> (showing the missing oid, just like sub-modules can do - it's no
>> different really).
>>
>
> We explored that option briefly but when you have a large number of files,
> even writing out some sort of place holder can take a very long time. In
> fact, since the typical source file is relatively small (a few kilobytes),
> writing out a placeholder doesn't save much time vs just writing out the
> actual file contents.
>
> Another challenge is that even if there is a placeholder written to disk,
> you still need a network connection to retrieve the actual contents
> if/when it is needed.

I was viewing the 'missing' trees/blobs as part of a narrow clone
concept, so the user would need to explicitly widen the narrow clone to
get missing trees/blobs (which could have been omitted by age, size,
name, style of a .gitNarrowIgnore spec etc).

>
>> I'm concerned that the various GVFS extensions haven't fully achieved a
>> separation of concerns surrounding the DVCS capability for
>> on-line/off-line conversion as comms drop in and out. The GVFS looks
>> great for a fully networked, always on, environment, but it would be good
>> to also have the separation for those who (will) have shallow/narrow
>> clones that may also need to work with a local upstream that is also
>> shallow/narrow.
>>
>
> You are correct that this hasn't been tackled yet. It is a challenging
> problem. I can envision something along the lines of what was done for the
> shallow clone feature where there are distinct ways to change the set of
> objects that are available but that would hopefully come in some future
> patch series.

OK. That's good to know.

If the GVFS could be expanded to create a type of Narrow Clone
capability, so that 'going off-line' becomes an easy transition from
being just a neat VFS to being a neat narrow clone, then it may solve two
problems in one.

I had it in my mind that the missing blobs/trees could be simply stubbed
out within the repo itself, as just the oid ref, or maybe the oid ref
plus the length (given that size is one of the common causes of not
wanting the content just yet). The repo could still be packed etc, as
long as the format is understood.

--
Philip

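Editorial sketch: Philip suggests stubbing out a missing blob as "just the oid ref, or maybe the oid ref plus the length", with 'gitblob: 1234abcd' as his example text. The helpers below show one way such a human-readable stub could be written and recognised. The exact layout (a single line, an optional decimal length after the oid) and the function names are assumptions for illustration; no such format exists in Git or in these patches.

```python
import re
from typing import Optional, Tuple

# Hypothetical stub layout: "gitblob: <hex oid>" optionally followed by a size.
_STUB_RE = re.compile(r"gitblob: ([0-9a-f]+)(?: (\d+))?")


def format_stub(oid: str, size: Optional[int] = None) -> str:
    """Render a placeholder line for a blob that is not stored locally."""
    if size is None:
        return f"gitblob: {oid}"
    return f"gitblob: {oid} {size}"


def parse_stub(text: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (oid, size) if text is a stub, or None for ordinary content."""
    match = _STUB_RE.fullmatch(text.strip())
    if match is None:
        return None
    oid, size = match.group(1), match.group(2)
    return oid, (int(size) if size is not None else None)
```

Because the stub keeps the original oid, a tool that understands the format could move or rename the entry without ever holding the content, which is the property Philip compares to submodule gitlinks.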
* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
From: Jonathan Nieder @ 2017-07-17 18:03 UTC
To: Philip Oakley; +Cc: Jonathan Tan, git, Ben Peart

Hi Philip,

Philip Oakley wrote:
> From: "Jonathan Tan" <jonathantanmy@google.com>

>> These patches are part of a set of patches implementing partial clone,
>> as you can see here:
>>
>> https://github.com/jonathantanmy/git/tree/partialclone
[...]
> If I understand correctly, this method doesn't give any direct user
> visibility of missing blobs in the file system. Is that correct?
>
> I was hoping that eventually the various 'on demand' approaches
> would still allow users to continue to work as they go off-line such
> that they can see directly (in the FS) where the missing blobs (and
> trees) are located, so that they can continue to commit new work on
> existing files.
>
> I had felt that some sort of 'gitlink' should be present (human
> readable) as a place holder for the missing blob/tree. e.g.
> 'gitblob: 1234abcd' (showing the missing oid, just like sub-modules
> can do - it's no different really).

That's a reasonable thing to want, but it's a little different from
the use cases that partial clone work so far has aimed to support.
They are:

 A. Avoiding downloading all blobs (and likely trees as well) that are
    not needed in the current operation (e.g. checkout). This blends
    well with the sparse checkout feature, which allows the current
    checkout to be fairly small in a large repository.

    GVFS uses a trick that makes it a little easier to widen a sparse
    checkout upon access of a directory. But the same building blocks
    should work fine with a sparse checkout that has been set up
    explicitly.

 B. Avoiding downloading large blobs, except for those needed in the
    current operation (e.g. checkout).

    When not using sparse checkout, the main benefit out of the box is
    avoiding downloading *historical versions* of large blobs.

It sounds like you are looking for a sort of placeholder outside the
sparse checkout area. In a way, that's orthogonal to these patches:
even if you have all relevant blobs, you may want to avoid inflating
them to check them out and reading them to compare to the index (i.e.
the usual benefits of sparse checkout). In a sparse checkout, you
still might like to be able to get a listing of files outside the
sparse area (which you can get with "git ls-tree") and you may even
want to be able to get such a listing with plain "ls" (as with your
proposal).

Thanks and hope that helps,
Jonathan

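Editorial sketch: use cases A and B above amount to two different rules for deciding which blobs to download up front. The function below expresses both rules over a simple list of blob records. The record type, the prefix-based test for "inside the sparse checkout", and the parameter names are assumptions for illustration; real sparse-checkout patterns are richer than path prefixes, and this is not how the partial clone patches select objects.

```python
from typing import List, NamedTuple, Optional, Sequence


class BlobRef(NamedTuple):
    path: str
    oid: str
    size: int
    in_head: bool  # True if the blob belongs to the commit being checked out


def needed_now(blob: BlobRef, sparse_prefixes: Optional[Sequence[str]]) -> bool:
    """A blob is needed by the current checkout if it is in HEAD and, when a
    sparse checkout is configured, lies under one of its path prefixes."""
    if not blob.in_head:
        return False
    if sparse_prefixes is None:
        return True
    return any(blob.path.startswith(prefix) for prefix in sparse_prefixes)


def blobs_to_download(
    blobs: Sequence[BlobRef],
    sparse_prefixes: Optional[Sequence[str]] = None,
    size_limit: Optional[int] = None,
) -> List[str]:
    """Use case A (size_limit is None): only blobs needed by the checkout.
    Use case B (size_limit given): every small blob, plus large blobs
    only when the current checkout needs them."""
    wanted = []
    for blob in blobs:
        if needed_now(blob, sparse_prefixes):
            wanted.append(blob.oid)
        elif size_limit is not None and blob.size <= size_limit:
            wanted.append(blob.oid)
    return wanted
```

Under rule B without sparse prefixes, the only blobs skipped are large ones that are not in the commit being checked out, which matches the remark that the main benefit is avoiding historical versions of large blobs.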
* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
From: Philip Oakley @ 2017-07-29 12:51 UTC
To: Jonathan Nieder; +Cc: Jonathan Tan, git, Ben Peart

From: "Jonathan Nieder" <jrnieder@gmail.com>
Sent: Monday, July 17, 2017 7:03 PM
> Hi Philip,
>
> Philip Oakley wrote:
>> From: "Jonathan Tan" <jonathantanmy@google.com>
>
>>> These patches are part of a set of patches implementing partial clone,
>>> as you can see here:
>>>
>>> https://github.com/jonathantanmy/git/tree/partialclone
> [...]
>> If I understand correctly, this method doesn't give any direct user
>> visibility of missing blobs in the file system. Is that correct?
>>
>> I was hoping that eventually the various 'on demand' approaches
>> would still allow users to continue to work as they go off-line such
>> that they can see directly (in the FS) where the missing blobs (and
>> trees) are located, so that they can continue to commit new work on
>> existing files.
>>
>> I had felt that some sort of 'gitlink' should be present (human
>> readable) as a place holder for the missing blob/tree. e.g.
>> 'gitblob: 1234abcd' (showing the missing oid, just like sub-modules
>> can do - it's no different really).
>
> That's a reasonable thing to want, but it's a little different from
> the use cases that partial clone work so far has aimed to support.
> They are:
>
> A. Avoiding downloading all blobs (and likely trees as well) that are
> not needed in the current operation (e.g. checkout). This blends
> well with the sparse checkout feature, which allows the current
> checkout to be fairly small in a large repository.

True. In my case I was looking for a method that would allow a 'Narrow
clone' such that the local repo would be smaller (have less content), but
would feel as if all the useful files/directories were available, and
there would be placeholders at the points where the trees were pruned,
both in the object store and in the user's work-tree. As you say, in
some ways it's conceptually orthogonal to the original sparse checkout
(which has a full width object store / repo, and then omits files from
the checkout).

>
> GVFS uses a trick that makes it a little easier to widen a sparse
> checkout upon access of a directory. But the same building blocks
> should work fine with a sparse checkout that has been set up
> explicitly.
>
> B. Avoiding downloading large blobs, except for those needed in the
> current operation (e.g. checkout).
>
> When not using sparse checkout, the main benefit out of the box is
> avoiding downloading *historical versions* of large blobs.
>
> It sounds like you are looking for a sort of placeholder outside the
> sparse checkout area.

True.

> In a way, that's orthogonal to these patches:
> even if you have all relevant blobs, you may want to avoid inflating
> them to check them out and reading them to compare to the index (i.e.
> the usual benefits of sparse checkout).

In my concept, it should be possible to create the ('sparse'/narrow)
index from the content of the local object store, without any network
connection (though that content is determined by the prior
fetch/clone ;-). The proper git sparse checkout could proceed from there
as a further local restriction on what is omitted from the worktree.
Those missing from the narrow clone would still show as placeholders
with content ".gitnarrowtree 13a24b..<oid>", so we know what the hash oid
of the file/tree should be (and so they can be moved/renamed etc!). The
index would only know the content/structure as far as the placeholders
(just like sub-modules are a break point in the tracking, with identical
caveats).

It would be interesting to know from Ben what level of
sparseness/narrowness has typically been seen in the BigWin GVFS repo
case.

> In a sparse checkout, you
> still might like to be able to get a listing of files outside the
> sparse area (which you can get with "git ls-tree") and you may even
> want to be able to get such a listing with plain "ls" (as with your
> proposal).
>
> Thanks and hope that helps,
> Jonathan

Thanks, yes. It has helped consolidate some of the parts of my concept
that have been in the back of my mind for a while now.

Philip

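Editorial sketch: Philip describes an index that "would only know the content/structure as far as the place holders", with pruned trees shown as ".gitnarrowtree <oid>". The function below flattens a nested tree into such an index and stops descending wherever a subtree has been pruned. The nested-dict tree representation, the "blob"/"pruned" tags, and the function names are assumptions for illustration only; Git's real index and tree objects are structured differently.

```python
from typing import Dict, Tuple, Union

# Hypothetical tree model: a directory is a dict; a leaf is a (kind, oid)
# tuple where kind is "blob" for a present file or "pruned" for a subtree
# or blob that the narrow clone omitted.
Leaf = Tuple[str, str]
Tree = Dict[str, Union[Leaf, "Tree"]]


def build_narrow_index(tree: Tree, prefix: str = "") -> Dict[str, Leaf]:
    """Flatten the tree to path -> (kind, oid), never looking below a
    pruned entry because its contents are not available locally."""
    index: Dict[str, Leaf] = {}
    for name, node in sorted(tree.items()):
        path = prefix + name
        if isinstance(node, dict):
            index.update(build_narrow_index(node, path + "/"))
        else:
            index[path] = node
    return index


def placeholder_content(oid: str) -> str:
    """Work-tree text for a pruned entry, following Philip's example."""
    return f".gitnarrowtree {oid}"
```

A pruned directory appears as a single index entry carrying its oid, so it can be renamed or moved as a unit, in the same way a submodule gitlink marks a break point in tracking.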