All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] fetch: git: add reference parameter
@ 2020-10-18 21:33 Rasmus Villemoes
  2020-10-19  9:09 ` Richard Purdie
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-18 21:33 UTC (permalink / raw)
  To: bitbake-devel; +Cc: Richard Purdie, Rasmus Villemoes

The --reference(-if-able) option to git clone can significantly speed
up the initial do_fetch if one is able to point git at a local
repository containing a large number of the objects that would
otherwise need to be downloaded.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
---
 lib/bb/fetch2/git.py | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/lib/bb/fetch2/git.py b/lib/bb/fetch2/git.py
index b97967b4..c8837816 100644
--- a/lib/bb/fetch2/git.py
+++ b/lib/bb/fetch2/git.py
@@ -51,6 +51,13 @@ Supported SRC_URI options are:
    For local git:// urls to use the current branch HEAD as the revision for use with
    AUTOREV. Implies nobranch.
 
+- reference
+   This can be set to the path to a local repository expected to
+   contain many of the objects that would otherwise need to be
+   downloaded. It will be passed as the argument to the
+   '--reference-if-able' option of 'git clone', along with the
+   '--dissociate' option.
+
 """
 
 # Copyright (C) 2005 Richard Purdie
@@ -151,6 +158,8 @@ class Git(FetchMethod):
 
         ud.nobranch = ud.parm.get("nobranch","0") == "1"
 
+        ud.reference = ud.parm.get("reference", "")
+
         # usehead implies nobranch
         ud.usehead = ud.parm.get("usehead","0") == "1"
         if ud.usehead:
@@ -344,6 +353,8 @@ class Git(FetchMethod):
             if repourl.startswith("file://"):
                 repourl = repourl[7:]
             clone_cmd = "LANG=C %s clone --bare --mirror %s %s --progress" % (ud.basecmd, shlex.quote(repourl), ud.clonedir)
+            if ud.reference:
+                clone_cmd += " --reference-if-able %s --dissociate" % shlex.quote(ud.reference)
             if ud.proto.lower() != 'file':
                 bb.fetch2.check_network_access(d, clone_cmd, ud.url)
             progresshandler = GitProgressHandler(d)
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-18 21:33 [PATCH] fetch: git: add reference parameter Rasmus Villemoes
@ 2020-10-19  9:09 ` Richard Purdie
  2020-10-19 10:25   ` Rasmus Villemoes
  2020-10-22 13:38 ` [PATCH v2] " Rasmus Villemoes
       [not found] ` <1640541B397C8D4C.8847@lists.openembedded.org>
  2 siblings, 1 reply; 14+ messages in thread
From: Richard Purdie @ 2020-10-19  9:09 UTC (permalink / raw)
  To: Rasmus Villemoes, bitbake-devel

On Sun, 2020-10-18 at 23:33 +0200, Rasmus Villemoes wrote:
> The --reference(-if-able) option to git clone can significantly speed
> up the initial do_fetch if one is able to point git at a local
> repository containing a large number of the objects that would
> otherwise need to be downloaded.
> 
> Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
> ---
>  lib/bb/fetch2/git.py | 11 +++++++++++
>  1 file changed, 11 insertions(+)

Whilst I can see why this is useful, it is going to end up causing
problems when you share that recipe using this term with others :(. I
don't think we should add API which will cause scaling issues end
"encourage" behaviour which leads to problems.

Clone by reference also causes problems for the archiver.

I have often wondered about this issue though, my own thoughts were
some kind of "keyword" system where if present, a base repository could
be cloned, then added to from the original upstream. At the end of
cloning, revisions could be pushed back into that making it a kind of
mirror.

So yes, I see the attraction/need but I think to have something merged,
it would need to be more usable in the general case.

Cheers,
Richard




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19  9:09 ` Richard Purdie
@ 2020-10-19 10:25   ` Rasmus Villemoes
  2020-10-19 10:43     ` Richard Purdie
  0 siblings, 1 reply; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-19 10:25 UTC (permalink / raw)
  To: Richard Purdie, bitbake-devel

On 19/10/2020 11.09, Richard Purdie wrote:
> On Sun, 2020-10-18 at 23:33 +0200, Rasmus Villemoes wrote:
>> The --reference(-if-able) option to git clone can significantly speed
>> up the initial do_fetch if one is able to point git at a local
>> repository containing a large number of the objects that would
>> otherwise need to be downloaded.
>>
>> Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
>> ---
>>  lib/bb/fetch2/git.py | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
> 
> Whilst I can see why this is useful, it is going to end up causing
> problems when you share that recipe using this term with others :(. I
> don't think we should add API which will cause scaling issues end
> "encourage" behaviour which leads to problems.
> 
> Clone by reference also causes problems for the archiver.

How? Note that I unconditionally add the --dissociate, so that after the
download, there's absolutely no link between the referenced and the
cloned repository. Also, I use the -if-able variant which will just
silently ignore the given repo if it cannot be used (often, because it
doesn't exist). So I don't think the presence of absence of ;reference=
can ever affect the correctness or end result. But see below; I don't
think any recipes should/would ever grow a ;reference= directly, as it's
really a local opt-in optimization, so the recipe should just be written
to allow that easily.
> I have often wondered about this issue though, my own thoughts were
> some kind of "keyword" system where if present, a base repository could
> be cloned, then added to from the original upstream. At the end of
> cloning, revisions could be pushed back into that making it a kind of
> mirror.
> 
> So yes, I see the attraction/need but I think to have something merged,
> it would need to be more usable in the general case.

By far, the biggest pain point and what we're mostly going to use this
for is cloning linux repositories - those are usually at least 2G in
size, and every project/board uses such a thing. So my example is based
on that. In our linux recipe, we have (simplified)

LINUX_SRC_URI ?= "git://our.server/generic/linux.git;protocol=ssh"
SRC_URI = "${LINUX_SRC_URI}"
SRC_URI += [various .cfg files etc.]

i.e., we have made sure that the main component of SRC_URI is its own
variable, which can thus easily be overridden from machine config or
elsewhere - when doing development, it's nice to able to just put

LINUX_SRC_URI = "git:///some/local/path;protocol=file"

in local.conf (the branch and SRCREV are similarly defined via such
LINUX_* variables). So I was thinking that a LINUX_SRC_URI_append =
";reference=..." would do the trick. Alternatively, I'd rewrite the
recipe to read

SRC_URI = "${LINUX_SRC_URI}${LINUX_SRC_URI_REF}"

with LINUX_SRC_URI_REF of course being ?= "". Then any developer can
just set LINUX_SRC_URI_REF = ";reference=/my/local/linux-tree", and our
CI setup (which, when asked to do clean build, runs with an empty
initial downloads dir) could similarly be taught to set that variable.

Sure, this is all just for one recipe/repo, but (a) the user would
necessarily need to provide such a reference for each recipe making use
of this and (b) there are not really that many projects whose repo sizes
warrant this, so it's quite manageable to tweak the few recipes.

I'll also note that this has been a bigger pain for the past few months
when I've been working from home, connected to our infrastructure over a
slow vpn connection. Nothing indicates that things are returning to
normal any time soon, so I'd really like to find some way to reduce the
necessary network traffic.

Rasmus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19 10:25   ` Rasmus Villemoes
@ 2020-10-19 10:43     ` Richard Purdie
  2020-10-19 11:25       ` Rasmus Villemoes
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Purdie @ 2020-10-19 10:43 UTC (permalink / raw)
  To: Rasmus Villemoes, bitbake-devel

On Mon, 2020-10-19 at 12:25 +0200, Rasmus Villemoes wrote:
> On 19/10/2020 11.09, Richard Purdie wrote:
> > On Sun, 2020-10-18 at 23:33 +0200, Rasmus Villemoes wrote:
> > > The --reference(-if-able) option to git clone can significantly
> > > speed
> > > up the initial do_fetch if one is able to point git at a local
> > > repository containing a large number of the objects that would
> > > otherwise need to be downloaded.
> > > 
> > > Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
> > > ---
> > >  lib/bb/fetch2/git.py | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > 
> > Whilst I can see why this is useful, it is going to end up causing
> > problems when you share that recipe using this term with others :(.
> > I
> > don't think we should add API which will cause scaling issues end
> > "encourage" behaviour which leads to problems.
> > 
> > Clone by reference also causes problems for the archiver.
> 
> How? Note that I unconditionally add the --dissociate, so that after
> the download, there's absolutely no link between the referenced and
> the cloned repository. Also, I use the -if-able variant which will
> just silently ignore the given repo if it cannot be used (often,
> because it doesn't exist). So I don't think the presence of absence
> of ;reference= can ever affect the correctness or end result. But see
> below;

Ok, I'd missed that dissociate was available. I think that is new since
the last time I looked at this (its been a while) and that helps
simplify things.

>  I don't think any recipes should/would ever grow a ;reference=
> directly, as it's really a local opt-in optimization, so the recipe
> should just be written to allow that easily.

You know how this works, why and what it means. Most people won't so it
will get added to public recipes and it will cause problems. I've been
a maintainer long enough to know how this kind of works out
unfortunately.

I realised there is also a bigger problem. You're changing SRC_URI to
be specific to local paths. This will affect the sstate checksums and
break sstate reuse. We don't want to do that. My counter-proposal of an
identifier in SRC_URI like ";reference=linux-kernel" won't cause a
problem with checksums.

> > I have often wondered about this issue though, my own thoughts were
> > some kind of "keyword" system where if present, a base repository
> > could
> > be cloned, then added to from the original upstream. At the end of
> > cloning, revisions could be pushed back into that making it a kind
> > of
> > mirror.
> > 
> > So yes, I see the attraction/need but I think to have something
> > merged,
> > it would need to be more usable in the general case.
> 
> By far, the biggest pain point and what we're mostly going to use
> this for is cloning linux repositories - those are usually at least
> 2G in size, and every project/board uses such a thing. So my example
> is based on that. In our linux recipe, we have (simplified)
> 
> LINUX_SRC_URI ?= "git://our.server/generic/linux.git;protocol=ssh"
> SRC_URI = "${LINUX_SRC_URI}"
> SRC_URI += [various .cfg files etc.]
> 
> i.e., we have made sure that the main component of SRC_URI is its own
> variable, which can thus easily be overridden from machine config or
> elsewhere - when doing development, it's nice to able to just put
> 
> LINUX_SRC_URI = "git:///some/local/path;protocol=file"
> 
> in local.conf (the branch and SRCREV are similarly defined via such
> LINUX_* variables). So I was thinking that a LINUX_SRC_URI_append =
> ";reference=..." would do the trick. Alternatively, I'd rewrite the
> recipe to read
> 
> SRC_URI = "${LINUX_SRC_URI}${LINUX_SRC_URI_REF}"
> 
> with LINUX_SRC_URI_REF of course being ?= "". Then any developer can
> just set LINUX_SRC_URI_REF = ";reference=/my/local/linux-tree", and
> our CI setup (which, when asked to do clean build, runs with an empty
> initial downloads dir) could similarly be taught to set that
> variable.
> 
> Sure, this is all just for one recipe/repo, but (a) the user would
> necessarily need to provide such a reference for each recipe making
> use of this and (b) there are not really that many projects whose
> repo sizes warrant this, so it's quite manageable to tweak the few
> recipes.
> 
> I'll also note that this has been a bigger pain for the past few
> months when I've been working from home, connected to our
> infrastructure over a slow vpn connection. Nothing indicates that
> things are returning to normal any time soon, so I'd really like to
> find some way to reduce the necessary network traffic.

As I said, I totally understand the need/usage. I just don't think the
implementation can be merged as is for various reasons, we need
something more generic. I've tried to give some hints on a better
"interface" which would allow something to be merged.

Cheers,

Richard


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19 10:43     ` Richard Purdie
@ 2020-10-19 11:25       ` Rasmus Villemoes
  2020-10-19 11:37         ` Richard Purdie
  0 siblings, 1 reply; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-19 11:25 UTC (permalink / raw)
  To: Richard Purdie, bitbake-devel

On 19/10/2020 12.43, Richard Purdie wrote:
> On Mon, 2020-10-19 at 12:25 +0200, Rasmus Villemoes wrote:
>> On 19/10/2020 11.09, Richard Purdie wrote:
>>> On Sun, 2020-10-18 at 23:33 +0200, Rasmus Villemoes wrote:
>>>> The --reference(-if-able) option to git clone can significantly
>>>> speed
>>>> up the initial do_fetch if one is able to point git at a local
>>>> repository containing a large number of the objects that would
>>>> otherwise need to be downloaded.
>>>>
>>>> Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
>>>> ---
>>>>  lib/bb/fetch2/git.py | 11 +++++++++++
>>>>  1 file changed, 11 insertions(+)
>>>
>>> Whilst I can see why this is useful, it is going to end up causing
>>> problems when you share that recipe using this term with others :(.
>>> I
>>> don't think we should add API which will cause scaling issues end
>>> "encourage" behaviour which leads to problems.
>>>
>>> Clone by reference also causes problems for the archiver.
>>
>> How? Note that I unconditionally add the --dissociate, so that after
>> the download, there's absolutely no link between the referenced and
>> the cloned repository. Also, I use the -if-able variant which will
>> just silently ignore the given repo if it cannot be used (often,
>> because it doesn't exist). So I don't think the presence of absence
>> of ;reference= can ever affect the correctness or end result. But see
>> below;
> 
> Ok, I'd missed that dissociate was available. I think that is new since
> the last time I looked at this (its been a while) and that helps
> simplify things.

Well, yes, I glossed over the git version requirements. --dissociate is
from v2.3 (Feb 2015, fb1d6dab), while the -if-able variant requires 2.11
(Nov 2016, f7415b4d711). It would probably be nice if this was all just
silently ignored when git is too old. Do we have some way to get the git
version? I suppose a simple "git --version" subprocess isn't really that
costly given that we're about to spawn a "git clone" anyway. And we'd
only have to do it in the case where a reference is actually given. But
then, since the developer must have opted in by providing some path
(through whichever means we can agree on), perhaps the documentation
should just mention that requirement.

>>  I don't think any recipes should/would ever grow a ;reference=
>> directly, as it's really a local opt-in optimization, so the recipe
>> should just be written to allow that easily.
> 
> You know how this works, why and what it means. Most people won't so it
> will get added to public recipes and it will cause problems. I've been
> a maintainer long enough to know how this kind of works out
> unfortunately.
> 
> I realised there is also a bigger problem. You're changing SRC_URI to
> be specific to local paths. This will affect the sstate checksums and
> break sstate reuse. We don't want to do that.

Competely agreed, and I oversimplified. I'd expect the LINUX_SRC_URI_REF
to have a vardepvalue of "". But yeah, it's easy to forget to add that.

 My counter-proposal of an
> identifier in SRC_URI like ";reference=linux-kernel" won't cause a
> problem with checksums.

Perhaps, but where would you get the information on what linux-kernel
should result in? I guess one could have the user set

SRC_URI_GIT_REFERENCE[linux-kernel] = "/some/path"

and then in urldata_init() do d.getVarFlag() to get the path. With
SRC_URI_GIT_REFERENCE having whatever appropriate magic necessary to
avoid its flags affecting checksums.

But, do we even need that indirection through a keyword? What if
bitbake.conf (or whatever more appropriate place there might be) grew a

SRC_URI_GIT_REFERENCE ?= ""
SRC_URI_GIT_REFERENCE[vardepvalue] = ""

Recipes where one wants to support this could then just be written

  SRC_URI +=
"git://:some.where/foo/bar;protocol=https;reference=${SRC_URI_GIT_REFERENCE}"

[an empty reference= parameter is equivalent to none at all],
and the way for a developer to opt-in and specify a reference would be
the usual pn overrides

SRC_URI_GIT_REFERENCE_pn-the-kernel-recipe = "/my/local/path"

?

> As I said, I totally understand the need/usage. I just don't think the
> implementation can be merged as is for various reasons, we need
> something more generic. I've tried to give some hints on a better
> "interface" which would allow something to be merged.

I can certainly try doing it with the keyword approach, but could you
put a few more words on what you were thinking? I.e., am I right about
how you thought the path should be stored and retrieved?

I'll have to respin for the git version issue anyway I think, so what do
you prefer?

Rasmus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19 11:25       ` Rasmus Villemoes
@ 2020-10-19 11:37         ` Richard Purdie
  2020-10-19 12:11           ` Rasmus Villemoes
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Purdie @ 2020-10-19 11:37 UTC (permalink / raw)
  To: Rasmus Villemoes, bitbake-devel

On Mon, 2020-10-19 at 13:25 +0200, Rasmus Villemoes wrote:
> On 19/10/2020 12.43, Richard Purdie wrote:
> > On Mon, 2020-10-19 at 12:25 +0200, Rasmus Villemoes wrote:
> > > On 19/10/2020 11.09, Richard Purdie wrote:
> > > > On Sun, 2020-10-18 at 23:33 +0200, Rasmus Villemoes wrote:
> > > > > The --reference(-if-able) option to git clone can
> > > > > significantly
> > > > > speed
> > > > > up the initial do_fetch if one is able to point git at a
> > > > > local
> > > > > repository containing a large number of the objects that
> > > > > would
> > > > > otherwise need to be downloaded.
> > > > > 
> > > > > Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
> > > > > ---
> > > > >  lib/bb/fetch2/git.py | 11 +++++++++++
> > > > >  1 file changed, 11 insertions(+)
> > > > 
> > > > Whilst I can see why this is useful, it is going to end up
> > > > causing
> > > > problems when you share that recipe using this term with others
> > > > :(.
> > > > I
> > > > don't think we should add API which will cause scaling issues
> > > > end
> > > > "encourage" behaviour which leads to problems.
> > > > 
> > > > Clone by reference also causes problems for the archiver.
> > > 
> > > How? Note that I unconditionally add the --dissociate, so that
> > > after
> > > the download, there's absolutely no link between the referenced
> > > and
> > > the cloned repository. Also, I use the -if-able variant which
> > > will
> > > just silently ignore the given repo if it cannot be used (often,
> > > because it doesn't exist). So I don't think the presence of
> > > absence
> > > of ;reference= can ever affect the correctness or end result. But
> > > see
> > > below;
> > 
> > Ok, I'd missed that dissociate was available. I think that is new
> > since
> > the last time I looked at this (its been a while) and that helps
> > simplify things.
> 
> Well, yes, I glossed over the git version requirements. --dissociate
> is from v2.3 (Feb 2015, fb1d6dab), while the -if-able variant
> requires 2.11
> (Nov 2016, f7415b4d711). It would probably be nice if this was all
> just
> silently ignored when git is too old. Do we have some way to get the
> git
> version? I suppose a simple "git --version" subprocess isn't really
> that
> costly given that we're about to spawn a "git clone" anyway. And we'd
> only have to do it in the case where a reference is actually given.
> But
> then, since the developer must have opted in by providing some path
> (through whichever means we can agree on), perhaps the documentation
> should just mention that requirement.

sanity.bbclass documents the current minimum git version as 1.8.3.1.

We could make the check conditional upon whether
BBFETCH_GIT_REFERENCEDIR is set (see below).

> Competely agreed, and I oversimplified. I'd expect the
> LINUX_SRC_URI_REF
> to have a vardepvalue of "". But yeah, it's easy to forget to add
> that.

I think we can do something simpler.

>  My counter-proposal of an
> > identifier in SRC_URI like ";reference=linux-kernel" won't cause a
> > problem with checksums.
> 
> Perhaps, but where would you get the information on what linux-kernel
> should result in? I guess one could have the user set
> 
> SRC_URI_GIT_REFERENCE[linux-kernel] = "/some/path"
> 
> and then in urldata_init() do d.getVarFlag() to get the path. With
> SRC_URI_GIT_REFERENCE having whatever appropriate magic necessary to
> avoid its flags affecting checksums.
> 
> But, do we even need that indirection through a keyword? What if
> bitbake.conf (or whatever more appropriate place there might be) grew
> a
> 
> SRC_URI_GIT_REFERENCE ?= ""
> SRC_URI_GIT_REFERENCE[vardepvalue] = ""
> 
> Recipes where one wants to support this could then just be written
> 
>   SRC_URI +=
> "git://:some.where/foo/bar;protocol=https;reference=${SRC_URI_GIT_REF
> ERENCE}"
> 
> [an empty reference= parameter is equivalent to none at all],
> and the way for a developer to opt-in and specify a reference would
> be
> the usual pn overrides
> 
> SRC_URI_GIT_REFERENCE_pn-the-kernel-recipe = "/my/local/path"
> 
> ?
> 
> > As I said, I totally understand the need/usage. I just don't think
> > the
> > implementation can be merged as is for various reasons, we need
> > something more generic. I've tried to give some hints on a better
> > "interface" which would allow something to be merged.
> 
> I can certainly try doing it with the keyword approach, but could you
> put a few more words on what you were thinking? I.e., am I right
> about
> how you thought the path should be stored and retrieved?
> 
> I'll have to respin for the git version issue anyway I think, so what
> do you prefer?

How about we just have the user set:

BBFETCH_GIT_REFERENCEDIR = "${DL_DIR}/git-reference-mirrors/"

and then unconditionally mark up recipes with:

;reference=linux-kernel

then the fetcher code looks in ${BBFETCH_GIT_REFERENCEDIR}/<reference>
for reference clones if and only if BBFETCH_GIT_REFERENCEDIR is set?

This avoids all the messing with vardeps, keeps the SRC_URIs simple and
should generally be much more usable/scalable?

If BBFETCH_GIT_REFERENCEDIR is set, we'd also probably want to make
successful clones push references into that directory to cache them?

In general you'd set this globally but if you do really need do
something specific to a recipe, a recipe specific override would also
happen to work.

Cheers,

Richard



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19 11:37         ` Richard Purdie
@ 2020-10-19 12:11           ` Rasmus Villemoes
  2020-10-19 12:28             ` Richard Purdie
  0 siblings, 1 reply; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-19 12:11 UTC (permalink / raw)
  To: Richard Purdie, bitbake-devel

On 19/10/2020 13.37, Richard Purdie wrote:
> On Mon, 2020-10-19 at 13:25 +0200, Rasmus Villemoes wrote:

>> I'll have to respin for the git version issue anyway I think, so what
>> do you prefer?
> 
> How about we just have the user set:
> 
> BBFETCH_GIT_REFERENCEDIR = "${DL_DIR}/git-reference-mirrors/"
> 
> and then unconditionally mark up recipes with:
> 
> ;reference=linux-kernel
> 
> then the fetcher code looks in ${BBFETCH_GIT_REFERENCEDIR}/<reference>
> for reference clones if and only if BBFETCH_GIT_REFERENCEDIR is set?

Works for me, as long as there's no requirement that
BBFETCH_GIT_REFERENCEDIR is located below DL_DIR (but I assume that was
just an example). In fact, at least for our CI setup (where everything
runs in docker containers and we map in appropriate host dirs), setting
up that directory is rather easy, adding

  -v /hosts/linux-repo:/git-refs/linux-kernel:ro

to the docker run command and set BBFETCH_GIT_REFERENCEDIR = "/git-refs".

But, uh, how does this avoid the checksum/sstate issue?
BBFETCH_GIT_REFERENCEDIR will need to be exempt somehow, right? I mean,
I do want to be able to use sstate artifacts generated by our CI, but
not setting up BBFETCH_GIT_REFERENCEDIR (at least not necessarily
setting it up in exactly the same way).

> This avoids all the messing with vardeps, keeps the SRC_URIs simple and
> should generally be much more usable/scalable?
> 
> If BBFETCH_GIT_REFERENCEDIR is set, we'd also probably want to make
> successful clones push references into that directory to cache them?

Absolutely not, the referenced repository is not to be modified in any way.

Rasmus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19 12:11           ` Rasmus Villemoes
@ 2020-10-19 12:28             ` Richard Purdie
  2020-10-19 12:56               ` Rasmus Villemoes
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Purdie @ 2020-10-19 12:28 UTC (permalink / raw)
  To: Rasmus Villemoes, bitbake-devel

On Mon, 2020-10-19 at 14:11 +0200, Rasmus Villemoes wrote:
> On 19/10/2020 13.37, Richard Purdie wrote:
> > On Mon, 2020-10-19 at 13:25 +0200, Rasmus Villemoes wrote:
> > > I'll have to respin for the git version issue anyway I think, so
> > > what
> > > do you prefer?
> > 
> > How about we just have the user set:
> > 
> > BBFETCH_GIT_REFERENCEDIR = "${DL_DIR}/git-reference-mirrors/"
> > 
> > and then unconditionally mark up recipes with:
> > 
> > ;reference=linux-kernel
> > 
> > then the fetcher code looks in
> > ${BBFETCH_GIT_REFERENCEDIR}/<reference>
> > for reference clones if and only if BBFETCH_GIT_REFERENCEDIR is
> > set?
> 
> Works for me, as long as there's no requirement that
> BBFETCH_GIT_REFERENCEDIR is located below DL_DIR (but I assume that
> was just an example). In fact, at least for our CI setup (where
> everything runs in docker containers and we map in appropriate host
> dirs), setting up that directory is rather easy, adding
> 
>   -v /hosts/linux-repo:/git-refs/linux-kernel:ro
> 
> to the docker run command and set BBFETCH_GIT_REFERENCEDIR = "/git-
> refs".

Right, DL_DIR would likely work well for most users but its
configurable to anywhere.

> But, uh, how does this avoid the checksum/sstate issue?
> BBFETCH_GIT_REFERENCEDIR will need to be exempt somehow, right? I
> mean, I do want to be able to use sstate artifacts generated by our
> CI, but not setting up BBFETCH_GIT_REFERENCEDIR (at least not
> necessarily setting it up in exactly the same way).

The dependency code would have no idea the variable is being used as it
doesn't know how to peek that deeply inside the fetcher.

> > This avoids all the messing with vardeps, keeps the SRC_URIs simple
> > and
> > should generally be much more usable/scalable?
> > 
> > If BBFETCH_GIT_REFERENCEDIR is set, we'd also probably want to make
> > successful clones push references into that directory to cache
> > them?
> 
> Absolutely not, the referenced repository is not to be modified in
> any way.

I'd like something which can be used by others so we need a way to
populate/update it. If its read-only in your case, fine, we can detect
and not update but in the general case I'd like others to be able to
benefit and use this too.

Cheers,

Richard


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19 12:28             ` Richard Purdie
@ 2020-10-19 12:56               ` Rasmus Villemoes
  2020-10-19 13:15                 ` Richard Purdie
  0 siblings, 1 reply; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-19 12:56 UTC (permalink / raw)
  To: Richard Purdie, bitbake-devel

On 19/10/2020 14.28, Richard Purdie wrote:
> On Mon, 2020-10-19 at 14:11 +0200, Rasmus Villemoes wrote:
>> On 19/10/2020 13.37, Richard Purdie wrote:
>>> On Mon, 2020-10-19 at 13:25 +0200, Rasmus Villemoes wrote:
>>>> I'll have to respin for the git version issue anyway I think, so
>>>> what
>>>> do you prefer?
>>>
>>> How about we just have the user set:
>>>
>>> BBFETCH_GIT_REFERENCEDIR = "${DL_DIR}/git-reference-mirrors/"
>>>
>>> and then unconditionally mark up recipes with:
>>>
>>> ;reference=linux-kernel
>>>
>>> then the fetcher code looks in
>>> ${BBFETCH_GIT_REFERENCEDIR}/<reference>
>>> for reference clones if and only if BBFETCH_GIT_REFERENCEDIR is
>>> set?

Yes, as I said, we can make the use of the --reference-if-able etc.
conditional on the user having opted in in the first place, then if the
host provided git is not new enough it will just fail.

> Right, DL_DIR would likely work well for most users but its
> configurable to anywhere.
> 
>> But, uh, how does this avoid the checksum/sstate issue?
>> BBFETCH_GIT_REFERENCEDIR will need to be exempt somehow, right? I
>> mean, I do want to be able to use sstate artifacts generated by our
>> CI, but not setting up BBFETCH_GIT_REFERENCEDIR (at least not
>> necessarily setting it up in exactly the same way).
> 
> The dependency code would have no idea the variable is being used as it
> doesn't know how to peek that deeply inside the fetcher.

Which is of course totally non-obvious for everybody but a handful of
people. But that means one might as well make it slightly more flexible
and not force every reference repo to live in the same directory, and go
back to the keyword-style

BBFETCH_GIT_REFERENCEDIR[linux-kernel] = "/some/path"
BBFETCH_GIT_REFERENCEDIR[chromium] = "/totally/different/place"

But if you insist, it doesn't matter much, testing suggests that git is
fine with the reference actually being a symlink to the real repo.

>>> This avoids all the messing with vardeps, keeps the SRC_URIs simple
>>> and
>>> should generally be much more usable/scalable?
>>>
>>> If BBFETCH_GIT_REFERENCEDIR is set, we'd also probably want to make
>>> successful clones push references into that directory to cache
>>> them?
>>
>> Absolutely not, the referenced repository is not to be modified in
>> any way.
> 
> I'd like something which can be used by others so we need a way to
> populate/update it. If its read-only in your case, fine, we can detect
> and not update but in the general case I'd like others to be able to
> benefit and use this too.

I really don't see the point of doing that (pushing "references" into
the referenced repo). Why do you claim that we need a way to
populate/update it? Keeping a  mirror of the upstream is exactly what we
do under DL_DIR/git2/munged-path-to-upstream, why should some other
local, but completely external to bitbake, repository also act like
that? The purpose is just to speed up the initial clone (say, by having
a clone of Linus' upstream repo lying around). I really don't get it.

Rasmus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fetch: git: add reference parameter
  2020-10-19 12:56               ` Rasmus Villemoes
@ 2020-10-19 13:15                 ` Richard Purdie
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Purdie @ 2020-10-19 13:15 UTC (permalink / raw)
  To: Rasmus Villemoes, bitbake-devel

On Mon, 2020-10-19 at 14:56 +0200, Rasmus Villemoes wrote:
> On 19/10/2020 14.28, Richard Purdie wrote:
> > > But, uh, how does this avoid the checksum/sstate issue?
> > > BBFETCH_GIT_REFERENCEDIR will need to be exempt somehow, right? I
> > > mean, I do want to be able to use sstate artifacts generated by
> > > our
> > > CI, but not setting up BBFETCH_GIT_REFERENCEDIR (at least not
> > > necessarily setting it up in exactly the same way).
> > 
> > The dependency code would have no idea the variable is being used
> > as it
> > doesn't know how to peek that deeply inside the fetcher.
> 
> Which is of course totally non-obvious for everybody but a handful of
> people.

Variable dependency tracking does not go into *any* library functions,
be it lib/oe/*, scripts/lib/* or bitbake itself. Whether obvious or
not, it is how it works and always has worked and the line is quite
clear.

>  But that means one might as well make it slightly more flexible and
> not force every reference repo to live in the same directory, and go
> back to the keyword-style
> 
> BBFETCH_GIT_REFERENCEDIR[linux-kernel] = "/some/path"
> BBFETCH_GIT_REFERENCEDIR[chromium] = "/totally/different/place"
> 
> But if you insist, it doesn't matter much, testing suggests that git
> is fine with the reference actually being a symlink to the real repo.

My preference is for simple API where we can. Users are much more used
to SSTATE_DIR = "XXX" rather than setting flags variables for each
repo.

> > > > This avoids all the messing with vardeps, keeps the SRC_URIs
> > > > simple
> > > > and
> > > > should generally be much more usable/scalable?
> > > > 
> > > > If BBFETCH_GIT_REFERENCEDIR is set, we'd also probably want to
> > > > make
> > > > successful clones push references into that directory to cache
> > > > them?
> > > 
> > > Absolutely not, the referenced repository is not to be modified
> > > in
> > > any way.
> > 
> > I'd like something which can be used by others so we need a way to
> > populate/update it. If its read-only in your case, fine, we can
> > detect
> > and not update but in the general case I'd like others to be able
> > to
> > benefit and use this too.
> 
> I really don't see the point of doing that (pushing "references" into
> the referenced repo). Why do you claim that we need a way to
> populate/update it? Keeping a  mirror of the upstream is exactly what
> we
> do under DL_DIR/git2/munged-path-to-upstream, why should some other
> local, but completely external to bitbake, repository also act like
> that? The purpose is just to speed up the initial clone (say, by
> having
> a clone of Linus' upstream repo lying around). I really don't get it.

You need to step back and think about how others might use this. I
think that most users will want to use a single common storage
location. They may have multiple BSPs, each using a kernel git repo so
being able to have a central store of those git references would be
useful. In the common case people will want the first clone to populate
it, the second clone to benefit from the speedup (regardless of which
BSP they build first, or whether they have local clone originally or
not).

I'd actually argue for making it part of the standard DL_DIR and drop
the references inside the other pieces of DL_DIR so its all more space
efficient. Its probably better to do this one step at a time though but
that would be a nice speedup/efficiency to ultimately have.

Whilst I appreciate this isn't exactly the problem you're trying to
solve, I'm interested in the solution which lets us be generally useful
to a majority of users.

Cheers,

Richard


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] fetch: git: add reference parameter
  2020-10-18 21:33 [PATCH] fetch: git: add reference parameter Rasmus Villemoes
  2020-10-19  9:09 ` Richard Purdie
@ 2020-10-22 13:38 ` Rasmus Villemoes
       [not found] ` <1640541B397C8D4C.8847@lists.openembedded.org>
  2 siblings, 0 replies; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-22 13:38 UTC (permalink / raw)
  To: bitbake-devel; +Cc: Richard Purdie, Rasmus Villemoes

The --reference(-if-able) option to git clone can significantly speed
up the initial do_fetch if one is able to point git at a local
repository containing a large number of the objects that would
otherwise need to be downloaded.

Allow the developer to define a directory BBFETCH_GIT_REFERENCEDIR,
subdirectories of which can be used as keys for the new reference=
parameter to git SRC_URIs. So, say for example

  BBFETCH_GIT_REFERENCEDIR = "/git-refs"

and /git-refs/linux-kernel contains a (even possibly somewhat dated)
clone of
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git. Then
if a kernel recipe has

  SRC_URI = "git://some.vendor/tree;reference=linux-kernel"

the initial clone from some.vendor/tree will be much faster, since the
subset of objects that come from one of Greg's stable branches will
not need to be downloaded.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
---

Pushing the downloaded objects to the referenced repo, making that
more or less contain a union of all SRC_URI repos using the same
reference=foo key, does not seem particularly hard. It's mostly a
matter of running

  git -C $clonedir send-pack --all --verbose $ref_repo

(I've done that manually, then removed $clonedir and rerun do_fetch;
now even more objects were available as expected.)

However, one will need to ignore any errors, since two different
vendor kernels can have a "master" branch (or any other ref name for
that matter) mean different things.

There's also a decision to be made regarding the initial state: If the
directory $ref_repo doesn't exist, should one do a "git init --bare
$ref_repo" first? What if it does exist but is not a (bare) git
repository?

And another decision to be made regarding when the send-pack should be
run - only after an initial clone, or also after a git fetch?

Since I'm not one of those alleged "most users" that would want any of
this, I'll leave all it to someone that does. Just please make it
opt-in based on some additional BBFETCH_GIT_REFERENCEDIR_PUSH_REFS or
whatnot.


 lib/bb/fetch2/git.py | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/lib/bb/fetch2/git.py b/lib/bb/fetch2/git.py
index b97967b4..bc39962c 100644
--- a/lib/bb/fetch2/git.py
+++ b/lib/bb/fetch2/git.py
@@ -51,6 +51,18 @@ Supported SRC_URI options are:
    For local git:// urls to use the current branch HEAD as the revision for use with
    AUTOREV. Implies nobranch.
 
+- reference
+   This can be used to point the fetcher at a local repository
+   expected to contain many of the objects that would otherwise need
+   to be downloaded. The path to the local repository is formed by
+   joining ${BBFETCH_GIT_REFERENCEDIR} with the value of this
+   parameter. It will be passed as the argument to the
+   '--reference-if-able' option of 'git clone', along with the
+   '--dissociate' option.
+
+   If ${BBFETCH_GIT_REFERENCEDIR} is not set, this has no
+   effect. Also, the provided git version must be >= 2.11 for
+   --reference-if-able to be supported.
 """
 
 # Copyright (C) 2005 Richard Purdie
@@ -151,6 +163,8 @@ class Git(FetchMethod):
 
         ud.nobranch = ud.parm.get("nobranch","0") == "1"
 
+        ud.reference = ud.parm.get("reference", "")
+
         # usehead implies nobranch
         ud.usehead = ud.parm.get("usehead","0") == "1"
         if ud.usehead:
@@ -337,6 +351,11 @@ class Git(FetchMethod):
             runfetchcmd("tar -xzf %s" % ud.fullmirror, d, workdir=ud.clonedir)
 
         repourl = self._get_repo_url(ud)
+        ref_repo = None
+        if ud.reference:
+            refdir = d.getVar("BBFETCH_GIT_REFERENCEDIR")
+            if refdir:
+                ref_repo = os.path.join(refdir, ud.reference)
 
         # If the repo still doesn't exist, fallback to cloning it
         if not os.path.exists(ud.clonedir):
@@ -344,6 +363,9 @@ class Git(FetchMethod):
             if repourl.startswith("file://"):
                 repourl = repourl[7:]
             clone_cmd = "LANG=C %s clone --bare --mirror %s %s --progress" % (ud.basecmd, shlex.quote(repourl), ud.clonedir)
+            if ref_repo:
+                clone_cmd += " --reference-if-able %s --dissociate" % shlex.quote(ref_repo)
+
             if ud.proto.lower() != 'file':
                 bb.fetch2.check_network_access(d, clone_cmd, ud.url)
             progresshandler = GitProgressHandler(d)
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [bitbake-devel] [PATCH v2] fetch: git: add reference parameter
       [not found] ` <1640541B397C8D4C.8847@lists.openembedded.org>
@ 2020-10-30  8:52   ` Rasmus Villemoes
  2020-10-30  9:39     ` Richard Purdie
  0 siblings, 1 reply; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-30  8:52 UTC (permalink / raw)
  To: bitbake-devel; +Cc: Richard Purdie

On 22/10/2020 15.38, Rasmus Villemoes via lists.openembedded.org wrote:
> The --reference(-if-able) option to git clone can significantly speed
> up the initial do_fetch if one is able to point git at a local
> repository containing a large number of the objects that would
> otherwise need to be downloaded.
> 
> Allow the developer to define a directory BBFETCH_GIT_REFERENCEDIR,
> subdirectories of which can be used as keys for the new reference=
> parameter to git SRC_URIs. 

Hi Richard,

Is there anything I can do to help move this forward?

Rasmus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bitbake-devel] [PATCH v2] fetch: git: add reference parameter
  2020-10-30  8:52   ` [bitbake-devel] " Rasmus Villemoes
@ 2020-10-30  9:39     ` Richard Purdie
  2020-10-30 15:06       ` Rasmus Villemoes
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Purdie @ 2020-10-30  9:39 UTC (permalink / raw)
  To: Rasmus Villemoes, bitbake-devel

Hi Rasmus,

On Fri, 2020-10-30 at 09:52 +0100, Rasmus Villemoes wrote:
> On 22/10/2020 15.38, Rasmus Villemoes via lists.openembedded.org
> wrote:
> > The --reference(-if-able) option to git clone can significantly
> > speed
> > up the initial do_fetch if one is able to point git at a local
> > repository containing a large number of the objects that would
> > otherwise need to be downloaded.
> > 
> > Allow the developer to define a directory BBFETCH_GIT_REFERENCEDIR,
> > subdirectories of which can be used as keys for the new reference=
> > parameter to git SRC_URIs. 
> 
> Is there anything I can do to help move this forward?

There are several things concerning me:

a) no test case for bitbake-selftest

b) no update to the git fetcher in the documentation

c) the functionality I mentioned to make this more useful to others is
left as an exercise for someone else

d) the maintainer is apparently misguided about c) in the first place
having taken the time to try and explain it

e) we're in the middle of a feature freeze and release


If c) is left there is the risk we find the interface needs tweaking to
support it and we become a little stuck if the functionality already
merged. I often wish I could say I didn't need or like something so I'd
not work on it but if I did that the project would look very different.

Cheers,

Richard



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bitbake-devel] [PATCH v2] fetch: git: add reference parameter
  2020-10-30  9:39     ` Richard Purdie
@ 2020-10-30 15:06       ` Rasmus Villemoes
  0 siblings, 0 replies; 14+ messages in thread
From: Rasmus Villemoes @ 2020-10-30 15:06 UTC (permalink / raw)
  To: Richard Purdie, bitbake-devel

On 30/10/2020 10.39, Richard Purdie wrote:
> Hi Rasmus,
> 
> On Fri, 2020-10-30 at 09:52 +0100, Rasmus Villemoes wrote:
>> On 22/10/2020 15.38, Rasmus Villemoes via lists.openembedded.org
>> wrote:
>>> The --reference(-if-able) option to git clone can significantly
>>> speed
>>> up the initial do_fetch if one is able to point git at a local
>>> repository containing a large number of the objects that would
>>> otherwise need to be downloaded.
>>>
>>> Allow the developer to define a directory BBFETCH_GIT_REFERENCEDIR,
>>> subdirectories of which can be used as keys for the new reference=
>>> parameter to git SRC_URIs. 
>>
>> Is there anything I can do to help move this forward?
> 
> There are several things concerning me:
> 
> a) no test case for bitbake-selftest

Agreed. I tested it manually (simply observing that the do_fetch was
much faster when a ref repo was provided), but I'm not really sure how
to translate that to an automatic test.

> b) no update to the git fetcher in the documentation

I can certainly copy the text I added at the top of the file to
doc/bitbake-user-manual/bitbake-user-manual-fetching.xml, with the
relevant markup added.

> c) the functionality I mentioned to make this more useful to others is
> left as an exercise for someone else

Yes, because as I said, there are some questions to answer. As I don't
personally see that extra thing you insist on as being something I'd
want to use, I won't even try to provide the answers to those questions.

Also, it is unfair to keep stating that this won't be useful to others
in the current form (though I do note that you said "more useful" here -
previously you said "I'd like something which can be used by others" and
"[...] but in the general case I'd like others to be able to
benefit and use this too.").

> d) the maintainer is apparently misguided about c) in the first place
> having taken the time to try and explain it

I don't know if you're misguided, I merely pointed out that there is no
evidence backing the use of phrases like "most users" or "In the common
case people will want". Sure, _some_ people (and that may, in fact, be
_most_, I don't know) may also want to push the objects into the ref
repo. But nobody gains anything if a ref repo can't be specified in the
first place.

> e) we're in the middle of a feature freeze and release

All the more reason to try and get this in shape for inclusion into the
next release.

> If c) is left there is the risk we find the interface needs tweaking to
> support it and we become a little stuck if the functionality already
> merged.

That's a convenient argument you can use to refuse any feature that is
not implemented to the full extent and exactly as you would have done it
had you had the time to do the code yourself.

I'd argue the other way around: If I implemented the "push the objects
to the ref repo", thus implicitly answering the design questions that
need answering by the behaviour of however I managed to do it, we'd be
stuck with _that_ behaviour alright, even if the choices I had to make
were not those that would have been made by someone who actually had a
real-life usecase for that part, and thus much better insight into what
would be reasonable answers.

Rasmus

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-10-30 15:06 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-18 21:33 [PATCH] fetch: git: add reference parameter Rasmus Villemoes
2020-10-19  9:09 ` Richard Purdie
2020-10-19 10:25   ` Rasmus Villemoes
2020-10-19 10:43     ` Richard Purdie
2020-10-19 11:25       ` Rasmus Villemoes
2020-10-19 11:37         ` Richard Purdie
2020-10-19 12:11           ` Rasmus Villemoes
2020-10-19 12:28             ` Richard Purdie
2020-10-19 12:56               ` Rasmus Villemoes
2020-10-19 13:15                 ` Richard Purdie
2020-10-22 13:38 ` [PATCH v2] " Rasmus Villemoes
     [not found] ` <1640541B397C8D4C.8847@lists.openembedded.org>
2020-10-30  8:52   ` [bitbake-devel] " Rasmus Villemoes
2020-10-30  9:39     ` Richard Purdie
2020-10-30 15:06       ` Rasmus Villemoes

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.