linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] scripts/setlocalversion: make git describe output more reliable
@ 2020-09-10 11:26 Rasmus Villemoes
  2020-09-10 14:28 ` Guenter Roeck
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Rasmus Villemoes @ 2020-09-10 11:26 UTC (permalink / raw)
  To: Masahiro Yamada, Randy Dunlap, Brian Norris, Guenter Roeck,
	Bhaskar Chowdhury
  Cc: Rasmus Villemoes, linux-kernel

When building for an embedded target using Yocto, we're sometimes
observing that the version string that gets built into vmlinux (and
thus what uname -a reports) differs from the path under /lib/modules/
where modules get installed in the rootfs, but only in the length of
the -gabc123def suffix. Hence modprobe always fails.

The problem is that Yocto has the concept of "sstate" (shared state),
which allows different developers/buildbots/etc. to share build
artifacts, based on a hash of all the metadata that went into building
that artifact - and that metadata includes all dependencies (e.g. the
compiler used etc.). That normally works quite well; usually a clean
build (without using any sstate cache) done by one developer ends up
being binary identical to a build done on another host. However, one
thing that can cause two developers to end up with different builds
[and thus make one's vmlinux package incompatible with the other's
kernel-dev package], which is not captured by the metadata hashing, is
this `git describe`: The output of that can be affected by

(1) git version: before 2.11 git defaulted to a minimum of 7, since
2.11 (git.git commit e6c587) the default is dynamic based on the
number of objects in the repo
(2) hence even if both run the same git version, the output can differ
based on how many remotes are being tracked (or just lots of local
development branches or plain old garbage)
(3) and of course somebody could have a core.abbrev config setting in
~/.gitconfig

So in order to avoid `uname -a` output relying on such random details
of the build environment which are rather hard to ensure are
consistent between developers and buildbots, use an explicit
--abbrev=15 option (and for consistency, also use rev-parse --short=15
for the unlikely case of no signed tags being usable).

Now, why is 60 bits enough for everyone? It's not mathematically
guaranteed that git won't have to use 16 in some git repo, but it is
beyond unlikely: Even in a repo with 100M objects, the probability
that any given commit (i.e. the one being described) clashes with some
other object in the first 15 hex chars is less than 1e-10, and
currently a git repo tracking Linus', -stable and -rt only has around
10M objects.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
I could probably fix things by adding a 'git config --local
core.abbrev 15' step to the Yocto build process after the repo to
build from has been cloned but before building has started. But in the
interest of binary reproducibility outside of just Yocto builds, I
think it's better if this lives in the kernel.

 scripts/setlocalversion | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/setlocalversion b/scripts/setlocalversion
index 20f2efd57b11..c5262f0d953d 100755
--- a/scripts/setlocalversion
+++ b/scripts/setlocalversion
@@ -45,7 +45,7 @@ scm_version()
 
 	# Check for git and a git repo.
 	if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
-	   head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
+	   head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then
 
 		# If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
 		# it, because this version is defined in the top level Makefile.
@@ -59,7 +59,7 @@ scm_version()
 			fi
 			# If we are past a tagged commit (like
 			# "v2.6.30-rc5-302-g72357d5"), we pretty print it.
-			if atag="$(git describe 2>/dev/null)"; then
+			if atag="$(git describe --abbrev=15 2>/dev/null)"; then
 				echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
 
 			# If we don't have a tag at all we print -g{commitish}.
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-10 11:26 [PATCH] scripts/setlocalversion: make git describe output more reliable Rasmus Villemoes
@ 2020-09-10 14:28 ` Guenter Roeck
  2020-09-10 14:34 ` Masahiro Yamada
  2020-09-17  6:56 ` [PATCH v2] " Rasmus Villemoes
  2 siblings, 0 replies; 17+ messages in thread
From: Guenter Roeck @ 2020-09-10 14:28 UTC (permalink / raw)
  To: Rasmus Villemoes, Masahiro Yamada, Randy Dunlap, Brian Norris,
	Bhaskar Chowdhury
  Cc: linux-kernel

On 9/10/20 4:26 AM, Rasmus Villemoes wrote:
> When building for an embedded target using Yocto, we're sometimes
> observing that the version string that gets built into vmlinux (and
> thus what uname -a reports) differs from the path under /lib/modules/
> where modules get installed in the rootfs, but only in the length of
> the -gabc123def suffix. Hence modprobe always fails.
> 
> The problem is that Yocto has the concept of "sstate" (shared state),
> which allows different developers/buildbots/etc. to share build
> artifacts, based on a hash of all the metadata that went into building
> that artifact - and that metadata includes all dependencies (e.g. the
> compiler used etc.). That normally works quite well; usually a clean
> build (without using any sstate cache) done by one developer ends up
> being binary identical to a build done on another host. However, one
> thing that can cause two developers to end up with different builds
> [and thus make one's vmlinux package incompatible with the other's
> kernel-dev package], which is not captured by the metadata hashing, is
> this `git describe`: The output of that can be affected by
> 
> (1) git version: before 2.11 git defaulted to a minimum of 7, since
> 2.11 (git.git commit e6c587) the default is dynamic based on the
> number of objects in the repo
> (2) hence even if both run the same git version, the output can differ
> based on how many remotes are being tracked (or just lots of local
> development branches or plain old garbage)
> (3) and of course somebody could have a core.abbrev config setting in
> ~/.gitconfig
> 
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, use an explicit
> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> for the unlikely case of no signed tags being usable).
> 
> Now, why is 60 bits enough for everyone? It's not mathematically
> guaranteed that git won't have to use 16 in some git repo, but it is
> beyond unlikely: Even in a repo with 100M objects, the probability
> that any given commit (i.e. the one being described) clashes with some
> other object in the first 15 hex chars is less than 1e-10, and
> currently a git repo tracking Linus', -stable and -rt only has around
> 10M objects.
> 
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>

Makes sense to me.

Acked-by: Guenter Roeck <linux@roeck-us.net>

> ---
> I could probably fix things by adding a 'git config --local
> core.abbrev 15' step to the Yocto build process after the repo to
> build from has been cloned but before building has started. But in the
> interest of binary reproducibility outside of just Yocto builds, I
> think it's better if this lives in the kernel.
> 
>  scripts/setlocalversion | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
> index 20f2efd57b11..c5262f0d953d 100755
> --- a/scripts/setlocalversion
> +++ b/scripts/setlocalversion
> @@ -45,7 +45,7 @@ scm_version()
>  
>  	# Check for git and a git repo.
>  	if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
> -	   head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
> +	   head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then
>  
>  		# If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
>  		# it, because this version is defined in the top level Makefile.
> @@ -59,7 +59,7 @@ scm_version()
>  			fi
>  			# If we are past a tagged commit (like
>  			# "v2.6.30-rc5-302-g72357d5"), we pretty print it.
> -			if atag="$(git describe 2>/dev/null)"; then
> +			if atag="$(git describe --abbrev=15 2>/dev/null)"; then
>  				echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
>  
>  			# If we don't have a tag at all we print -g{commitish}.
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-10 11:26 [PATCH] scripts/setlocalversion: make git describe output more reliable Rasmus Villemoes
  2020-09-10 14:28 ` Guenter Roeck
@ 2020-09-10 14:34 ` Masahiro Yamada
  2020-09-10 19:05   ` Brian Norris
                     ` (2 more replies)
  2020-09-17  6:56 ` [PATCH v2] " Rasmus Villemoes
  2 siblings, 3 replies; 17+ messages in thread
From: Masahiro Yamada @ 2020-09-10 14:34 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Randy Dunlap, Brian Norris, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> When building for an embedded target using Yocto, we're sometimes
> observing that the version string that gets built into vmlinux (and
> thus what uname -a reports) differs from the path under /lib/modules/
> where modules get installed in the rootfs, but only in the length of
> the -gabc123def suffix. Hence modprobe always fails.
>
> The problem is that Yocto has the concept of "sstate" (shared state),
> which allows different developers/buildbots/etc. to share build
> artifacts, based on a hash of all the metadata that went into building
> that artifact - and that metadata includes all dependencies (e.g. the
> compiler used etc.). That normally works quite well; usually a clean
> build (without using any sstate cache) done by one developer ends up
> being binary identical to a build done on another host. However, one
> thing that can cause two developers to end up with different builds
> [and thus make one's vmlinux package incompatible with the other's
> kernel-dev package], which is not captured by the metadata hashing, is
> this `git describe`: The output of that can be affected by
>
> (1) git version: before 2.11 git defaulted to a minimum of 7, since
> 2.11 (git.git commit e6c587) the default is dynamic based on the
> number of objects in the repo
> (2) hence even if both run the same git version, the output can differ
> based on how many remotes are being tracked (or just lots of local
> development branches or plain old garbage)
> (3) and of course somebody could have a core.abbrev config setting in
> ~/.gitconfig
>
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, use an explicit
> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> for the unlikely case of no signed tags being usable).
>
> Now, why is 60 bits enough for everyone? It's not mathematically
> guaranteed that git won't have to use 16 in some git repo, but it is
> beyond unlikely: Even in a repo with 100M objects, the probability
> that any given commit (i.e. the one being described) clashes with some
> other object in the first 15 hex chars is less than 1e-10, and
> currently a git repo tracking Linus', -stable and -rt only has around
> 10M objects.


I agree that any randomness should be avoided.

My question is, do we need 15-digits?


The kernelrelease is formed by
[kernel version] + [some digits of git hash].


For example, "git describe" shows as follows:

v5.9.0-rc4-00034-g7fe10096c150


Linus gives a new tag every week (or every two week).


So, I think the conflict happens
only when we have two commits that start with the same 7-digits
in the _same_ release. Is this correct?

We have 14000 - 15000 commits in each release,
not 100M.





>
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> ---
> I could probably fix things by adding a 'git config --local
> core.abbrev 15' step to the Yocto build process after the repo to
> build from has been cloned but before building has started. But in the
> interest of binary reproducibility outside of just Yocto builds, I
> think it's better if this lives in the kernel.
>
>  scripts/setlocalversion | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
> index 20f2efd57b11..c5262f0d953d 100755
> --- a/scripts/setlocalversion
> +++ b/scripts/setlocalversion
> @@ -45,7 +45,7 @@ scm_version()
>
>         # Check for git and a git repo.
>         if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
> -          head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
> +          head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then
>
>                 # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
>                 # it, because this version is defined in the top level Makefile.
> @@ -59,7 +59,7 @@ scm_version()
>                         fi
>                         # If we are past a tagged commit (like
>                         # "v2.6.30-rc5-302-g72357d5"), we pretty print it.
> -                       if atag="$(git describe 2>/dev/null)"; then
> +                       if atag="$(git describe --abbrev=15 2>/dev/null)"; then
>                                 echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
>
>                         # If we don't have a tag at all we print -g{commitish}.
> --
> 2.23.0
>


-- 
Best Regards
Masahiro Yamada

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-10 14:34 ` Masahiro Yamada
@ 2020-09-10 19:05   ` Brian Norris
  2020-09-11  8:28     ` Rasmus Villemoes
  2020-09-10 22:56   ` Bhaskar Chowdhury
  2020-09-16  8:48   ` Rasmus Villemoes
  2 siblings, 1 reply; 17+ messages in thread
From: Brian Norris @ 2020-09-10 19:05 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Rasmus Villemoes, Randy Dunlap, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
> > So in order to avoid `uname -a` output relying on such random details
> > of the build environment which are rather hard to ensure are
> > consistent between developers and buildbots, use an explicit
> > --abbrev=15 option (and for consistency, also use rev-parse --short=15
> > for the unlikely case of no signed tags being usable).

For the patch:

Reviewed-by: Brian Norris <briannorris@chromium.org>

> I agree that any randomness should be avoided.
>
> My question is, do we need 15-digits?
...
> So, I think the conflict happens
> only when we have two commits that start with the same 7-digits
> in the _same_ release. Is this correct?

For the rev-parse usage ("unlikely case where we have no signed tag"),
the total number of objects is definitely relevant, and the man-page
says:
"unique prefix with at least length characters"
i.e., it might be longer, if needed for uniqueness.

For git-describe (the case where we have a tag to base off):
"use <n> digits, or as many digits as needed to form a unique object name"
I'm not quite sure whether the uniqueness is including anything about
the tag-relative prefix, but if not, then we have the same problem.

At least, that's my reading of the situation.

Brian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-10 14:34 ` Masahiro Yamada
  2020-09-10 19:05   ` Brian Norris
@ 2020-09-10 22:56   ` Bhaskar Chowdhury
  2020-09-16  8:48   ` Rasmus Villemoes
  2 siblings, 0 replies; 17+ messages in thread
From: Bhaskar Chowdhury @ 2020-09-10 22:56 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Rasmus Villemoes, Randy Dunlap, Brian Norris, Guenter Roeck,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 5048 bytes --]

On 23:34 Thu 10 Sep 2020, Masahiro Yamada wrote:
>On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
><linux@rasmusvillemoes.dk> wrote:
>>
>> When building for an embedded target using Yocto, we're sometimes
>> observing that the version string that gets built into vmlinux (and
>> thus what uname -a reports) differs from the path under /lib/modules/
>> where modules get installed in the rootfs, but only in the length of
>> the -gabc123def suffix. Hence modprobe always fails.
>>
>> The problem is that Yocto has the concept of "sstate" (shared state),
>> which allows different developers/buildbots/etc. to share build
>> artifacts, based on a hash of all the metadata that went into building
>> that artifact - and that metadata includes all dependencies (e.g. the
>> compiler used etc.). That normally works quite well; usually a clean
>> build (without using any sstate cache) done by one developer ends up
>> being binary identical to a build done on another host. However, one
>> thing that can cause two developers to end up with different builds
>> [and thus make one's vmlinux package incompatible with the other's
>> kernel-dev package], which is not captured by the metadata hashing, is
>> this `git describe`: The output of that can be affected by
>>
>> (1) git version: before 2.11 git defaulted to a minimum of 7, since
>> 2.11 (git.git commit e6c587) the default is dynamic based on the
>> number of objects in the repo
>> (2) hence even if both run the same git version, the output can differ
>> based on how many remotes are being tracked (or just lots of local
>> development branches or plain old garbage)
>> (3) and of course somebody could have a core.abbrev config setting in
>> ~/.gitconfig
>>
>> So in order to avoid `uname -a` output relying on such random details
>> of the build environment which are rather hard to ensure are
>> consistent between developers and buildbots, use an explicit
>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
>> for the unlikely case of no signed tags being usable).
>>
>> Now, why is 60 bits enough for everyone? It's not mathematically
>> guaranteed that git won't have to use 16 in some git repo, but it is
>> beyond unlikely: Even in a repo with 100M objects, the probability
>> that any given commit (i.e. the one being described) clashes with some
>> other object in the first 15 hex chars is less than 1e-10, and
>> currently a git repo tracking Linus', -stable and -rt only has around
>> 10M objects.
>
>
>I agree that any randomness should be avoided.
>
>My question is, do we need 15-digits?
>
>
>The kernelrelease is formed by
>[kernel version] + [some digits of git hash].
>
>
>For example, "git describe" shows as follows:
>
>v5.9.0-rc4-00034-g7fe10096c150
>
>
>Linus gives a new tag every week (or every two week).
>
>
>So, I think the conflict happens
>only when we have two commits that start with the same 7-digits
>in the _same_ release. Is this correct?
>
>We have 14000 - 15000 commits in each release,
>not 100M.
>
>
  I kinda agree with this...we need to chopped down the excess bits from the
  information .  

  Indeed , a 15 digits is too long  to keep up.

  ~Bhaskar
>
>
>>
>> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
>> ---
>> I could probably fix things by adding a 'git config --local
>> core.abbrev 15' step to the Yocto build process after the repo to
>> build from has been cloned but before building has started. But in the
>> interest of binary reproducibility outside of just Yocto builds, I
>> think it's better if this lives in the kernel.
>>
>>  scripts/setlocalversion | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
>> index 20f2efd57b11..c5262f0d953d 100755
>> --- a/scripts/setlocalversion
>> +++ b/scripts/setlocalversion
>> @@ -45,7 +45,7 @@ scm_version()
>>
>>         # Check for git and a git repo.
>>         if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
>> -          head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
>> +          head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then
>>
>>                 # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
>>                 # it, because this version is defined in the top level Makefile.
>> @@ -59,7 +59,7 @@ scm_version()
>>                         fi
>>                         # If we are past a tagged commit (like
>>                         # "v2.6.30-rc5-302-g72357d5"), we pretty print it.
>> -                       if atag="$(git describe 2>/dev/null)"; then
>> +                       if atag="$(git describe --abbrev=15 2>/dev/null)"; then
>>                                 echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
>>
>>                         # If we don't have a tag at all we print -g{commitish}.
>> --
>> 2.23.0
>>
>
>
>-- 
>Best Regards
>Masahiro Yamada

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-10 19:05   ` Brian Norris
@ 2020-09-11  8:28     ` Rasmus Villemoes
  2020-09-16 14:28       ` Masahiro Yamada
  0 siblings, 1 reply; 17+ messages in thread
From: Rasmus Villemoes @ 2020-09-11  8:28 UTC (permalink / raw)
  To: Brian Norris, Masahiro Yamada
  Cc: Randy Dunlap, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On 10/09/2020 21.05, Brian Norris wrote:
> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
>> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
>> <linux@rasmusvillemoes.dk> wrote:
>>> So in order to avoid `uname -a` output relying on such random details
>>> of the build environment which are rather hard to ensure are
>>> consistent between developers and buildbots, use an explicit
>>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
>>> for the unlikely case of no signed tags being usable).
> 
> For the patch:
> 
> Reviewed-by: Brian Norris <briannorris@chromium.org>
> 
>> I agree that any randomness should be avoided.
>>
>> My question is, do we need 15-digits?
> ...
>> So, I think the conflict happens
>> only when we have two commits that start with the same 7-digits
>> in the _same_ release. Is this correct?

No.

> For git-describe (the case where we have a tag to base off):
> "use <n> digits, or as many digits as needed to form a unique object name"

Yes, the abbreviated hash that `git describe` produces is unique among
all objects (and objects are more than just commits) in the current
repo, so what matters for probability-of-collision is the total number
of objects - linus.git itself probably grows ~60000 objects per release.

As for "do we need 15 digits", well, in theory the next time I merge the
next rt-stable tag into our kernel I could end up with a commit that
matches some existing object in the first 33 hex chars at which point I
should have chosen 34 - but of course that's so unlikely that it's
irrelevant.

But the upshot of that is that there really is no objective answer to
"how many digits do we need", so it becomes a tradeoff between "low
enough probability that anyone anywhere in the next few years would have
needed more than X when building their own kernel" and readability of
`uname -r` etc. So I decided somewhat arbitrarily that each time one
rolls a new release, there should be less than 1e-9 probability that
HEAD collides with some other object when abbreviated to X hexchars.
X=12 doesn't pass that criteria even when the repo has only 10M objects
(and, current linus.git already has objects that need 12 to be unique,
so such collisions do happen in practice, though of course very rarely).
13 and 14 are just weird numbers, so I ended with 15, which also allows
many many more objects in the repo before the probability crosses that
1e-9 threshold.

Rasmus

Side note 1: Note that there really isn't any such thing as "last
tag/previous tag" in a DAG; I think what git does is a best-effort
search for the eligible tag that minimizes #({objects reachable from
commit-to-be-described} - {objects reachable from tag}), where eligible
tag means at least reachable from commit-to-be-described (so that set
difference is a proper one), but there can be additional constraints
(e.g. --match=...).

Side note 2: Linus or Greg releases are actually not interesting here
(see the logic in setlocalversion that stops early if we're exactly at
an annotated tag) - the whole raison d'etre for setlocalversion is that
people do maintain and build non-mainline kernels, and it's extremely
useful for `uname -a` to accurately tell just which commit is running on
the target.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-10 14:34 ` Masahiro Yamada
  2020-09-10 19:05   ` Brian Norris
  2020-09-10 22:56   ` Bhaskar Chowdhury
@ 2020-09-16  8:48   ` Rasmus Villemoes
  2 siblings, 0 replies; 17+ messages in thread
From: Rasmus Villemoes @ 2020-09-16  8:48 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Randy Dunlap, Brian Norris, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On 10/09/2020 16.34, Masahiro Yamada wrote:
> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>>
...
>>
>> So in order to avoid `uname -a` output relying on such random details
>> of the build environment which are rather hard to ensure are
>> consistent between developers and buildbots, use an explicit
>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
>> for the unlikely case of no signed tags being usable).
>>

Hi Masahiro

Can I get you to carry this through the kbuild tree? Unless of course
your questions/concerns haven't been addressed.

Thanks,
Rasmus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-11  8:28     ` Rasmus Villemoes
@ 2020-09-16 14:28       ` Masahiro Yamada
  2020-09-16 15:23         ` Rasmus Villemoes
  0 siblings, 1 reply; 17+ messages in thread
From: Masahiro Yamada @ 2020-09-16 14:28 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Brian Norris, Randy Dunlap, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On 10/09/2020 21.05, Brian Norris wrote:
> > On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
> >> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
> >> <linux@rasmusvillemoes.dk> wrote:
> >>> So in order to avoid `uname -a` output relying on such random details
> >>> of the build environment which are rather hard to ensure are
> >>> consistent between developers and buildbots, use an explicit
> >>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> >>> for the unlikely case of no signed tags being usable).
> >
> > For the patch:
> >
> > Reviewed-by: Brian Norris <briannorris@chromium.org>
> >
> >> I agree that any randomness should be avoided.
> >>
> >> My question is, do we need 15-digits?
> > ...
> >> So, I think the conflict happens
> >> only when we have two commits that start with the same 7-digits
> >> in the _same_ release. Is this correct?
>
> No.
>
> > For git-describe (the case where we have a tag to base off):
> > "use <n> digits, or as many digits as needed to form a unique object name"
>
> Yes, the abbreviated hash that `git describe` produces is unique among
> all objects (and objects are more than just commits) in the current
> repo, so what matters for probability-of-collision is the total number
> of objects - linus.git itself probably grows ~60000 objects per release.
>
> As for "do we need 15 digits", well, in theory the next time I merge the
> next rt-stable tag into our kernel I could end up with a commit that
> matches some existing object in the first 33 hex chars at which point I
> should have chosen 34 - but of course that's so unlikely that it's
> irrelevant.
>
> But the upshot of that is that there really is no objective answer to
> "how many digits do we need", so it becomes a tradeoff between "low
> enough probability that anyone anywhere in the next few years would have
> needed more than X when building their own kernel" and readability of
> `uname -r` etc. So I decided somewhat arbitrarily that each time one
> rolls a new release, there should be less than 1e-9 probability that
> HEAD collides with some other object when abbreviated to X hexchars.
> X=12 doesn't pass that criteria even when the repo has only 10M objects
> (and, current linus.git already has objects that need 12 to be unique,
> so such collisions do happen in practice, though of course very rarely).
> 13 and 14 are just weird numbers, so I ended with 15, which also allows
> many many more objects in the repo before the probability crosses that
> 1e-9 threshold.
>
> Rasmus
>
> Side note 1: Note that there really isn't any such thing as "last
> tag/previous tag" in a DAG; I think what git does is a best-effort
> search for the eligible tag that minimizes #({objects reachable from
> commit-to-be-described} - {objects reachable from tag}), where eligible
> tag means at least reachable from commit-to-be-described (so that set
> difference is a proper one), but there can be additional constraints
> (e.g. --match=...).
>
> Side note 2: Linus or Greg releases are actually not interesting here
> (see the logic in setlocalversion that stops early if we're exactly at
> an annotated tag) - the whole raison d'etre for setlocalversion is that
> people do maintain and build non-mainline kernels, and it's extremely
> useful for `uname -a` to accurately tell just which commit is running on
> the target.



This is because you use the output from git as-is.

Why are you still trying to rely on such obscure behavior of git?


It is pretty easy to get the fixed number of digits reliably.

For example,
git rev-parse --verify HEAD 2>/dev/null | cut -c1-7


It always returns a 7-digits hash,
and two different people will get the same result for sure.

Your solution, --short=15, means "at least 15",
which still contains ambiguity.



As I already noted, the kernelrelease string is
constructed in this format:

<kernel-version>-<number-of-commits-on-top>-<abbreviated-commit-hash>


For example, if I enable CONFIG_LOCALVERSION_AUTO=y
in today's Linus tree, I got this:

5.9.0-rc5-00005-gfc4f28bb3daf


What if the number of digits were 7?

5.9.0-rc5-00005-gfc4f28b

I see no problem here.

Even if we have another object that starts with "fc4f28b",
the kernelrelease string is still unique thanks to the
<kernel-version>-<number-of-commits-on-top> prefix.

Why do we care about the uniqueness of the abbreviated
hash in the whole git history?



Note:
We prepend $(KERNELVERSION) to the result
of setlocalversion. [1]

[1] https://github.com/torvalds/linux/blob/v5.9-rc4/Makefile#L1175


-- 
Best Regards
Masahiro Yamada

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-16 14:28       ` Masahiro Yamada
@ 2020-09-16 15:23         ` Rasmus Villemoes
  2020-09-16 18:01           ` Masahiro Yamada
  0 siblings, 1 reply; 17+ messages in thread
From: Rasmus Villemoes @ 2020-09-16 15:23 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Brian Norris, Randy Dunlap, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On 16/09/2020 16.28, Masahiro Yamada wrote:
> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>>
>> On 10/09/2020 21.05, Brian Norris wrote:
>>> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
>>>> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
>>>> <linux@rasmusvillemoes.dk> wrote:
>>>>> So in order to avoid `uname -a` output relying on such random details
>>>>> of the build environment which are rather hard to ensure are
>>>>> consistent between developers and buildbots, use an explicit
>>>>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
>>>>> for the unlikely case of no signed tags being usable).
>>>
>>> For the patch:
>>>
>>> Reviewed-by: Brian Norris <briannorris@chromium.org>
>>>
>>>> I agree that any randomness should be avoided.
>>>>
>>>> My question is, do we need 15-digits?
>>> ...
>>>> So, I think the conflict happens
>>>> only when we have two commits that start with the same 7-digits
>>>> in the _same_ release. Is this correct?
>>
>> No.
>>
>>> For git-describe (the case where we have a tag to base off):
>>> "use <n> digits, or as many digits as needed to form a unique object name"
>>
>> Yes, the abbreviated hash that `git describe` produces is unique among
>> all objects (and objects are more than just commits) in the current
>> repo, so what matters for probability-of-collision is the total number
>> of objects - linus.git itself probably grows ~60000 objects per release.
>>
>> As for "do we need 15 digits", well, in theory the next time I merge the
>> next rt-stable tag into our kernel I could end up with a commit that
>> matches some existing object in the first 33 hex chars at which point I
>> should have chosen 34 - but of course that's so unlikely that it's
>> irrelevant.
>>
>> But the upshot of that is that there really is no objective answer to
>> "how many digits do we need", so it becomes a tradeoff between "low
>> enough probability that anyone anywhere in the next few years would have
>> needed more than X when building their own kernel" and readability of
>> `uname -r` etc. So I decided somewhat arbitrarily that each time one
>> rolls a new release, there should be less than 1e-9 probability that
>> HEAD collides with some other object when abbreviated to X hexchars.
>> X=12 doesn't pass that criteria even when the repo has only 10M objects
>> (and, current linus.git already has objects that need 12 to be unique,
>> so such collisions do happen in practice, though of course very rarely).
>> 13 and 14 are just weird numbers, so I ended with 15, which also allows
>> many many more objects in the repo before the probability crosses that
>> 1e-9 threshold.
>>
> 
> 
> This is because you use the output from git as-is.
> 
> Why are you still trying to rely on such obscure behavior of git?
> 
> 
> It is pretty easy to get the fixed number of digits reliably.
> 
> For example,
> git rev-parse --verify HEAD 2>/dev/null | cut -c1-7
> 
> 
> It always returns a 7-digits hash,
> and two different people will get the same result for sure.
> 
> Your solution, --short=15, means "at least 15",
> which still contains ambiguity.
> 
> 
> 
> As I already noted, the kernelrelease string is
> constructed in this format:
> 
> <kernel-version>-<number-of-commits-on-top>-<abbreviated-commit-hash>
> 
> 
> For example, if I enable CONFIG_LOCALVERSION_AUTO=y
> in today's Linus tree, I got this:
> 
> 5.9.0-rc5-00005-gfc4f28bb3daf
> 
> 
> What if the number of digits were 7?
> 
> 5.9.0-rc5-00005-gfc4f28b
> 
> I see no problem here.

The problem is that currently, the build relies on things which cannot
easily be controlled or reproduced between different developers: Apart
from git version (which is reasonable to mandate is the same, just like
one must use same compiler, binutils etc. to get binary reproducible
output), it depends on whether ~/.gitconfig has a core.abbrev setting -
and even worse, it depends on the _total number of objects in the source
git repo_, i.e. something that has nothing to do with what is currently
in the work tree at all.

And that leads to real bugs when one builds external modules that end up
in one directory in the rootfs, while `uname -r` tells modprobe to look
in some other directory (differing in the length of the abbreviated hash).

> Even if we have another object that starts with "fc4f28b",
> the kernelrelease string is still unique thanks to the
> <kernel-version>-<number-of-commits-on-top> prefix.
> 
> Why do we care about the uniqueness of the abbreviated
> hash in the whole git history?

Because when I ask a customer "what kernel are you running", and they
tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
identifies the right commit, instead of me having to dig around each of
the commits starting with that prefix and see which one of them matches
"is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
enough for that today, which is why Linus himself did that "auto-tune
abbrev based on repo size" patch for git.git.

But what I mostly care about is getting a consistent result. And yes,
you're right that the porcelain command 'git describe' could end up
using something more than 15 digits (though that's extremely unlikely).
So if you prefer, I can try to rewrite the logic purely in terms of
plumbing commands. But that's a much more invasive patch, and one would
obviously lose the guarantee of the abbreviation being unique among
current git objects.

Alternatively, would you consider a Kconfig knob, LOCALVERSION_ABBREV?

Rasmus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-16 15:23         ` Rasmus Villemoes
@ 2020-09-16 18:01           ` Masahiro Yamada
  2020-09-16 19:31             ` Rasmus Villemoes
  0 siblings, 1 reply; 17+ messages in thread
From: Masahiro Yamada @ 2020-09-16 18:01 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Brian Norris, Randy Dunlap, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On 16/09/2020 16.28, Masahiro Yamada wrote:
> > On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
> > <linux@rasmusvillemoes.dk> wrote:
> >>
> >> On 10/09/2020 21.05, Brian Norris wrote:
> >>> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
> >>>> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
> >>>> <linux@rasmusvillemoes.dk> wrote:
> >>>>> So in order to avoid `uname -a` output relying on such random details
> >>>>> of the build environment which are rather hard to ensure are
> >>>>> consistent between developers and buildbots, use an explicit
> >>>>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> >>>>> for the unlikely case of no signed tags being usable).
> >>>
> >>> For the patch:
> >>>
> >>> Reviewed-by: Brian Norris <briannorris@chromium.org>
> >>>
> >>>> I agree that any randomness should be avoided.
> >>>>
> >>>> My question is, do we need 15-digits?
> >>> ...
> >>>> So, I think the conflict happens
> >>>> only when we have two commits that start with the same 7-digits
> >>>> in the _same_ release. Is this correct?
> >>
> >> No.
> >>
> >>> For git-describe (the case where we have a tag to base off):
> >>> "use <n> digits, or as many digits as needed to form a unique object name"
> >>
> >> Yes, the abbreviated hash that `git describe` produces is unique among
> >> all objects (and objects are more than just commits) in the current
> >> repo, so what matters for probability-of-collision is the total number
> >> of objects - linus.git itself probably grows ~60000 objects per release.
> >>
> >> As for "do we need 15 digits", well, in theory the next time I merge the
> >> next rt-stable tag into our kernel I could end up with a commit that
> >> matches some existing object in the first 33 hex chars at which point I
> >> should have chosen 34 - but of course that's so unlikely that it's
> >> irrelevant.
> >>
> >> But the upshot of that is that there really is no objective answer to
> >> "how many digits do we need", so it becomes a tradeoff between "low
> >> enough probability that anyone anywhere in the next few years would have
> >> needed more than X when building their own kernel" and readability of
> >> `uname -r` etc. So I decided somewhat arbitrarily that each time one
> >> rolls a new release, there should be less than 1e-9 probability that
> >> HEAD collides with some other object when abbreviated to X hexchars.
> >> X=12 doesn't pass that criteria even when the repo has only 10M objects
> >> (and, current linus.git already has objects that need 12 to be unique,
> >> so such collisions do happen in practice, though of course very rarely).
> >> 13 and 14 are just weird numbers, so I ended with 15, which also allows
> >> many many more objects in the repo before the probability crosses that
> >> 1e-9 threshold.
> >>
> >
> >
> > This is because you use the output from git as-is.
> >
> > Why are you still trying to rely on such obscure behavior of git?
> >
> >
> > It is pretty easy to get the fixed number of digits reliably.
> >
> > For example,
> > git rev-parse --verify HEAD 2>/dev/null | cut -c1-7
> >
> >
> > It always returns a 7-digits hash,
> > and two different people will get the same result for sure.
> >
> > Your solution, --short=15, means "at least 15",
> > which still contains ambiguity.
> >
> >
> >
> > As I already noted, the kernelrelease string is
> > constructed in this format:
> >
> > <kernel-version>-<number-of-commits-on-top>-<abbreviated-commit-hash>
> >
> >
> > For example, if I enable CONFIG_LOCALVERSION_AUTO=y
> > in today's Linus tree, I got this:
> >
> > 5.9.0-rc5-00005-gfc4f28bb3daf
> >
> >
> > What if the number of digits were 7?
> >
> > 5.9.0-rc5-00005-gfc4f28b
> >
> > I see no problem here.
>
> The problem is that currently, the build relies on things which cannot
> easily be controlled or reproduced between different developers: Apart
> from git version (which is reasonable to mandate is the same, just like
> one must use same compiler, binutils etc. to get binary reproducible
> output), it depends on whether ~/.gitconfig has a core.abbrev setting -
> and even worse, it depends on the _total number of objects in the source
> git repo_, i.e. something that has nothing to do with what is currently
> in the work tree at all.
>
> And that leads to real bugs when one builds external modules that end up
> in one directory in the rootfs, while `uname -r` tells modprobe to look
> in some other directory (differing in the length of the abbreviated hash).
>
> > Even if we have another object that starts with "fc4f28b",
> > the kernelrelease string is still unique thanks to the
> > <kernel-version>-<number-of-commits-on-top> prefix.
> >
> > Why do we care about the uniqueness of the abbreviated
> > hash in the whole git history?
>
> Because when I ask a customer "what kernel are you running", and they
> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
> identifies the right commit, instead of me having to dig around each of
> the commits starting with that prefix and see which one of them matches
> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
> enough for that today, which is why Linus himself did that "auto-tune
> abbrev based on repo size" patch for git.git.



I like:

   git rev-parse --verify HEAD 2>/dev/null | cut -c1-15

better than:

   git rev-parse --verify --short=15 HEAD 2>/dev/null



The former produces the deterministic kernelrelease string.


But, let's reconsider if we need as long as 15-digits.

There are a couple of kinds of objects in git: commit, tree, blob.

I think you came up with 15-digits to ensure the uniqueness
in _all_ kinds of objects.

But, when your customer says "4.19.45-rt67-00567-123abc8",
123abc8 is apparently a commit instead of a tree or a blob.



In the context of the kernel version, we can exclude
tree and blob objects.

In other words, I think "grows ~60000 objects per release"
is somewhat over-estimating.

We have ~15000 commits per release. So, the difference
is just 4x factor, though...



BTW, I did some experiments, and I noticed
"git log" accepts a shorter hash
than "git show" does presumably because
"git log" only takes a commit.



For example, "git show 06a0d" fails, but
"git log 06a0d" works.


masahiro@oscar:~/ref/linux$ git  show   06a0d
error: short SHA1 06a0d is ambiguous
hint: The candidates are:
hint:   06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued
'softback_lines' cursor() argument
hint:   06a0d81911b3 tree
hint:   06a0dc5a84d2 tree
hint:   06a0d1947c77 blob
hint:   06a0df434249 blob
fatal: ambiguous argument '06a0d': unknown revision or path not in the
working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
masahiro@oscar:~/ref/linux$ git  log --oneline  -1   06a0d
06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument




What is interesting is, if you prepend <tag>-<number-of-commits>-
to the abbreviated hash, "git show" accepts as short as a commit
"git log" does.

masahiro@oscar:~/ref/linux$ git  show   v5.9-rc5-00002-g06a0d  | head -1
commit 06a0df4d1b8b13b551668e47b11fd7629033b7df


I guess git cleverly understands it refers to a commit object
since "v5.9-rc5-00002-" prefix only makes sense
in the commit context.



Keeping those above in mind, I believe 15-digits is too long.


So, I propose two options.


[1] 7 digits

The abbreviated hash part may not uniquely identify
the commit. In that case, you need some extra git
operations to find out which one is it.

As for the kernel build,
<kernel-version>-<number-of-commits>-<7-digits> is enough
to provide the unique kernelrelease string.



[2] 12 digits

This matches to the Fixes: tag policy specified in
Documentation/process/submitting-patches.rst

The abbreviated hash part is very likely unique
in the commit object space.
(Of course, it is impossible to guarantee the uniqueness)



I wait for some comments.






> But what I mostly care about is getting a consistent result. And yes,
> you're right that the porcelain command 'git describe' could end up
> using something more than 15 digits (though that's extremely unlikely).
> So if you prefer, I can try to rewrite the logic purely in terms of
> plumbing commands. But that's a much more invasive patch, and one would
> obviously lose the guarantee of the abbreviation being unique among
> current git objects.
>
> Alternatively, would you consider a Kconfig knob, LOCALVERSION_ABBREV?


No, I do not think LOCALVERSION_ABBREV is a good idea.

We should eliminate this problem
irrespective of the kernel configuration.


--
Best Regards
Masahiro Yamada

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-16 18:01           ` Masahiro Yamada
@ 2020-09-16 19:31             ` Rasmus Villemoes
  2020-09-17  0:48               ` Masahiro Yamada
  0 siblings, 1 reply; 17+ messages in thread
From: Rasmus Villemoes @ 2020-09-16 19:31 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Brian Norris, Randy Dunlap, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On 16/09/2020 20.01, Masahiro Yamada wrote:
> On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>>
>> On 16/09/2020 16.28, Masahiro Yamada wrote:
>>> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
>>> <linux@rasmusvillemoes.dk> wrote:
>>>>

>>> Why do we care about the uniqueness of the abbreviated
>>> hash in the whole git history?
>>
>> Because when I ask a customer "what kernel are you running", and they
>> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
>> identifies the right commit, instead of me having to dig around each of
>> the commits starting with that prefix and see which one of them matches
>> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
>> enough for that today, which is why Linus himself did that "auto-tune
>> abbrev based on repo size" patch for git.git.
> 
> I like:
> 
>    git rev-parse --verify HEAD 2>/dev/null | cut -c1-15
> 
> better than:
> 
>    git rev-parse --verify --short=15 HEAD 2>/dev/null
> 
> The former produces the deterministic kernelrelease string.
> 
> 
> But, let's reconsider if we need as long as 15-digits.
> 
> There are a couple of kinds of objects in git: commit, tree, blob.
> 
> I think you came up with 15-digits to ensure the uniqueness
> in _all_ kinds of objects.
> 
> But, when your customer says "4.19.45-rt67-00567-123abc8",
> 123abc8 is apparently a commit instead of a tree or a blob.
> 
> 
> 
> In the context of the kernel version, we can exclude
> tree and blob objects.
> 
> In other words, I think "grows ~60000 objects per release"
> is somewhat over-estimating.
> 
> We have ~15000 commits per release. So, the difference
> is just 4x factor, though...
> 
> 
> 
> BTW, I did some experiments, and I noticed
> "git log" accepts a shorter hash
> than "git show" does presumably because
> "git log" only takes a commit.
> 
> 
> 
> For example, "git show 06a0d" fails, but
> "git log 06a0d" works.
> 
> 
> masahiro@oscar:~/ref/linux$ git  show   06a0d
> error: short SHA1 06a0d is ambiguous
> hint: The candidates are:
> hint:   06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued
> 'softback_lines' cursor() argument
> hint:   06a0d81911b3 tree
> hint:   06a0dc5a84d2 tree
> hint:   06a0d1947c77 blob
> hint:   06a0df434249 blob
> fatal: ambiguous argument '06a0d': unknown revision or path not in the
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> masahiro@oscar:~/ref/linux$ git  log --oneline  -1   06a0d
> 06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument
> 
> 
> 
> 
> What is interesting is, if you prepend <tag>-<number-of-commits>-
> to the abbreviated hash, "git show" accepts as short as a commit
> "git log" does.
> 
> masahiro@oscar:~/ref/linux$ git  show   v5.9-rc5-00002-g06a0d  | head -1
> commit 06a0df4d1b8b13b551668e47b11fd7629033b7df
> 
> 
> I guess git cleverly understands it refers to a commit object
> since "v5.9-rc5-00002-" prefix only makes sense
> in the commit context.
> 

Well, yes, but unfortunately only so far as applying the same "the user
means a commit object" logic; git doesn't do anything to actually use
the prefix to disambiguate. In fact, looking at a semi-random commit in
4.19-stable v4.19.114-7-g236c445eb352:

$ git show 236c44
error: short SHA1 236c44 is ambiguous
hint: The candidates are:
hint:   236c445eb352 commit 2020-03-13 - drm/bochs: downgrade
pci_request_region failure from error to warning
hint:   236c448cb6e7 commit 2011-09-08 - usbmon vs. tcpdump: fix dropped
packet count
hint:   236c445b1930 blob

Now you're right that we get

$ git show v4.19.114-7-g236c445 | head -n1
commit 236c445eb3529aa7c976f8812513c3cb26d77e27

so it knows we're not looking at that blob. But "git show
v4.19.114-7-g236c44" still fails because there are two commits. Adding
to the fun is that "git show v4.19.114-7-g236c448" actually shows the
usbmon commit, which is old v3.2 stuff and certainly not a descendant of
v4.19.114.

I didn't actually know that git would even accept those prefixed strings
as possible refspecs, and for a moment you had me hoping that git would
actually do the disambiguation using that prefix, which would certainly
make 7 hex chars enough; unfortunately that's not the case.

> Keeping those above in mind, I believe 15-digits is too long.

Well, as you indicated, commits are probably ~1/4 of all objects, i.e.
half a hexchar worth of bits. So the commit/object distinction shouldn't
matter very much for the choice of suitable fixed length.

> So, I propose two options.
> 
> 
> [1] 7 digits
> 
> The abbreviated hash part may not uniquely identify
> the commit. In that case, you need some extra git
> operations to find out which one is it.
> 
> As for the kernel build,
> <kernel-version>-<number-of-commits>-<7-digits> is enough
> to provide the unique kernelrelease string.
> 
> 
> 
> [2] 12 digits
> 
> This matches to the Fixes: tag policy specified in
> Documentation/process/submitting-patches.rst
> 
> The abbreviated hash part is very likely unique
> in the commit object space.
> (Of course, it is impossible to guarantee the uniqueness)

I can live with 12. It would be extremely rare that I would have to do
some extra git yoga to figure out which commit they actually mean. I
think we can still use git describe to do the bulk of the work (the 'git
rev-parse | cut ... is easy, but it's somewhat tedious to find the
closest tag, then compute the #commits-on-top part), then just have the
awk script used for pretty-printing (the %05d of the #commits-on-top)
can probably also be used to chop the abbreviated sha1 at 12 chars,
should git have needed to make it longer. Please see below for an
updated patch+commit log.

>> Alternatively, would you consider a Kconfig knob, LOCALVERSION_ABBREV?
> 
> 
> No, I do not think LOCALVERSION_ABBREV is a good idea.

Neither do I; I considered it before sending the patch but decided that yes:

> We should eliminate this problem
> irrespective of the kernel configuration.

Rasmus

====

something like this, then?


    scripts/setlocalversion: make git describe output more reliable

    When building for an embedded target using Yocto, we're sometimes
    observing that the version string that gets built into vmlinux (and
    thus what uname -a reports) differs from the path under /lib/modules/
    where modules get installed in the rootfs, but only in the length of
    the -gabc123def suffix. Hence modprobe always fails.

    The problem is that Yocto has the concept of "sstate" (shared state),
    which allows different developers/buildbots/etc. to share build
    artifacts, based on a hash of all the metadata that went into building
    that artifact - and that metadata includes all dependencies (e.g. the
    compiler used etc.). That normally works quite well; usually a clean
    build (without using any sstate cache) done by one developer ends up
    being binary identical to a build done on another host. However, one
    thing that can cause two developers to end up with different builds
    [and thus make one's vmlinux package incompatible with the other's
    kernel-dev package], which is not captured by the metadata hashing, is
    this `git describe`: The output of that can be affected by

    (1) git version: before 2.11 git defaulted to a minimum of 7, since
    2.11 (git.git commit e6c587) the default is dynamic based on the
    number of objects in the repo
    (2) hence even if both run the same git version, the output can differ
    based on how many remotes are being tracked (or just lots of local
    development branches or plain old garbage)
    (3) and of course somebody could have a core.abbrev config setting in
    ~/.gitconfig

    So in order to avoid `uname -a` output relying on such random details
    of the build environment which are rather hard to ensure are
    consistent between developers and buildbots, make sure the abbreviated
    sha1 always consists of exactly 12 hex characters. That is consistent
    with the current rule for -stable patches, and is almost always enough
    to identify the head commit unambigously - in the few cases where it
    does not, the v5.4.3-00021- prefix would certainly nail it down.

diff --git a/scripts/setlocalversion b/scripts/setlocalversion
index 20f2efd57b11..b51072167df1 100755
--- a/scripts/setlocalversion
+++ b/scripts/setlocalversion
@@ -45,7 +45,7 @@ scm_version()

 	# Check for git and a git repo.
 	if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
-	   head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
+	   head=$(git rev-parse --verify HEAD 2>/dev/null); then

 		# If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
 		# it, because this version is defined in the top level Makefile.
@@ -59,11 +59,22 @@ scm_version()
 			fi
 			# If we are past a tagged commit (like
 			# "v2.6.30-rc5-302-g72357d5"), we pretty print it.
-			if atag="$(git describe 2>/dev/null)"; then
-				echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
-
-			# If we don't have a tag at all we print -g{commitish}.
+                        #
+                        # Ensure the abbreviated sha1 has exactly 12
+                        # hex characters, to make the output
+                        # independent of git version, local
+                        # core.abbrev settings and/or total number of
+                        # objects in the current repository - passing
+                        # --abbrev=12 ensures a minimum of 12, and the
+                        # awk substr() then picks the 'g' and first 12
+                        # hex chars.
+			if atag="$(git describe --abbrev=12 2>/dev/null)"; then
+				echo "$atag" | awk -F- '{printf("-%05d-%s",
$(NF-1),substr($(NF),0,13))}'
+
+			# If we don't have a tag at all we print -g{commitish},
+			# again using exactly 12 hex chars.
 			else
+				head="$(echo $head | cut -c1-12)"
 				printf '%s%s' -g $head
 			fi
 		fi

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
  2020-09-16 19:31             ` Rasmus Villemoes
@ 2020-09-17  0:48               ` Masahiro Yamada
  0 siblings, 0 replies; 17+ messages in thread
From: Masahiro Yamada @ 2020-09-17  0:48 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Brian Norris, Randy Dunlap, Guenter Roeck, Bhaskar Chowdhury,
	Linux Kernel Mailing List

On Thu, Sep 17, 2020 at 4:31 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On 16/09/2020 20.01, Masahiro Yamada wrote:
> > On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes
> > <linux@rasmusvillemoes.dk> wrote:
> >>
> >> On 16/09/2020 16.28, Masahiro Yamada wrote:
> >>> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
> >>> <linux@rasmusvillemoes.dk> wrote:
> >>>>
>
> >>> Why do we care about the uniqueness of the abbreviated
> >>> hash in the whole git history?
> >>
> >> Because when I ask a customer "what kernel are you running", and they
> >> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
> >> identifies the right commit, instead of me having to dig around each of
> >> the commits starting with that prefix and see which one of them matches
> >> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
> >> enough for that today, which is why Linus himself did that "auto-tune
> >> abbrev based on repo size" patch for git.git.
> >
> > I like:
> >
> >    git rev-parse --verify HEAD 2>/dev/null | cut -c1-15
> >
> > better than:
> >
> >    git rev-parse --verify --short=15 HEAD 2>/dev/null
> >
> > The former produces the deterministic kernelrelease string.
> >
> >
> > But, let's reconsider if we need as long as 15-digits.
> >
> > There are a couple of kinds of objects in git: commit, tree, blob.
> >
> > I think you came up with 15-digits to ensure the uniqueness
> > in _all_ kinds of objects.
> >
> > But, when your customer says "4.19.45-rt67-00567-123abc8",
> > 123abc8 is apparently a commit instead of a tree or a blob.
> >
> >
> >
> > In the context of the kernel version, we can exclude
> > tree and blob objects.
> >
> > In other words, I think "grows ~60000 objects per release"
> > is somewhat over-estimating.
> >
> > We have ~15000 commits per release. So, the difference
> > is just 4x factor, though...
> >
> >
> >
> > BTW, I did some experiments, and I noticed
> > "git log" accepts a shorter hash
> > than "git show" does presumably because
> > "git log" only takes a commit.
> >
> >
> >
> > For example, "git show 06a0d" fails, but
> > "git log 06a0d" works.
> >
> >
> > masahiro@oscar:~/ref/linux$ git  show   06a0d
> > error: short SHA1 06a0d is ambiguous
> > hint: The candidates are:
> > hint:   06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued
> > 'softback_lines' cursor() argument
> > hint:   06a0d81911b3 tree
> > hint:   06a0dc5a84d2 tree
> > hint:   06a0d1947c77 blob
> > hint:   06a0df434249 blob
> > fatal: ambiguous argument '06a0d': unknown revision or path not in the
> > working tree.
> > Use '--' to separate paths from revisions, like this:
> > 'git <command> [<revision>...] -- [<file>...]'
> > masahiro@oscar:~/ref/linux$ git  log --oneline  -1   06a0d
> > 06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument
> >
> >
> >
> >
> > What is interesting is, if you prepend <tag>-<number-of-commits>-
> > to the abbreviated hash, "git show" accepts as short as a commit
> > "git log" does.
> >
> > masahiro@oscar:~/ref/linux$ git  show   v5.9-rc5-00002-g06a0d  | head -1
> > commit 06a0df4d1b8b13b551668e47b11fd7629033b7df
> >
> >
> > I guess git cleverly understands it refers to a commit object
> > since "v5.9-rc5-00002-" prefix only makes sense
> > in the commit context.
> >
>
> Well, yes, but unfortunately only so far as applying the same "the user
> means a commit object" logic; git doesn't do anything to actually use
> the prefix to disambiguate. In fact, looking at a semi-random commit in
> 4.19-stable v4.19.114-7-g236c445eb352:
>
> $ git show 236c44
> error: short SHA1 236c44 is ambiguous
> hint: The candidates are:
> hint:   236c445eb352 commit 2020-03-13 - drm/bochs: downgrade
> pci_request_region failure from error to warning
> hint:   236c448cb6e7 commit 2011-09-08 - usbmon vs. tcpdump: fix dropped
> packet count
> hint:   236c445b1930 blob
>
> Now you're right that we get
>
> $ git show v4.19.114-7-g236c445 | head -n1
> commit 236c445eb3529aa7c976f8812513c3cb26d77e27
>
> so it knows we're not looking at that blob. But "git show
> v4.19.114-7-g236c44" still fails because there are two commits. Adding
> to the fun is that "git show v4.19.114-7-g236c448" actually shows the
> usbmon commit, which is old v3.2 stuff and certainly not a descendant of
> v4.19.114.


Oh, I did not notice that.
Maybe git can be improved to check the validity of
the 'git describe' form, but that is another story.



> I didn't actually know that git would even accept those prefixed strings
> as possible refspecs, and for a moment you had me hoping that git would
> actually do the disambiguation using that prefix, which would certainly
> make 7 hex chars enough; unfortunately that's not the case.
>
> > Keeping those above in mind, I believe 15-digits is too long.
>
> Well, as you indicated, commits are probably ~1/4 of all objects, i.e.
> half a hexchar worth of bits. So the commit/object distinction shouldn't
> matter very much for the choice of suitable fixed length.
>
> > So, I propose two options.
> >
> >
> > [1] 7 digits
> >
> > The abbreviated hash part may not uniquely identify
> > the commit. In that case, you need some extra git
> > operations to find out which one is it.
> >
> > As for the kernel build,
> > <kernel-version>-<number-of-commits>-<7-digits> is enough
> > to provide the unique kernelrelease string.
> >
> >
> >
> > [2] 12 digits
> >
> > This matches to the Fixes: tag policy specified in
> > Documentation/process/submitting-patches.rst
> >
> > The abbreviated hash part is very likely unique
> > in the commit object space.
> > (Of course, it is impossible to guarantee the uniqueness)
>
> I can live with 12. It would be extremely rare that I would have to do
> some extra git yoga to figure out which commit they actually mean. I
> think we can still use git describe to do the bulk of the work (the 'git
> rev-parse | cut ... is easy, but it's somewhat tedious to find the
> closest tag, then compute the #commits-on-top part), then just have the
> awk script used for pretty-printing (the %05d of the #commits-on-top)
> can probably also be used to chop the abbreviated sha1 at 12 chars,
> should git have needed to make it longer. Please see below for an
> updated patch+commit log.


LGTM.

Could you send it to ML as a patch?



> >> Alternatively, would you consider a Kconfig knob, LOCALVERSION_ABBREV?
> >
> >
> > No, I do not think LOCALVERSION_ABBREV is a good idea.
>
> Neither do I; I considered it before sending the patch but decided that yes:
>
> > We should eliminate this problem
> > irrespective of the kernel configuration.
>
> Rasmus
>
> ====
>
> something like this, then?
>
>
>     scripts/setlocalversion: make git describe output more reliable
>
>     When building for an embedded target using Yocto, we're sometimes
>     observing that the version string that gets built into vmlinux (and
>     thus what uname -a reports) differs from the path under /lib/modules/
>     where modules get installed in the rootfs, but only in the length of
>     the -gabc123def suffix. Hence modprobe always fails.
>
>     The problem is that Yocto has the concept of "sstate" (shared state),
>     which allows different developers/buildbots/etc. to share build
>     artifacts, based on a hash of all the metadata that went into building
>     that artifact - and that metadata includes all dependencies (e.g. the
>     compiler used etc.). That normally works quite well; usually a clean
>     build (without using any sstate cache) done by one developer ends up
>     being binary identical to a build done on another host. However, one
>     thing that can cause two developers to end up with different builds
>     [and thus make one's vmlinux package incompatible with the other's
>     kernel-dev package], which is not captured by the metadata hashing, is
>     this `git describe`: The output of that can be affected by
>
>     (1) git version: before 2.11 git defaulted to a minimum of 7, since
>     2.11 (git.git commit e6c587) the default is dynamic based on the
>     number of objects in the repo
>     (2) hence even if both run the same git version, the output can differ
>     based on how many remotes are being tracked (or just lots of local
>     development branches or plain old garbage)
>     (3) and of course somebody could have a core.abbrev config setting in
>     ~/.gitconfig
>
>     So in order to avoid `uname -a` output relying on such random details
>     of the build environment which are rather hard to ensure are
>     consistent between developers and buildbots, make sure the abbreviated
>     sha1 always consists of exactly 12 hex characters. That is consistent
>     with the current rule for -stable patches, and is almost always enough
>     to identify the head commit unambigously - in the few cases where it
>     does not, the v5.4.3-00021- prefix would certainly nail it down.
>
> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
> index 20f2efd57b11..b51072167df1 100755
> --- a/scripts/setlocalversion
> +++ b/scripts/setlocalversion
> @@ -45,7 +45,7 @@ scm_version()
>
>         # Check for git and a git repo.
>         if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
> -          head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
> +          head=$(git rev-parse --verify HEAD 2>/dev/null); then
>
>                 # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
>                 # it, because this version is defined in the top level Makefile.
> @@ -59,11 +59,22 @@ scm_version()
>                         fi
>                         # If we are past a tagged commit (like
>                         # "v2.6.30-rc5-302-g72357d5"), we pretty print it.
> -                       if atag="$(git describe 2>/dev/null)"; then
> -                               echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
> -
> -                       # If we don't have a tag at all we print -g{commitish}.
> +                        #
> +                        # Ensure the abbreviated sha1 has exactly 12
> +                        # hex characters, to make the output
> +                        # independent of git version, local
> +                        # core.abbrev settings and/or total number of
> +                        # objects in the current repository - passing
> +                        # --abbrev=12 ensures a minimum of 12, and the
> +                        # awk substr() then picks the 'g' and first 12
> +                        # hex chars.
> +                       if atag="$(git describe --abbrev=12 2>/dev/null)"; then
> +                               echo "$atag" | awk -F- '{printf("-%05d-%s",
> $(NF-1),substr($(NF),0,13))}'
> +
> +                       # If we don't have a tag at all we print -g{commitish},
> +                       # again using exactly 12 hex chars.
>                         else
> +                               head="$(echo $head | cut -c1-12)"
>                                 printf '%s%s' -g $head
>                         fi
>                 fi



-- 
Best Regards
Masahiro Yamada

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] scripts/setlocalversion: make git describe output more reliable
  2020-09-10 11:26 [PATCH] scripts/setlocalversion: make git describe output more reliable Rasmus Villemoes
  2020-09-10 14:28 ` Guenter Roeck
  2020-09-10 14:34 ` Masahiro Yamada
@ 2020-09-17  6:56 ` Rasmus Villemoes
  2020-09-17 12:22   ` Nico Schottelius
  2020-09-24 17:27   ` Masahiro Yamada
  2 siblings, 2 replies; 17+ messages in thread
From: Rasmus Villemoes @ 2020-09-17  6:56 UTC (permalink / raw)
  To: Masahiro Yamada, Randy Dunlap, Brian Norris, Bhaskar Chowdhury,
	Nico Schottelius, Guenter Roeck
  Cc: Rasmus Villemoes, linux-kernel

When building for an embedded target using Yocto, we're sometimes
observing that the version string that gets built into vmlinux (and
thus what uname -a reports) differs from the path under /lib/modules/
where modules get installed in the rootfs, but only in the length of
the -gabc123def suffix. Hence modprobe always fails.

The problem is that Yocto has the concept of "sstate" (shared state),
which allows different developers/buildbots/etc. to share build
artifacts, based on a hash of all the metadata that went into building
that artifact - and that metadata includes all dependencies (e.g. the
compiler used etc.). That normally works quite well; usually a clean
build (without using any sstate cache) done by one developer ends up
being binary identical to a build done on another host. However, one
thing that can cause two developers to end up with different builds
[and thus make one's vmlinux package incompatible with the other's
kernel-dev package], which is not captured by the metadata hashing, is
this `git describe`: The output of that can be affected by

(1) git version: before 2.11 git defaulted to a minimum of 7, since
2.11 (git.git commit e6c587) the default is dynamic based on the
number of objects in the repo
(2) hence even if both run the same git version, the output can differ
based on how many remotes are being tracked (or just lots of local
development branches or plain old garbage)
(3) and of course somebody could have a core.abbrev config setting in
~/.gitconfig

So in order to avoid `uname -a` output relying on such random details
of the build environment which are rather hard to ensure are
consistent between developers and buildbots, make sure the abbreviated
sha1 always consists of exactly 12 hex characters. That is consistent
with the current rule for -stable patches, and is almost always enough
to identify the head commit unambigously - in the few cases where it
does not, the v5.4.3-00021- prefix would certainly nail it down.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
v2: use 12 instead of 15, and ensure that the result does have exactly
12 hex chars.

 scripts/setlocalversion | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/scripts/setlocalversion b/scripts/setlocalversion
index 20f2efd57b11..bb709eda96cd 100755
--- a/scripts/setlocalversion
+++ b/scripts/setlocalversion
@@ -45,7 +45,7 @@ scm_version()
 
 	# Check for git and a git repo.
 	if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
-	   head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
+	   head=$(git rev-parse --verify HEAD 2>/dev/null); then
 
 		# If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
 		# it, because this version is defined in the top level Makefile.
@@ -59,11 +59,22 @@ scm_version()
 			fi
 			# If we are past a tagged commit (like
 			# "v2.6.30-rc5-302-g72357d5"), we pretty print it.
-			if atag="$(git describe 2>/dev/null)"; then
-				echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
-
-			# If we don't have a tag at all we print -g{commitish}.
+			#
+			# Ensure the abbreviated sha1 has exactly 12
+			# hex characters, to make the output
+			# independent of git version, local
+			# core.abbrev settings and/or total number of
+			# objects in the current repository - passing
+			# --abbrev=12 ensures a minimum of 12, and the
+			# awk substr() then picks the 'g' and first 12
+			# hex chars.
+			if atag="$(git describe --abbrev=12 2>/dev/null)"; then
+				echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),substr($(NF),0,13))}'
+
+			# If we don't have a tag at all we print -g{commitish},
+			# again using exactly 12 hex chars.
 			else
+				head="$(echo $head | cut -c1-12)"
 				printf '%s%s' -g $head
 			fi
 		fi
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] scripts/setlocalversion: make git describe output more reliable
  2020-09-17  6:56 ` [PATCH v2] " Rasmus Villemoes
@ 2020-09-17 12:22   ` Nico Schottelius
  2020-09-17 12:58     ` Rasmus Villemoes
  2020-09-24 17:27   ` Masahiro Yamada
  1 sibling, 1 reply; 17+ messages in thread
From: Nico Schottelius @ 2020-09-17 12:22 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Masahiro Yamada, Randy Dunlap, Brian Norris, Bhaskar Chowdhury,
	Nico Schottelius, Guenter Roeck, linux-kernel


Thanks for the patch Rasmus. Overall it looks good to me, be aligned to
the stable patch submission rules makes sense. A tiny thing though:

I did not calculate the exact collision probability with 12 characters
and it does not make sense to even discuss this, if this is a current
rule for stable patches. However we have a couple of 12's scattered in
the code. And if the submission rule changes in the future, we should
have a single location to update it.

So I suggest you introduce something on the line of:

...
num_chars=12
...
--abbrev=$num_chars
...

I guess you get the picture.

Greetings from the sunny mountains,

Nico


Rasmus Villemoes <linux@rasmusvillemoes.dk> writes:

> When building for an embedded target using Yocto, we're sometimes
> observing that the version string that gets built into vmlinux (and
> thus what uname -a reports) differs from the path under /lib/modules/
> where modules get installed in the rootfs, but only in the length of
> the -gabc123def suffix. Hence modprobe always fails.
>
> The problem is that Yocto has the concept of "sstate" (shared state),
> which allows different developers/buildbots/etc. to share build
> artifacts, based on a hash of all the metadata that went into building
> that artifact - and that metadata includes all dependencies (e.g. the
> compiler used etc.). That normally works quite well; usually a clean
> build (without using any sstate cache) done by one developer ends up
> being binary identical to a build done on another host. However, one
> thing that can cause two developers to end up with different builds
> [and thus make one's vmlinux package incompatible with the other's
> kernel-dev package], which is not captured by the metadata hashing, is
> this `git describe`: The output of that can be affected by
>
> (1) git version: before 2.11 git defaulted to a minimum of 7, since
> 2.11 (git.git commit e6c587) the default is dynamic based on the
> number of objects in the repo
> (2) hence even if both run the same git version, the output can differ
> based on how many remotes are being tracked (or just lots of local
> development branches or plain old garbage)
> (3) and of course somebody could have a core.abbrev config setting in
> ~/.gitconfig
>
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, make sure the abbreviated
> sha1 always consists of exactly 12 hex characters. That is consistent
> with the current rule for -stable patches, and is almost always enough
> to identify the head commit unambigously - in the few cases where it
> does not, the v5.4.3-00021- prefix would certainly nail it down.
>
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> ---
> v2: use 12 instead of 15, and ensure that the result does have exactly
> 12 hex chars.
>
>  scripts/setlocalversion | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
> index 20f2efd57b11..bb709eda96cd 100755
> --- a/scripts/setlocalversion
> +++ b/scripts/setlocalversion
> @@ -45,7 +45,7 @@ scm_version()
>
>  	# Check for git and a git repo.
>  	if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
> -	   head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
> +	   head=$(git rev-parse --verify HEAD 2>/dev/null); then
>
>  		# If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
>  		# it, because this version is defined in the top level Makefile.
> @@ -59,11 +59,22 @@ scm_version()
>  			fi
>  			# If we are past a tagged commit (like
>  			# "v2.6.30-rc5-302-g72357d5"), we pretty print it.
> -			if atag="$(git describe 2>/dev/null)"; then
> -				echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
> -
> -			# If we don't have a tag at all we print -g{commitish}.
> +			#
> +			# Ensure the abbreviated sha1 has exactly 12
> +			# hex characters, to make the output
> +			# independent of git version, local
> +			# core.abbrev settings and/or total number of
> +			# objects in the current repository - passing
> +			# --abbrev=12 ensures a minimum of 12, and the
> +			# awk substr() then picks the 'g' and first 12
> +			# hex chars.
> +			if atag="$(git describe --abbrev=12 2>/dev/null)"; then
> +				echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),substr($(NF),0,13))}'
> +
> +			# If we don't have a tag at all we print -g{commitish},
> +			# again using exactly 12 hex chars.
>  			else
> +				head="$(echo $head | cut -c1-12)"
>  				printf '%s%s' -g $head
>  			fi
>  		fi


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] scripts/setlocalversion: make git describe output more reliable
  2020-09-17 12:22   ` Nico Schottelius
@ 2020-09-17 12:58     ` Rasmus Villemoes
  2020-09-21  9:35       ` Nico Schottelius
  0 siblings, 1 reply; 17+ messages in thread
From: Rasmus Villemoes @ 2020-09-17 12:58 UTC (permalink / raw)
  To: Nico Schottelius
  Cc: Masahiro Yamada, Randy Dunlap, Brian Norris, Bhaskar Chowdhury,
	Guenter Roeck, linux-kernel

On 17/09/2020 14.22, Nico Schottelius wrote:
> 
> Thanks for the patch Rasmus. Overall it looks good to me, be aligned to
> the stable patch submission rules makes sense. A tiny thing though:
> 
> I did not calculate the exact collision probability with 12 characters

For reference, the math is something like this: Consider a repo with N+1
objects. We look at one specific object (for setlocalversion its the
head commit being built, for the stable rules its whatever particular
commit one is interested in for backporting), and want to know the
probability that its sha1 collides with some other object in the first b
bits (here b=48). Assuming the sha1s are independent and uniformly
distributed, the probability of not colliding with one specific other
commit is x=1-1/2^b, and the probability of not colliding with any of
the other N commits is x^N, making the probability of a collision 1-x^N
= (1-x)(1+x+x^2+...+x^{N-1}). Now the N terms in the second factor are
very-close-to-but-slightly-smaller-than 1, so an upper bound for this
probability is (1-x)N = N/2^b, which is also what one would naively
expect. [This estimate is always valid, but it becomes a void statement
of "the probability is less then 1" when N is >= 2^b].

So, assuming some vendor kernel repo that has all of Greg's stable.git
(around 10M objects I think) and another 10M objects because random
vendor, that works out to 20e6/2^48 = 7.1e-8, 71 ppb.

> So I suggest you introduce something on the line of:
> 
> ...
> num_chars=12
> ...
> --abbrev=$num_chars

I considered that, but it becomes quite ugly since it needs to get into
the awk script (as a 13, though perhaps we could get awk to do the +1, I
don't really speak awk), where we'd then need to use " instead of ' and
then escape the $ that are to be interpreted by awk and not the shell.
So I think it's more readable with hardcoding and comments explaining
why they are there; should anyone ever want to change 12.

Rasmus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] scripts/setlocalversion: make git describe output more reliable
  2020-09-17 12:58     ` Rasmus Villemoes
@ 2020-09-21  9:35       ` Nico Schottelius
  0 siblings, 0 replies; 17+ messages in thread
From: Nico Schottelius @ 2020-09-21  9:35 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Nico Schottelius, Masahiro Yamada, Randy Dunlap, Brian Norris,
	Bhaskar Chowdhury, Guenter Roeck, linux-kernel


Hey Rasmus,

Rasmus Villemoes <linux@rasmusvillemoes.dk> writes:
>> So I suggest you introduce something on the line of:
>>
>> ...
>> num_chars=12
>> ...
>> --abbrev=$num_chars
>
> I considered that, but it becomes quite ugly since it needs to get into
> the awk script (as a 13, though perhaps we could get awk to do the +1, I
> don't really speak awk), where we'd then need to use " instead of ' and
> then escape the $ that are to be interpreted by awk and not the shell.

No need for that, awk can read (and use) environment variables...

> So I think it's more readable with hardcoding and comments explaining
> why they are there; should anyone ever want to change 12.

... so that in practice you only need to change 13 to var+1 and 12 to $var.

Cheers,

Nico


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] scripts/setlocalversion: make git describe output more reliable
  2020-09-17  6:56 ` [PATCH v2] " Rasmus Villemoes
  2020-09-17 12:22   ` Nico Schottelius
@ 2020-09-24 17:27   ` Masahiro Yamada
  1 sibling, 0 replies; 17+ messages in thread
From: Masahiro Yamada @ 2020-09-24 17:27 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Randy Dunlap, Brian Norris, Bhaskar Chowdhury, Nico Schottelius,
	Guenter Roeck, Linux Kernel Mailing List

On Thu, Sep 17, 2020 at 3:56 PM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> When building for an embedded target using Yocto, we're sometimes
> observing that the version string that gets built into vmlinux (and
> thus what uname -a reports) differs from the path under /lib/modules/
> where modules get installed in the rootfs, but only in the length of
> the -gabc123def suffix. Hence modprobe always fails.
>
> The problem is that Yocto has the concept of "sstate" (shared state),
> which allows different developers/buildbots/etc. to share build
> artifacts, based on a hash of all the metadata that went into building
> that artifact - and that metadata includes all dependencies (e.g. the
> compiler used etc.). That normally works quite well; usually a clean
> build (without using any sstate cache) done by one developer ends up
> being binary identical to a build done on another host. However, one
> thing that can cause two developers to end up with different builds
> [and thus make one's vmlinux package incompatible with the other's
> kernel-dev package], which is not captured by the metadata hashing, is
> this `git describe`: The output of that can be affected by
>
> (1) git version: before 2.11 git defaulted to a minimum of 7, since
> 2.11 (git.git commit e6c587) the default is dynamic based on the
> number of objects in the repo
> (2) hence even if both run the same git version, the output can differ
> based on how many remotes are being tracked (or just lots of local
> development branches or plain old garbage)
> (3) and of course somebody could have a core.abbrev config setting in
> ~/.gitconfig
>
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, make sure the abbreviated
> sha1 always consists of exactly 12 hex characters. That is consistent
> with the current rule for -stable patches, and is almost always enough
> to identify the head commit unambigously - in the few cases where it
> does not, the v5.4.3-00021- prefix would certainly nail it down.
>
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> ---
> v2: use 12 instead of 15, and ensure that the result does have exactly
> 12 hex chars.
>
>  scripts/setlocalversion | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)



Applied to linux-kbuild.
Thanks.

-- 
Best Regards
Masahiro Yamada

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-09-24 17:28 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 11:26 [PATCH] scripts/setlocalversion: make git describe output more reliable Rasmus Villemoes
2020-09-10 14:28 ` Guenter Roeck
2020-09-10 14:34 ` Masahiro Yamada
2020-09-10 19:05   ` Brian Norris
2020-09-11  8:28     ` Rasmus Villemoes
2020-09-16 14:28       ` Masahiro Yamada
2020-09-16 15:23         ` Rasmus Villemoes
2020-09-16 18:01           ` Masahiro Yamada
2020-09-16 19:31             ` Rasmus Villemoes
2020-09-17  0:48               ` Masahiro Yamada
2020-09-10 22:56   ` Bhaskar Chowdhury
2020-09-16  8:48   ` Rasmus Villemoes
2020-09-17  6:56 ` [PATCH v2] " Rasmus Villemoes
2020-09-17 12:22   ` Nico Schottelius
2020-09-17 12:58     ` Rasmus Villemoes
2020-09-21  9:35       ` Nico Schottelius
2020-09-24 17:27   ` Masahiro Yamada

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).