git.vger.kernel.org archive mirror
* Dealing with corporate email recycling
From: Sean Allred @ 2022-03-12 22:38 UTC
  To: git; +Cc: sallred, grmason, sconrad

Hi all,

We are currently replaying a 15-year SVN history into Git -- with
contributions from thousands of developers -- and are faced with the
challenge of corporate email recycling, departures, re-hires, and name
changes causing identity issues.

* Background

As you know (also to validate my own knowledge/assumptions), a Git
commit stores identity as a name and an email.  The only means to
validate this information is via signing; commits are otherwise taken
at face-value.  This seems pretty core to Git's decentralized design.
So to identify who is responsible for a commit, you have only the two
name+email pairs.

The problem in a nutshell: names and emails change over time.  The
simple cases can be handled by gitmailmap, but there are more
challenging cases:

  - A commit author might have had some email <one@corp.net>, but then
    was able to 'upgrade' to <two@corp.net> after its previous holder
    departed.

  - It's even possible that the departed employee might 'boomerang'
    and return to their old job, albeit now with a different email
    (since they forfeited <two@corp.net> upon departure).

In effect, the email address (or even email+name pair) used by a given
developer is not enough to identify that developer.

This issue is exacerbated by the features of some Git forges (e.g.
GitHub, GitLab, etc.) that will map an email address to a user
account.  In the degenerate case, this would cause the forge to
attribute the commit incorrectly -- causing communication issues as
developers use the misattribution to reach out to the wrong people.

As a baseline, we know the following statements are true:

  1. A person's preferred name can change at any time.
  2. A person's preferred email can change at any time.
  3. Neither of these pieces of information are necessarily
     identifying in a given codebase.

* Current Options

Setting aside the dubious practice of email recycling, how should we
look at resolving this confusion in a sensible way?  I see three
general options that are possible today, each with their drawbacks:

  1. Do nothing.  Leave it to the developer to determine the correct
     contact information without assistance.

     This doesn't really resolve the confusion, but it is technically
     an option.

  2. Use gitmailmap(5) functionality to resolve historical emails to
     primary emails.

     Sadly this doesn't actually solve the email recycling problem.
     Since one email could be used by multiple developers, there's no
     way (that I can see) to use a single mailmap file to resolve one
     of these emails to a single person.

  3. Use and require commit signing -- using some separate system to
     keep track of who used what public key when
     (valid-before/-after).

     This is an attractive option and very much fits the 'identity'
     portion of this problem, but evidently, it's not yet supported
     by git-fast-import.  This makes it a non-starter for at least all
     the SVN history we're importing.  Even the idea of providing
     personal private keys to an impersonal import process seems less
     than appealing, but I'm open to the possibility that I'm being
     too paranoid here.

At this point, I'd like to ask the community what approaches have
already proven successful.  I have a design sketch below, but I would
not want to propose introducing a new standard if a standard already
exists.  I tend to think this is not the first time this problem has
come up.

* Bad Proposal: Using Mailmap Over-Time

In the absence of other options, this leaves us to consider another:
if the mailmap file is tracked in history, we can know who had which
email when.  Taking advantage of this fact right now is a bit
roundabout, but workable.  Using the `mailmap.blob` config and
pointing to the mailmap version as of a given commit yields behavior
that *looks* promising: we can resolve an email in a commit to the
person who had that email when the commit was made.  Mentally
extending this concept to do this automatically in git-log and
friends, however, shows that this wholesale removes the value of
mailmap: the ability to change your name as it appears in history
*after* that history is created.  More details at [1].

There are of course legitimate reasons for a developer to change their
name and desire that name to be used throughout the history.  It would
seem then that current mailmap functionality is insufficient to create
an external system that solves the problem.
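
To make the trade-off concrete, here is a small Python sketch of the
two lookup strategies.  The snapshot data is hypothetical (including
the later rename to 'Ada Thor'), and plain dictionaries stand in for
real mailmap parsing; it only illustrates the reasoning above.

```python
# Hypothetical mailmap snapshots: commit email -> identity to display,
# as the .mailmap would have resolved it at each revision.
SNAPSHOTS = {
    "HEAD~1": {
        "foo@example.com": "A. U. Thor <foo@example.com>",
    },
    "HEAD": {
        # Ada has since moved to a new address and changed her name.
        "ada@example.com": "Ada Thor <ada@example.com>",
        # Her old address now belongs to Roy.
        "foo@example.com": "Roy G. Biv <foo@example.com>",
    },
}


def resolve_latest(email):
    """Today's behavior: always consult the newest mailmap."""
    return SNAPSHOTS["HEAD"].get(email)


def resolve_at_commit(rev, email):
    """The 'over-time' idea: consult the mailmap as of the commit."""
    return SNAPSHOTS[rev].get(email)


# Ada's old commit at HEAD~1 was recorded with <foo@example.com>.
print(resolve_latest("foo@example.com"))               # Roy G. Biv <foo@example.com>
print(resolve_at_commit("HEAD~1", "foo@example.com"))  # A. U. Thor <foo@example.com>
```

The first lookup names the wrong person.  The second names the right
person but freezes her old name and an address she no longer controls,
which is exactly the after-the-fact correction mailmap exists to
provide.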

* Proposal: UUIDs

To get what we want (i.e., the ability to run `git show HEAD~1`, know
that Ada wrote it, and report her current contact information), we
need some way of tracking identity over time.  A naive solution could
be to extend the mailmap format as recognized by Git:

    $ git cat-file blob HEAD~1:.mailmap
    A. U. Thor <foo@example.com> [uuid A] <ada@example.com>

    $ git cat-file blob HEAD:.mailmap
    A. U. Thor <ada@example.com> [uuid A]
    Roy G. Biv <foo@example.com> [uuid B] <roy@example.com>

Now, when I run `git show HEAD~1`, Git would determine the UUID of the
email on the commit using the mailmap version in that tree:

    $ git -c mailmap.blob=HEAD~1:.mailmap check-mailmap --uuid "<foo@example.com>"
    A

Then, we can use that UUID to resolve to the current contact information:

    $ git check-mailmap --uuid=A
    A. U. Thor <ada@example.com>

Mailmap-sensitive commands can use this logic internally -- possibly
guarded under some new config setting.
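
As a rough illustration of the two-step lookup, here is a minimal
Python sketch.  The `[uuid X]` syntax is the hypothetical extension
proposed here, not something Git understands today, and the helper
names (`parse_mailmap`, `uuid_for_email`, `contact_for_uuid`,
`resolve`) are invented for the example.

```python
import re

# Hypothetical extended mailmap line, as proposed above:
#   Proper Name <current@email> [uuid ID] <other@email> ...
LINE_RE = re.compile(
    r"^(?P<name>[^<]+?)\s*<(?P<email>[^>]+)>\s*"
    r"\[uuid (?P<uuid>[^\]]+)\](?P<rest>.*)$"
)
EMAIL_RE = re.compile(r"<([^>]+)>")


def parse_mailmap(text):
    """Parse the hypothetical UUID-extended format into a list of entries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if m is None:
            continue  # plain (non-UUID) entries are ignored in this sketch
        entries.append({
            "uuid": m["uuid"],
            "name": m["name"],
            "email": m["email"],
            "aliases": EMAIL_RE.findall(m["rest"]),
        })
    return entries


def uuid_for_email(entries, email):
    """Step 1: which identity owned this address in this mailmap version?"""
    for entry in entries:
        if email == entry["email"] or email in entry["aliases"]:
            return entry["uuid"]
    return None


def contact_for_uuid(entries, uuid):
    """Step 2: what is that identity's contact information in this version?"""
    for entry in entries:
        if entry["uuid"] == uuid:
            return f'{entry["name"]} <{entry["email"]}>'
    return None


def resolve(commit_email, mailmap_at_commit, mailmap_now):
    """Resolve a commit's email via the old mailmap, then report current info."""
    uuid = uuid_for_email(parse_mailmap(mailmap_at_commit), commit_email)
    if uuid is None:
        return None
    return contact_for_uuid(parse_mailmap(mailmap_now), uuid)
```

Fed the two .mailmap versions shown above, `resolve("foo@example.com",
old, new)` would report Ada's current entry rather than Roy's, because
the address is first tied to UUID A using the older file.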

** Criticisms

Main criticisms of this proposal that I can think of off-hand:

  1. As far as I know, the mailmap format is pretty well-established.
     I don't know how additions/extensions to the format will be
     interpreted by other tools.

     This could be mitigated with a config option or even a magic
     comment in the file itself noting its format version.

  2. This also assumes that any given email can belong to at most one
     person at a time.  This is true for us, but may not be generally
     true.  I don't know if this is a new assumption for mailmap.

     I don't know of any mitigation here.

  3. The current mailmap format of 'one line per pair' potentially
     opens up the format to ambiguities when resolving an email
     address -- ambiguities that become more apparent with the
     introduction of a UUID.  For example:

         A. U. Thor <foo@example.com> [uuid A] <ada@example.com>
         Ada Thor <ada@example.com> [uuid A] <foo@example.com>

     When evaluating `git check-mailmap --uuid=A`, which line is used?
     Perhaps this is not a new problem, but it is a disconcerting one.

     This might be resolved by allowing many aliases on a single line
     and restricting UUIDs to be unique in the file.  Since the
     introduction of UUIDs would be a change to the format that by
     definition would need to be opted-into (to provide the UUIDs),
     this would not be a breaking change in and of itself.

I am additionally unsure of how the following current behavior (from
gitmailmap(5)) should play into this proposal -- mostly because I
don't understand the use-case to carve out a specific author/committer
name for the mapping:

    > [...] and:
    >
    >     Proper Name <proper@email.xx> Commit Name <commit@email.xx>
    >
    > which allows mailmap to replace both the name and the email of a
    > commit matching both the specified commit name and email
    > address.

* Proposal: Valid-Before/Valid-After

Another potential idea is to record a transition point in the mailmap
file:

    A. U. Thor <foo@example.com> <ada@example.com> valid-before=<timestamp>
    A. U. Thor <ada@example.com>

This draws on the valid-before/-after pattern used in commit signing.
While I've signed a few commits in my day, I'm admittedly not well
versed in that infrastructure/implementation.  I'd be curious whether
this idea would have a champion to argue for it.  I personally prefer
UUIDs.
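
A minimal Python sketch of how such a time-bounded lookup might behave
follows.  It assumes one particular reading of the entry above:
commits recorded with <foo@example.com> and dated before the cutoff
belong to Ada, later ones to whoever holds the address now.  The
`Entry` type, the `resolve` helper, and the handover date are all
hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    commit_email: str                 # address as recorded in the commit
    canonical: str                    # "Name <email>" to report instead
    valid_before: Optional[datetime]  # None means "no upper bound"


def resolve(entries, commit_email, commit_time):
    """Return the canonical identity for an address at a given commit time."""
    candidates = [e for e in entries if e.commit_email == commit_email]
    # Check bounded entries first, earliest cutoff first, so the narrowest
    # applicable window wins; unbounded entries act as the fallback.
    bounded = sorted(
        (e for e in candidates if e.valid_before is not None),
        key=lambda e: e.valid_before,
    )
    for entry in bounded:
        if commit_time < entry.valid_before:
            return entry.canonical
    for entry in candidates:
        if entry.valid_before is None:
            return entry.canonical
    return None


handover = datetime(2020, 1, 1, tzinfo=timezone.utc)  # hypothetical date
entries = [
    Entry("foo@example.com", "A. U. Thor <ada@example.com>", handover),
    Entry("foo@example.com", "Roy G. Biv <roy@example.com>", None),
]
before = datetime(2019, 6, 1, tzinfo=timezone.utc)
after = datetime(2021, 6, 1, tzinfo=timezone.utc)
print(resolve(entries, "foo@example.com", before))  # A. U. Thor <ada@example.com>
print(resolve(entries, "foo@example.com", after))   # Roy G. Biv <roy@example.com>
```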

** Criticisms

There are a few things I can identify:

  1. While it's potentially good enough to solve the immediate
     problem, this doesn't actually establish a tangible identity.

  2. Parsing this as a human being could become difficult if there are
     a few transition points.

* Summary

All of this is under the assumption that there is no viable approach
out there, which I'm not convinced of.  This proposal is only a
suggestion of how it *could* be standardized with development if no
standard practice currently exists.

So, two questions remain:

  1. Is there a standard approach out there already that was not
     discussed above?  (Are there statements I made about those
     approaches that are not true?)

  2. Assuming there is no suitable standard, is the above proposal
     worth investing development time, or are there fundamental flaws?

Thanks,
Sean Allred

---

[1]: See below.  I removed this from the main body to try and control
its length.

* Failed Plan: using mailmap over-time

(I'm going to say things in the course of this example that are 'wrong'
and/or 'working as designed', so bear with me!)

Consider the following.  At some point in the past, <foo@example.com>
belonged to Ada.  She wrote the following commit:

    $ git cat-file commit HEAD~1
    ...
    author A. U. Thor <foo@example.com> ...
    committer A. U. Thor <foo@example.com> ...

    $ git cat-file blob HEAD~1:.mailmap
    A. U. Thor <foo@example.com> <ada@example.com>

Somewhere down the line, Ada has left, <foo@example.com> transferred to
Roy, and he wrote the following commit:

    $ git cat-file commit HEAD
    ...
    author Roy G. Biv <foo@example.com> ...
    committer Roy G. Biv <foo@example.com> ...

    $ git cat-file blob HEAD:.mailmap
    A. U. Thor <ada@example.com>
    Roy G. Biv <foo@example.com> <roy@example.com>

If we check the mailmap to resolve <foo@example.com> from HEAD~1, we
get the wrong answer:

    $ git show HEAD~1
    ...
    Author: Roy G. Biv <foo@example.com>
    ...

It's only when we provide the version of the mailmap file that was
active at the time that we get the right answer:

    $ git -c mailmap.blob=HEAD~1:.mailmap show HEAD~1
    ...
    Author: A. U. Thor <foo@example.com>
    ...

So, if we can instruct git-show and friends to check the mailmap
version at the time of the commit, we get what appears to be the
desired behavior.  Great, right?

Not so fast.  This loses sight of one of the main purposes of
gitmailmap.  This idea failed.

You're at the end now; thanks for reading :-)

--
Sean Allred

* Re: Dealing with corporate email recycling
From: Junio C Hamano @ 2022-03-13  0:03 UTC
  To: Sean Allred; +Cc: git, sallred, grmason, sconrad

Sean Allred <allred.sean@gmail.com> writes:

> As a baseline, we know the following statements are true:
>
>   1. A person's preferred name can change at any time.
>   2. A person's preferred email can change at any time.
>   3. Neither of these pieces of information are necessarily
>      identifying in a given codebase.

Another thing we know is

    4. People know that old e-mail addresses stay in archives and
       address books of people, and find it wise to avoid reusing an
       address somebody else (especially well-known ones) has been
       using, so that they do not get e-mails from total strangers
       and have to tell them that the intended recipient does not
       read the mailbox anymore.

>   1. Do nothing.  Leave it to the developer to determine the correct
>      contact information without assistance.
>
>      This doesn't really resolve the confusion, but it is technically
>      an option.
>
>   2. Use gitmailmap(5) functionality to resolve historical emails to
>      primary emails.
>
>      Sadly this doesn't actually solve the email recycling problem.
>      Since one email could be used by multiple developers, there's no
>      way (that I can see) to use a single mailmap file to resolve one
>      of these emails to a single person.


GNU arch (tla) had an interesting idea around this area and used a
combination of time and e-mail address to identify a person.
one@corp--$date referred to the person who had control of the
address on the specified date, where $date can be abbreviated to
2022 or 202201 to mean 20220101.

The mailmap allows "Name e-mail" or "e-mail" to be mapped to
canonical "Name e-mail", but we should be able to coax "valid time
range" information into each entry of the .mailmap format,
i.e. "if you see 'Name e-mail' between time X and Y, map that to...".
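
A minimal Python sketch of the date abbreviation follows.  It is based
only on the description above, not on GNU arch's actual
implementation, and the function name `parse_dated_address` is made up
for the example.

```python
from datetime import date


def parse_dated_address(key):
    """Split 'address--date' and expand an abbreviated date."""
    address, sep, stamp = key.rpartition("--")
    if not sep or not address:
        raise ValueError(f"expected 'address--date', got {key!r}")
    if not stamp.isdigit() or len(stamp) not in (4, 6, 8):
        raise ValueError(f"unsupported date {stamp!r}")
    # A missing month or day defaults to 01, so 2022 and 202201 both
    # expand to 20220101.
    full = stamp + "0101"[len(stamp) - 4:]
    return address, date(int(full[:4]), int(full[4:6]), int(full[6:8]))
```

The resulting (address, date) pair could then serve as the key for a
"who controlled this address on that day" lookup.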


* RE: Dealing with corporate email recycling
From: rsbecker @ 2022-03-13  0:26 UTC
  To: 'Junio C Hamano', 'Sean Allred'
  Cc: git, sallred, grmason, sconrad

On March 12, 2022 7:04 PM, Junio C Hamano wrote:
>To: Sean Allred <allred.sean@gmail.com>
>Cc: git@vger.kernel.org; sallred@epic.com; grmason@epic.com;
>sconrad@epic.com
>Subject: Re: Dealing with corporate email recycling
>
>Sean Allred <allred.sean@gmail.com> writes:
>
>> As a baseline, we know the following statements are true:
>>
>>   1. A person's preferred name can change at any time.
>>   2. A person's preferred email can change at any time.
>>   3. Neither of these pieces of information are necessarily
>>      identifying in a given codebase.
>
>Another thing we know is
>
>    4. People know that old e-mail addresses stay in archives and
>       address books of people, and find it wise to avoid reusing an
>       address somebody else (especially well-known ones) has been
>       using, so that they do not get e-mails from total strangers
>       and have to tell them that the intended recipient does not
>       read the mailbox anymore.
>
>>   1. Do nothing.  Leave it to the developer to determine the correct
>>      contact information without assistance.
>>
>>      This doesn't really resolve the confusion, but it is technically
>>      an option.
>>
>>   2. Use gitmailmap(5) functionality to resolve historical emails to
>>      primary emails.
>>
>>      Sadly this doesn't actually solve the email recycling problem.
>>      Since one email could be used by multiple developers, there's no
>>      way (that I can see) to use a single mailmap file to resolve one
>>      of these emails to a single person.
>
>
>GNU arch (tla) had an interesting idea around this area and used a
>combination of time and e-mail address to identify a person.
>one@corp--$date referred to the person who had control of the
>address on the specified date, where $date can be abbreviated to
>2022 or 202201 to mean 20220101.
>
>The mailmap allows "Name e-mail" or "e-mail" to be mapped to
>canonical "Name e-mail", but we should be able to coax "valid time
>range" information into each entry of the .mailmap format,
>i.e. "if you see 'Name e-mail' between time X and Y, map that to...".

Is there anything we could do around the new signature infrastructure
relating to this? I am NOT a fan of SSH keys without passphrases, but what
if we could use the coaxing above and map to SSH expiring keys then stitch
in signatures (a.k.a. sign the commits) to correspond to the users in the
given timeframe - then destroy the private keys to prevent further signing.
After that the Name/email becomes somewhat irrelevant from an integrity
standpoint.


* Re: Dealing with corporate email recycling
From: Philip Oakley @ 2022-03-13 12:20 UTC
  To: Sean Allred, git; +Cc: sallred, grmason, sconrad

On 12/03/2022 22:38, Sean Allred wrote:
> Hi all,
>
> We are currently replaying a 15-year SVN history into Git -- with
> contributions from thousands of developers -- and are faced with the
> challenge of corporate email recycling, departures, re-hires, and name
> changes causing identity issues.
Naming is a big issue [1,2].

Do you already have a map of those personal name and email name changes
that are causing conflicts, or are you hoping for a way of detecting
such changes? If you already know which names produce conflicts you are
more than half way there.

If you do know of the name conflicts (e.g. when `John Doe` changed to
`Jane Doe2`, then acquired `Jane Doe`, before being put back to `Jane
Doe2`), do you have dates for the change-over to map onto the commit
dates (assuming no slop or author/committer date slip)?  At least with
the change-over dates you can apply the mapping during the history
transfer.

An alternative option is to simply accept the fact that history is
messy, and use internal corporate knowledge for the few cases that cause
the major issues. At some point it always gets to be a Gödel Grammar
(needing another rule).

Philip

[1]
https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
[2] https://acrl.ala.org/techconnect/post/names-are-hard/

* Re: Dealing with corporate email recycling
From: Sean Allred @ 2022-03-13 13:35 UTC
  To: Philip Oakley; +Cc: git, sallred, grmason, sconrad


Philip Oakley <philipoakley@iee.email> writes:
> Do you already have a map of those personal name and email name changes
> that are causing conflicts, or are you hoping for a way of detecting
> such changes? If you already know which names produce conflicts you are
> more than half way there.
>
> If you do know of the name conflicts (e.g. when `John Doe` changed to
> `Jane Doe2`, then acquired `Jane Doe`, before being put back to `Jane
> Doe2`), do you have dates for the change-over to map onto the commit
> dates (assuming no slop or author/committer date slip)?  At least with
> the change-over dates you can apply mapping during the history transfer.

Whether or not we have maintained such a list in real-time remains to be
seen, but we have been able to put together such a list using both SVN
history and a variety of other internal data sources.  It's worth noting
that, if we do have to use this generated mapping, a mapping of over 10k
entries is sure to have the odd mistake every now and then.  So it's
not *totally* trustworthy (and never could be).

> An alternate option is to simply stick with the fact that history is
> messy, and use internal corporate knowledge for the few cases that cause
> the major issues. At some point it always gets to be a Gödel Grammar
> (needing another rule).

This is definitely on the table, though it seems a shame to not attempt
to use the information we've been able to compile.

--
Sean Allred

* Re: Dealing with corporate email recycling
From: Sean Allred @ 2022-03-13 14:01 UTC
  To: rsbecker; +Cc: 'Junio C Hamano', git, sallred, grmason, sconrad


<rsbecker@nexbridge.com> writes:
> Is there anything we could do around the new signature infrastructure
> relating to this? I am NOT a fan of SSH keys without passphrases, but what
> if we could use the coaxing above and map to SSH expiring keys then stitch
> in signatures (a.k.a. sign the commits) to correspond to the users in the
> given timeframe - then destroy the private keys to prevent further signing.
> After that the Name/email becomes somewhat irrelevant from an integrity
> standpoint.

Is this really possible?  Is it really as straightforward as splicing in
some text into the commit message to the effect of 'this commit is
signed' along with some signature artifact calculated pre-signing?

Though I'll note I *think* this would only solve the problem for the
committer field -- it's my current understanding that a commit can only
be signed by one signature.  (I have heard of systems that generate a
new key that is then signed by multiple signatures, then signing with
that new key -- but even if this is possible, it seems pretty involved
for such a common workflow.  This level of coordination might not be
possible for us -- especially given the merge workflows we've needed to
create to accommodate our current release process.)

--
Sean Allred

* RE: Dealing with corporate email recycling
From: rsbecker @ 2022-03-13 14:20 UTC
  To: 'Sean Allred'
  Cc: 'Junio C Hamano', git, sallred, grmason, sconrad

On March 13, 2022 10:01 AM, Sean Allred wrote:
><rsbecker@nexbridge.com> writes:
>> Is there anything we could do around the new signature infrastructure
>> relating to this? I am NOT a fan of SSH keys without passphrases, but
>> what if we could use the coaxing above and map to SSH expiring keys
>> then stitch in signatures (a.k.a. sign the commits) to correspond to
>> the users in the given timeframe - then destroy the private keys to
>> prevent further signing.
>> After that the Name/email becomes somewhat irrelevant from an
>> integrity standpoint.
>
>Is this really possible?  Is it really as straightforward as splicing in
>some text into the commit message to the effect of 'this commit is
>signed' along with some signature artifact calculated pre-signing?
>
>Though I'll note I *think* this would only solve the problem for the
>committer field -- it's my current understanding that a commit can only
>be signed by one signature.  (I have heard of systems that generate a
>new key that is then signed by multiple signatures, then signing with
>that new key -- but even if this is possible, it seems pretty involved
>for such a common workflow.  This level of coordination might not be
>possible for us -- especially given the merge workflows we've needed to
>create to accommodate our current release process.)

(I am a little nervous about this advice, hoping others will chime in and
correct anything wrong here)

While this will change the commit hashes, AFAIK, the other metadata is
preserved, including date, author, and committer. Set up the specific
keys/settings in ssh-agent and the user.signingKey value, then:

    git filter-branch --commit-filter 'git commit-tree -S "$@";' <FROM-COMMIT>..<TO-COMMIT>

Others might have a better way of doing this or may tell me this will not
work. Test this before you do it. I have not done this operation before. You
do need to start from the oldest commit going forward otherwise I think that
filter-branch will (should!) invalidate child commits. I suspect this is
going to be a rather lengthy script to build and run.


* Re: Dealing with corporate email recycling
From: Sean Allred @ 2022-03-13 14:41 UTC
  To: rsbecker; +Cc: 'Junio C Hamano', git, sallred, grmason, sconrad


<rsbecker@nexbridge.com> writes:
> (I am a little nervous about this advice, hoping others will chime in and
> correct anything wrong here)
>
> While this will change the commit hashes, AFAIK, the other metadata is
> preserved, including date, author, and committer. Set up the specific
> keys/settings in ssh-agent and the user.signingKey value, then:
>
> git filter-branch --commit-filter 'git commit-tree -S "$@";'
> <FROM-COMMIT>..<TO-COMMIT>
>
> Others might have a better way of doing this or may tell me this will not
> work. Test this before you do it. I have not done this operation before. You
> do need to start from the oldest commit going forward otherwise I think that
> filter-branch will (should!) invalidate child commits. I suspect this is
> going to be a rather lengthy script to build and run.

Given the size of our history (several orders of magnitude larger than
linux.git), using git-filter-branch after the fact is certainly not
ideal.  The replay already takes a week to run (we're IO-bound).  We'd
rather extend git-fast-import to allow signing commits instead
-- which comes back to our shared 'nervousness' about this approach in
general: I don't know that Git should endorse this as a standard option.

But yes -- hoping others can chime in with more thoughts :-)

--
Sean Allred

* RE: Dealing with corporate email recycling
From: rsbecker @ 2022-03-13 15:02 UTC
  To: 'Sean Allred'
  Cc: 'Junio C Hamano', git, sallred, grmason, sconrad

On March 13, 2022 10:41 AM, Sean Allred wrote:
><rsbecker@nexbridge.com> writes:
>> (I am a little nervous about this advice, hoping others will chime in
>> and correct anything wrong here)
>>
>> While this will change the commit hashes, AFAIK, the other metadata is
>> preserved, including date, author, and committer. Set up the specific
>> keys/settings in ssh-agent and the user.signingKey value, then:
>>
>> git filter-branch --commit-filter 'git commit-tree -S "$@";'
>> <FROM-COMMIT>..<TO-COMMIT>
>>
>> Others might have a better way of doing this or may tell me this will
>> not work. Test this before you do it. I have not done this operation
>> before. You do need to start from the oldest commit going forward
>> otherwise I think that filter-branch will (should!) invalidate child
>> commits. I suspect this is going to be a rather lengthy script to build and run.
>
>Given the size of our history (several orders of magnitude larger than linux.git),
>using git-filter-branch after the fact is certainly not ideal.  The replay already takes
>a week to run (we're IO-bound).  We'd rather want to extend git-fast-import to
>allow signing commits instead
>-- which comes back to our shared 'nervousness' about this approach in
>general: I don't know that Git should endorse this as a standard option.
>
>But yes -- hoping others can chime in with more thoughts :-)

I have another reluctant suggestion, but it depends on your industry,
regulations, and other factors. In some sectors, there is a
requirement to keep only some period of time worth of history. In
fact, in some settings, keeping user-identifying information beyond,
say, 7 years is actually problematic. Pruning your history may be not
only an option but required. An alternative is to use filter-branch to
essentially tokenize the identities of past authors and keep those in
an electronic vault somewhere. I have customers who are interpreting
GDPR-like rules in just such a situation, where employees who left 7
years ago cannot be retained, by name, in the repos. I am not
personally happy about that, because my own repo-OCD demands that I
know exactly who did what until the end of time, but according to
them, it actually violates the local regulations. I'm sure you have
had conversations with lawyers, yes? ☹
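
A minimal Python sketch of the tokenizing idea, under stated
assumptions: identities are replaced by a keyed hash so the same
person always gets the same token, and the token-to-person mapping
(the 'vault') is stored separately under its own access controls. The
function names, the token layout, and the `tokenized.invalid` domain
are all hypothetical, and the history rewrite itself is not shown.

```python
import hashlib
import hmac


def tokenize_identity(name, email, secret_key):
    """Replace a name/email pair with a stable, non-identifying token."""
    digest = hmac.new(
        secret_key,
        f"{name} <{email}>".encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()[:16]
    # The .invalid TLD is reserved, so these addresses can never receive mail.
    return f"user-{digest}", f"{digest}@tokenized.invalid"


def tokenize_all(identities, secret_key):
    """Return (rewritten identities, vault mapping token email -> original)."""
    vault = {}
    rewritten = []
    for name, email in identities:
        token_name, token_email = tokenize_identity(name, email, secret_key)
        vault[token_email] = (name, email)
        rewritten.append((token_name, token_email))
    return rewritten, vault
```

Using a keyed hash rather than a plain one means the tokens cannot be
reversed by simply hashing a list of known employee addresses, as long
as the key stays with the vault.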


* Re: Dealing with corporate email recycling
From: Sean Allred @ 2022-03-13 15:21 UTC
  To: rsbecker; +Cc: 'Junio C Hamano', git, sallred, grmason, sconrad


<rsbecker@nexbridge.com> writes:
> I have another reluctant suggestion, but it depends on your industry,
> regulations, and other factors. In some sectors, there is a
> requirement to keep only some period of time worth of history. In
> fact, in some settings, keeping user identifying information beyond,
> say 7 years, actually is problematic. Pruning your history may be not
> only an option but required. An alternative is to use filter-branch to
> essentially tokenize the identities of past authors and keep those in
> an electronic vault somewhere. I have customers who are interpreting
> GDPR-like rules in just such a situation, where employees who left 7
> years ago cannot be retained, by name, in the repos. I am not personally
> happy about that, because my own repo-OCD demands that I know exactly
> who did what until the end of time, but according to them, it actually
> violates the local regulations. I'm sure you have had conversations
> with lawyers, yes? ☹

I don't believe we've involved our legal team here (I'll follow up with
them internally), but that might be a spin-off discussion for folks who
know they're affected.  It would seem that the design of Git makes
purging history on an ongoing basis problematic -- you would always have
at least one unresolvable reference to a parent commit.  If this is a
real requirement from GDPR-like laws, either 'reasonable' VCS metadata
needs to be a specific carve-out in those laws -- but who the heck knows
what is 'reasonable' -- or as a project, Git needs to have an answer to
this situation and an ability to truncate history without otherwise
altering it.

It's also worth noting that even in the last five years, at our scale,
we've definitely run into the email-recycling problem already.

Being based in the U.S. and not having seen pitchforks about this yet,
I'd like to assume for the purpose of this discussion that we're keeping
all our history.

I think if the topic of legal implications of keeping history in
perpetuity is valuable to continue, we should spin it off into a
separate thread.  Personally I'm not seeing what we (Git) could
realistically do about it other than provide recommendations and paths
forward -- which might require considerable development.

--
Sean Allred

* Re: Dealing with corporate email recycling
From: Ævar Arnfjörð Bjarmason @ 2022-03-13 15:51 UTC
  To: Sean Allred; +Cc: git, sallred, grmason, sconrad


On Sat, Mar 12 2022, Sean Allred wrote:

> We are currently replaying a 15-year SVN history into Git -- with
> contributions from thousands of developers -- and are faced with the
> challenge of corporate email recycling, departures, re-hires, and name
> changes causing identity issues.
>
> * Background
>
> As you know (also to validate my own knowledge/assumptions), a Git
> commit stores identity as a name and an email.  The only means to
> validate this information is via signing; commits are otherwise taken
> at face-value.  This seems pretty core to Git's decentralized design.
> So to identify who is responsible for a commit, you have only the two
> name+email pairs.
>
> The problem in a nutshell: names and emails change over time.  The
> simple cases can be handled by gitmailmap, but there are more
> challenging cases:
>
>   - A commit author might have had some email <one@corp.net>, but then
>     was able to 'upgrade' to <two@corp.net> after a departure.
>
>   - It's even possible that this departure might 'boomerang' and
>     return to their old job, albeit now with a different email (since
>     they forfeited <two@corp.net> upon departure).
> [...skip a bunch of details...]
> You're at the end now; thanks for reading :-)

Aside from technical solutions and twists on mailmap, you haven't
*really* described what practical problem you're facing here.

> Somewhere down the line, Ada has left, <foo@example.com> transferred to
> Roy, and he wrote the following commit:

I.e. this, sure, that can happen, but what's the negative effect of that
in practice?

I've been involved in similar migrations in the past, and the primary
way to deal with it was to mostly ignore it, especially in a corporate
setting.

I.e. sure, you'll have some edge cases here and there, but the value of
knowing who exactly authored something tends to be proportional to how
recent the commit is.

If someone wrote something 10 years ago they're probably not even
working there anymore, or if they are, they will long since have forgotten
what they need to know to answer any specific questions etc.

The only people who tend to look at it are developers using "git blame"
or something, and usually humans are smart enough to spot that even if
it's foo@example.com they were expecting Roy, not Ada, or the other way
around.

Side note: To the extent that I've had to deal with this (in a corporate
setting) I found myself wanting git to have the exact opposite,
i.e. some feature where we'd just hide the author for any work
that's >5 years old or whatever.

Not for any privacy reason, but just because some UI's wouldn't really
communicate (in a way that people actually noticed) that the relevant
work was ancient, and someone who'd since long-moved-on would get
occasional interruptions due to ancient code they wrote but weren't
equipped to currently maintain.

Or similarly, to have anything >N years old "git blame" to the team
currently maintaining that thing, not to the person.

But I digress.

Having said that I think if you do need such a back-annotated history
you should look into "git notes" and/or "git replace". I.e. you could
have some lookup system maintain a mapping from OIDs to current IDs.

I've implemented a system like that in the past (in a MySQL table, but
whatever). I'd think this use-case of perfectly annotated old history is
probably obscure enough that that's the primary thing we should steer
people towards...
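
For illustration, a minimal sketch of such an OID-to-identity lookup, in
Python.  SQLite stands in for the MySQL table mentioned above; the table
names, the shortened OIDs and the people are all made up for the example,
and none of this is a feature of Git itself.

```python
import sqlite3

def build_lookup(rows, people):
    """Create an in-memory lookup: commit OID -> stable person ID -> contact."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE commit_author (oid TEXT PRIMARY KEY, person_id TEXT NOT NULL)")
    db.execute("CREATE TABLE person (person_id TEXT PRIMARY KEY, name TEXT, email TEXT)")
    db.executemany("INSERT INTO commit_author VALUES (?, ?)", rows)
    db.executemany("INSERT INTO person VALUES (?, ?, ?)", people)
    return db

def current_identity(db, oid):
    """Return 'Name <email>' for the person recorded for `oid`, or None."""
    row = db.execute(
        "SELECT p.name, p.email FROM commit_author c "
        "JOIN person p ON p.person_id = c.person_id WHERE c.oid = ?",
        (oid,),
    ).fetchone()
    return None if row is None else f"{row[0]} <{row[1]}>"

# Hypothetical data: both commits carry foo@example.com in their headers,
# but the table records who actually held the address at the time.
db = build_lookup(
    rows=[("1111111", "A"), ("2222222", "B")],
    people=[("A", "A. U. Thor", "ada@example.com"),
            ("B", "Roy G. Biv", "foo@example.com")],
)
print(current_identity(db, "1111111"))
```

Only the person table needs updating when someone's name or address
changes; the commits themselves are never rewritten.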

>   1. As far as I know, the mailmap format is pretty well-established.
>      I don't know how additions/extensions to the format will be
>      interpreted by other tools.

It's perfectly OK to change parts of that format in
backwards-"incompatible" ways, i.e. there's enough leeway in the
existing format definition and in-the-wild readers to have new readers
pick up new information that old readers will ignore.

I.e. we simply ignore things we can't map now, so one way to do it is to
start with something that produces an invalid (but harmless) mapping to
current readers, another is to borrow a trick from "/etc/sudoers" and
(ab)use the comment syntax.
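
As a rough sketch of the comment-syntax variant (Python): extension data
rides in lines that start with "#", which existing readers already skip.
The "#@ key: value" spelling and the uuid/valid-after keys are invented
for this example and are not an existing or proposed Git format.

```python
import re

# Hypothetical magic-comment syntax: "#@ key: value", applying to the next entry.
MAGIC = re.compile(r"^#@\s*(?P<key>[\w-]+)\s*:\s*(?P<value>.+)$")

def parse(text):
    """Return (entry_line, metadata) pairs as an extension-aware reader sees them."""
    entries, pending = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            m = MAGIC.match(line)
            if m:
                pending[m["key"]] = m["value"].strip()
            continue  # ordinary comments are ignored by everyone
        entries.append((line, pending))
        pending = {}
    return entries

def legacy_view(text):
    """What a reader that only knows about '#' comments would see."""
    return [l.strip() for l in text.splitlines()
            if l.strip() and not l.lstrip().startswith("#")]

SAMPLE = """\
# ordinary comment
#@ uuid: A
A. U. Thor <ada@example.com> <foo@example.com>
#@ uuid: B
#@ valid-after: 2015-06-01
Roy G. Biv <roy@example.com> <foo@example.com>
"""

for entry, meta in parse(SAMPLE):
    print(entry, meta)
```

Old readers get exactly the two plain entries from legacy_view(); only
readers that know the convention see the extra metadata.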


* Re: Dealing with corporate email recycling
  2022-03-12 22:38 Dealing with corporate email recycling Sean Allred
                   ` (2 preceding siblings ...)
  2022-03-13 15:51 ` Ævar Arnfjörð Bjarmason
@ 2022-03-13 17:22 ` brian m. carlson
  2022-03-13 17:52   ` rsbecker
  2022-03-15  1:27 ` Sean Allred
  2022-03-18 21:22 ` Peter Krefting
  5 siblings, 1 reply; 28+ messages in thread
From: brian m. carlson @ 2022-03-13 17:22 UTC (permalink / raw)
  To: Sean Allred; +Cc: git, sallred, grmason, sconrad


On 2022-03-12 at 22:38:56, Sean Allred wrote:
> * Proposal: UUIDs
> 
> To get what we want (i.e., the ability to run `git show HEAD~1`, know
> that Ada wrote it, and report her current contact information), we
> need some way of tracking identity over time.  A naive solution could
> be to extend the mailmap format as recognized by Git:
> 
>     $ git cat-file blob HEAD~1:.mailmap
>     A. U. Thor <foo@example.com> [uuid A] <ada@example.com>
> 
>     $ git cat-file blob HEAD:.mailmap
>     A. U. Thor <ada@example.com> [uuid A]
>     Roy G. Biv <foo@example.com> [uuid B] <roy@example.com>
> 
> Now, when I run `git show HEAD~1`, Git would determine the UUID of the
> email on the commit using the mailmap version in that tree:
> 
>     $ git -c mailmap.blob=HEAD~1:.mailmap check-mailmap --uuid "<foo@example.com>"
>     A
> 
> Then, we can use that UUID to resolve to the current contact information:
> 
>     $ git check-mailmap --uuid=A
>     A. U. Thor <ada@example.com>
> 
> Mailmap-sensitive commands can use this logic internally -- possibly
> guarded under some new config setting.

It's my intention to implement an approach where people's emails are
identified by a key fingerprint of some sort and then converted into the
proper email address by a mailmap that lives outside of the main
history.  That is, my email address might be
ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad@ssh-sha256.ns.git-scm.com,
and then we have a mailmap that converts between the two.  If you wanted
to have a UUID-based one, you could do
77c747a3-1599-4c8c-9569-f729c17632e6@uuid.ns.git-scm.com (assuming that
namespace were registered).

The benefit to the key part is that you can essentially prove that you
are who you say you are.  A UUID doesn't offer that possibility.

This was discussed briefly at some sort of contributor summit we had at
some point, but I've been busy and haven't gotten to it yet.  It is on
my list of projects, however.
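
A small sketch of the idea in Python, under stated assumptions: the
identifier is the hex SHA-256 of the key material followed by the
namespace domain used in the example above (which, as noted, is not
registered), and the out-of-history mailmap is a plain dictionary.  The
bytes b"abc" are only a stand-in for a real public key, and real SSH
fingerprints are normally displayed in base64 rather than hex.

```python
import hashlib

# Namespace taken from the example address above; assumed, not registered.
KEY_NAMESPACE = "ssh-sha256.ns.git-scm.com"

def fingerprint_address(public_key: bytes, namespace: str = KEY_NAMESPACE) -> str:
    """Build an opaque, address-shaped identifier from key material."""
    return f"{hashlib.sha256(public_key).hexdigest()}@{namespace}"

# Hypothetical mapping kept outside the main history.
EXTERNAL_MAILMAP = {}

def register(public_key: bytes, name: str, email: str) -> None:
    """Record the current contact details for the holder of a key."""
    EXTERNAL_MAILMAP[fingerprint_address(public_key)] = (name, email)

def resolve(address: str) -> str:
    """Return 'Name <email>' if known, otherwise the opaque address itself."""
    hit = EXTERNAL_MAILMAP.get(address)
    return address if hit is None else f"{hit[0]} <{hit[1]}>"

register(b"abc", "A. U. Thor", "ada@example.com")
print(fingerprint_address(b"abc"))
print(resolve(fingerprint_address(b"abc")))
```

Commits would carry only the opaque address, so a later change of name or
mailbox touches the external mapping and nothing else.
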
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA



* RE: Dealing with corporate email recycling
  2022-03-13 17:22 ` brian m. carlson
@ 2022-03-13 17:52   ` rsbecker
  2022-03-13 19:47     ` rsbecker
  0 siblings, 1 reply; 28+ messages in thread
From: rsbecker @ 2022-03-13 17:52 UTC (permalink / raw)
  To: 'brian m. carlson', 'Sean Allred'
  Cc: git, sallred, grmason, sconrad

On March 13, 2022 1:22 PM, brian m. carlson wrote:
>On 2022-03-12 at 22:38:56, Sean Allred wrote:
>> * Proposal: UUIDs
>>
>> To get what we want (i.e., the ability to run `git show HEAD~1`, know
>> that Ada wrote it, and report her current contact information), we
>> need some way of tracking identity over time.  A naive solution could
>> be to extend the mailmap format as recognized by Git:
>>
>>     $ git cat-file blob HEAD~1:.mailmap
>>     A. U. Thor <foo@example.com> [uuid A] <ada@example.com>
>>
>>     $ git cat-file blob HEAD:.mailmap
>>     A. U. Thor <ada@example.com> [uuid A]
>>     Roy G. Biv <foo@example.com> [uuid B] <roy@example.com>
>>
>> Now, when I run `git show HEAD~1`, Git would determine the UUID of the
>> email on the commit using the mailmap version in that tree:
>>
>>     $ git -c mailmap.blob=HEAD~1:.mailmap check-mailmap --uuid "<foo@example.com>"
>>     A
>>
>> Then, we can use that UUID to resolve to the current contact information:
>>
>>     $ git check-mailmap --uuid=A
>>     A. U. Thor <ada@example.com>
>>
>> Mailmap-sensitive commands can use this logic internally -- possibly
>> guarded under some new config setting.
>
>It's my intention to implement an approach where people's emails are identified
>by a key fingerprint of some sort and then converted into the proper email
>address by a mailmap that lives outside of the main history.  That is, my email
>address might be
>ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad@ssh-sha256.ns.git-scm.com,
>and then we have a mailmap that converts between the two.  If you wanted to
>have a UUID-based one, you could do
>77c747a3-1599-4c8c-9569-f729c17632e6@uuid.ns.git-scm.com (assuming that
>namespace were registered).
>
>The benefit to the key part is that you can essentially prove that you are who you
>say you are.  A UUID doesn't offer that possibility.
>
>This was discussed briefly at some sort of contributor summit we had at some
>point, but I've been busy and haven't gotten to it yet.  It is on my list of projects,
>however.

This could require a global and security hardened tokenization or signing approach. Email fingerprints from one organization would have to be able to move to another organization easily - potentially as part of the git repo's metadata. I would not use the same key as is used for signing fingerprints (mostly out of paranoia), but this is conceptually similar to the public side of a key-pair. One would have to have access to the private key in order to be a committer/author. Unfortunately, as it stands today, that may be easily spoofed (--committer, --author), so that part of the code would have to change with safeguards on what can be supplied - something I think would be welcome. Keeping with a distributed philosophy is probably essential. Just my take on it.



* RE: Dealing with corporate email recycling
  2022-03-13 17:52   ` rsbecker
@ 2022-03-13 19:47     ` rsbecker
  2022-03-13 22:23       ` Sean Allred
  0 siblings, 1 reply; 28+ messages in thread
From: rsbecker @ 2022-03-13 19:47 UTC (permalink / raw)
  To: rsbecker, 'brian m. carlson', 'Sean Allred'
  Cc: git, sallred, grmason, sconrad

On March 13, 2022 1:53 PM, I wrote:
>On March 13, 2022 1:22 PM, brian m. carlson wrote:
>>On 2022-03-12 at 22:38:56, Sean Allred wrote:
>>> * Proposal: UUIDs
>>>
>>> To get what we want (i.e., the ability to run `git show HEAD~1`, know
>>> that Ada wrote it, and report her current contact information), we
>>> need some way of tracking identity over time.  A naive solution could
>>> be to extend the mailmap format as recognized by Git:
>>>
>>>     $ git cat-file blob HEAD~1:.mailmap
>>>     A. U. Thor <foo@example.com> [uuid A] <ada@example.com>
>>>
>>>     $ git cat-file blob HEAD:.mailmap
>>>     A. U. Thor <ada@example.com> [uuid A]
>>>     Roy G. Biv <foo@example.com> [uuid B] <roy@example.com>
>>>
>>> Now, when I run `git show HEAD~1`, Git would determine the UUID of
>>> the email on the commit using the mailmap version in that tree:
>>>
>>>     $ git -c mailmap.blob=HEAD~1:.mailmap check-mailmap --uuid "<foo@example.com>"
>>>     A
>>>
>>> Then, we can use that UUID to resolve to the current contact information:
>>>
>>>     $ git check-mailmap --uuid=A
>>>     A. U. Thor <ada@example.com>
>>>
>>> Mailmap-sensitive commands can use this logic internally -- possibly
>>> guarded under some new config setting.
>>
>>It's my intention to implement an approach where people's emails are
>>identified by a key fingerprint of some sort and then converted into
>>the proper email address by a mailmap that lives outside of the main
>>history.  That is, my email address might be
>>ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad@ssh-sha256.ns.git-scm.com,
>>and then we have a mailmap that converts between the two.  If you
>>wanted to have a UUID-based one, you could do
>>77c747a3-1599-4c8c-9569-f729c17632e6@uuid.ns.git-scm.com (assuming that
>>namespace were registered).
>>
>>The benefit to the key part is that you can essentially prove that you
>>are who you say you are.  A UUID doesn't offer that possibility.
>>
>>This was discussed briefly at some sort of contributor summit we had at
>>some point, but I've been busy and haven't gotten to it yet.  It is on
>>my list of projects, however.
>
>This could require a global and security hardened tokenization or signing approach.
>Email fingerprints from one organization would have to be able to move to
>another organization easily - potentially as part of the git repo's metadata. I would
>not use the same key as is used for signing fingerprints (mostly out of paranoia),
>but this is conceptually similar to the public side of a key-pair. One would have to
>have access to the private key in order to be a committer/author. Unfortunately,
>as it stands today, that may be easily spoofed (--committer, --author), so that part
>of the code would have to change with safeguards on what can be supplied -
>something I think would be welcome. Keeping with a distributed philosophy is
>probably essential. Just my take on it.

What about abstracting this into a map-email or map-identity hook of some kind? It would be invoked whenever there is a need to write an identity (committer, author, signed-off-by, etc.). That way, anyone who wants to can implement whatever policy they want for replacing emails with some other value in the repo, and back again. It might be good to optimize it so that the hook is only invoked once per identity per request so that git log does not become insanely expensive.

Something like map-identity from <internal-value>  and map-identity to <external-value>, for example:

map-identity from "Randall S. Becker <rsbecker@nexbridge.com>"      > A056AAB2123

And

map-identity to A056AAB2123      >  Randall S. Becker <rsbecker@nexbridge.com>

Again, just a notion.
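
To make the notion concrete, a minimal Python sketch of the two
directions, with the "once per identity per request" optimization done
by a cache.  The function names mirror the from/to wording above; the
token derivation (a truncated SHA-256) and the in-memory store are
placeholders chosen for the example, not a proposed policy.

```python
import hashlib
from functools import lru_cache

_token_to_identity = {}   # stand-in for whatever store a real policy would use
calls = {"from": 0}       # counts how often the expensive path actually runs

@lru_cache(maxsize=None)
def map_identity_from(identity: str) -> str:
    """External identity -> opaque token to be stored in the repository."""
    calls["from"] += 1
    token = hashlib.sha256(identity.encode("utf-8")).hexdigest()[:11].upper()
    _token_to_identity[token] = identity
    return token

def map_identity_to(token: str) -> str:
    """Opaque token -> external identity; unknown tokens pass through unchanged."""
    return _token_to_identity.get(token, token)

ident = "Randall S. Becker <rsbecker@nexbridge.com>"
tokens = [map_identity_from(ident) for _ in range(1000)]  # e.g. one per log entry
print(calls["from"], map_identity_to(tokens[0]))
```

However many commits mention the same identity, the mapping policy runs
once for it and every later lookup is served from the cache.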



* Re: Dealing with corporate email recycling
  2022-03-13 15:21             ` Sean Allred
@ 2022-03-13 19:57               ` Philip Oakley
  2022-03-13 22:40                 ` Sean Allred
  0 siblings, 1 reply; 28+ messages in thread
From: Philip Oakley @ 2022-03-13 19:57 UTC (permalink / raw)
  To: Sean Allred, rsbecker
  Cc: 'Junio C Hamano', git, sallred, grmason, sconrad

On 13/03/2022 15:21, Sean Allred wrote:
> <rsbecker@nexbridge.com> writes:
>> I have another reluctant suggestion, but it depends on your industry,
>> regulations, and other factors. In some sectors, there is a
>> requirement to keep only some period of time worth of history. In
>> fact, in some settings, keeping user identifying information beyond,
>> say 7 years, actually is problematic. Pruning your history may be not
>> only an option but required. An alternative is to use filter-branch to
>> essentially tokenize the identities of past authors and keep those in
>> an electronic vault somewhere. I have customers who are interpreting
>> GDPR-like rules as just such a situation, where employees gone 7 years
>> ago cannot be retained, by name, in the repos. I am not personally
>> happy about that, because my own repo-OCD demands that I know exactly
>> who did what until the end of time, but according to them, it actually
>> violates the local regulations. I'm sure you have had conversations
>> with lawyers, yes? ☹
> I don't believe we've involved our legal team here (I'll follow up with
> them internally), but that might be a spin-off discussion for folks who
> know they're affected.  It would seem that the design of Git makes
> purging history on an ongoing basis problematic -- you would always have
> at least one unresolvable reference to a parent commit.  If this is a
> real requirement from GDPR-like laws, either 'reasonable' VCS metadata
> needs to be a specific carve-out in those laws -- but who the heck knows
> what is 'reasonable' -- or as a project, Git needs to have an answer to
> this situation and an ability to truncate history without otherwise
> altering it.
>
> It's also worth noting that even in the last five years, at our scale,
> we've definitely run into the email-recycling problem already.
>
> Being based in the U.S. and not having seen pitchforks about this yet,
> I'd like to assume for the purpose of this discussion that we're keeping
> all our history.
>
> I think if the topic of legal implications of keeping history in
> perpetuity is valuable to continue, we should spin it off into a
> separate thread.  Personally I'm not seeing what we (Git) could
> realistically do about it other than provide recommendations and paths
> forward -- which might require considerable development.
>
>
The GDPR isn't as onerous as some suggest, as it isn't a set of black
and white rules; rather, in cases like these, you need to have a really
strong reason for why data is retained, such as its being part of the
verification and validation of the commit data. There have been various
discussions around this in many of the technical journals.

It may be that your internal Git version could disable the particular
`format` option ('%ae'?) for the original name, so only the designated
('redacted') mailmap entry is shown to casual users (assumes the repo is
inside the corporate firewall). This would avoid invalidating the repo's
validation capability, while meeting the needs of GDPR type regulations.

In the same vein, a local Git version could, being open source, add
allowances for your extra mailmap entry details, such as a
" % <approxidate>" postfix that limits the use of a particular name/email
combo, allowing date ranges to emerge.

I noted that all the .mailmap examples in the man page have ">" as the
final character, but I haven't looked to see if the code always requires
that the last element of the entry is an <email> address, or whether it
currently barfs on extra elements.

--
Philip


* Re: Dealing with corporate email recycling
  2022-03-13 19:47     ` rsbecker
@ 2022-03-13 22:23       ` Sean Allred
  0 siblings, 0 replies; 28+ messages in thread
From: Sean Allred @ 2022-03-13 22:23 UTC (permalink / raw)
  To: rsbecker; +Cc: 'brian m. carlson', git, sallred, grmason, sconrad


<rsbecker@nexbridge.com> writes:
> What about abstracting this into a map-email or map-identity hook of
> some kind? So, whenever there is a need to write an identity
> (committer, author, signed-off-by, etc.). That way, anyone who wants
> to, can implement whatever policy they want for replacing emails with
> some other value in the repo, and back again.

This is an interesting idea, but I'm afraid it might be difficult for
forges to implement support for as opposed to something built-in.  If
that's not as difficult as I might think it is, then perhaps this is a
viable option once fleshed out.

> It might be good to optimize it so that the hook is only invoked once
> per identity per request so that git log does not become insanely
> expensive.

I'll add that Windows (and our particular environment) makes this
troublesome as well.  Currently we see a base cost of 300ms for starting
up a process.  Given how many identities git-log and friends would be
chugging through, any hook would need to be capable of staying open --
feeding identities through stdin.  I'm not sure run_hooks supports that
right now.
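
A sketch of what "staying open" could look like from the hook's side, in
Python: one long-lived process answers one identity per input line, so
the start-up cost is paid once.  The line-based protocol and the lookup
table are assumptions made for the example; nothing here reflects what
Git's hook machinery supports today.

```python
import io

# Hypothetical policy table; a real hook would consult whatever store it likes.
KNOWN = {"A056AAB2123": "Randall S. Becker <rsbecker@nexbridge.com>"}

def mapper(identity):
    """Map a stored token to a display identity; unknown values pass through."""
    return KNOWN.get(identity, identity)

def serve(mapper, requests, responses):
    """Answer one identity per input line until the input is exhausted."""
    for line in requests:
        identity = line.rstrip("\n")
        if not identity:
            continue
        responses.write(mapper(identity) + "\n")
        responses.flush()  # the caller is waiting on each answer

# In real use the streams would be sys.stdin and sys.stdout.
requests = io.StringIO("A056AAB2123\nunknown-token\n")
responses = io.StringIO()
serve(mapper, requests, responses)
print(responses.getvalue(), end="")
```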


--
Sean Allred


* Re: Dealing with corporate email recycling
  2022-03-13 19:57               ` Philip Oakley
@ 2022-03-13 22:40                 ` Sean Allred
  2022-03-13 23:16                   ` Junio C Hamano
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Allred @ 2022-03-13 22:40 UTC (permalink / raw)
  To: Philip Oakley
  Cc: rsbecker, 'Junio C Hamano', git, sallred, grmason, sconrad


Philip Oakley <philipoakley@iee.email> writes:
> The GDPR isn't as onerous as some suggest, as it isn't a set of black
> and white rules, rather in cases like these you need to have a real
> strong reason for why data is retained etc, such as being part of the
> verification and validation of the commit data. There have been various
> discussions around this in many of the technical journals.

It's good to hear that this has already been discussed in the
community (though I'm hardly surprised now that you mention it -- I'm
sure it was and remains a hot topic!).

> It may be that your internal Git version could disable the particular
> `format` option ('%ae'?) for the original name, so only the designated
> ('redacted') mailmap entry is shown to casual users (assumes the repo is
> inside the corporate firewall). This would avoid invalidating the repo's
> validation capability, while meeting the needs of GDPR type regulations.

I do want to note that at present we're not primarily concerned with
GDPR, but I am following up on that internally to see if there are any
considerations we need to make.  This is certainly an interesting tactic
for repositories that are hosted in GDPR-effective states, though.

> In the same vein, a local Git version could, being open source, add
> allowances for your extra mailmap entry details, such as a
> " % <approxidate>" postfix that limits the use of a particular name/email
> combo, allowing date ranges to emerge.

I'd prefer the ability to agree on a pattern and merge support for it
upstream.  This way, forges can pick up support, too.  Bonus points if
the forge doesn't necessarily have to do more work than it already does.

Your " % <approxidate>" suggestion sounds a lot like the 'Valid-Before/
Valid-After' proposal from my original post in this thread (admittedly
not my idea).  Is there a compelling reason to use this approach over
UUIDs?  I ask not to suggest there isn't a compelling reason, but mostly
to make sure we consider the best arguments (and drawbacks) for any/all
approaches.

> I noted that all the .mailmap examples in the man page have ">" as the
> final character, but I haven't looked to see if the code always requires
> that the last element of the entry is an <email> address, or whether it
> currently barfs on extra elements.

FYI mailmap does support comment syntax (starting with # through \n).
It's worth noting that Ævar suggested earlier that perhaps we could
"(ab)use the comment syntax".  I tend to prefer their other approach,
though:

    > I.e. we simply ignore things we can't map now, so one way to do it
    > is to start with something that produces an invalid (but harmless)
    > mapping to current readers, [...]

rather than use magic comments :-) Adapting to your suggestion, this
might look like the following:

    A. U. Thor <foo@example.com> <ada.example.com> <[ approxidate ]>

Would be curious to know what other mailmap readers exist and how they
would react to this.

--
Sean Allred


* Re: Dealing with corporate email recycling
  2022-03-13 22:40                 ` Sean Allred
@ 2022-03-13 23:16                   ` Junio C Hamano
  2022-03-13 23:23                     ` rsbecker
  2022-03-14 11:56                     ` Philip Oakley
  0 siblings, 2 replies; 28+ messages in thread
From: Junio C Hamano @ 2022-03-13 23:16 UTC (permalink / raw)
  To: Sean Allred; +Cc: Philip Oakley, rsbecker, git, sallred, grmason, sconrad

Sean Allred <allred.sean@gmail.com> writes:

> rather than use magic comments :-) Adapting to your suggestion, this
> might look like the following:
>
>     A. U. Thor <foo@example.com> <ada.example.com> <[ approxidate ]>

You'd probably want a timerange (valid-from and valid-to), instead
of one single timestamp?

Because at least three valid forms of mailmap entries should be
understood by the current generation of mailmap readers, i.e.

    Human Readable Name <e-mail@add.re.ss>
    Right Name <right@add.re.ss> <wrong@add.re.ss>
    Right Name <right@add.re.ss> Wrong Name <wrong@add.re.ss>

the extended entry format to record the validity timerange should
be chosen to cause parsers that are prepared to take these three
kinds of lines to barf and ignore.
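
A deliberately strict toy reader for the three forms listed above, in
Python, to show the "barf and ignore" behaviour one would want from an
extended entry.  This is not Git's actual parser, and it makes no claim
about how real mailmap readers treat the extended line; the form labels
are invented for the example.

```python
import re

NAME = r"[^<>#\s][^<>#]*?"   # starts with a non-space character, no brackets
ADDR = r"<[^<>]+>"

FORMS = [
    ("name-fix", re.compile(rf"(?P<name>{NAME})\s*(?P<email>{ADDR})")),
    ("email-fix", re.compile(
        rf"(?P<name>{NAME})\s*(?P<email>{ADDR})\s*(?P<old_email>{ADDR})")),
    ("name-and-email-fix", re.compile(
        rf"(?P<name>{NAME})\s*(?P<email>{ADDR})\s*"
        rf"(?P<old_name>{NAME})\s*(?P<old_email>{ADDR})")),
]

def classify(line):
    """Return (form, fields) for a recognised entry, or None to ignore the line."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    for label, pattern in FORMS:
        m = pattern.fullmatch(line)
        if m:
            return label, m.groupdict()
    return None  # unknown shape: skip it rather than guess

for entry in [
    "Human Readable Name <e-mail@add.re.ss>",
    "Right Name <right@add.re.ss> <wrong@add.re.ss>",
    "Right Name <right@add.re.ss> Wrong Name <wrong@add.re.ss>",
    "A. U. Thor <foo@example.com> <ada.example.com> <[ approxidate ]>",
]:
    result = classify(entry)
    print(result[0] if result else "ignored", "|", entry)
```

A reader written this way skips the extended line entirely instead of
producing a partial mapping from it.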


* RE: Dealing with corporate email recycling
  2022-03-13 23:16                   ` Junio C Hamano
@ 2022-03-13 23:23                     ` rsbecker
  2022-03-14  0:19                       ` Junio C Hamano
  2022-03-14 11:56                     ` Philip Oakley
  1 sibling, 1 reply; 28+ messages in thread
From: rsbecker @ 2022-03-13 23:23 UTC (permalink / raw)
  To: 'Junio C Hamano', 'Sean Allred'
  Cc: 'Philip Oakley', git, sallred, grmason, sconrad

On March 13, 2022 7:16 PM, Junio C Hamano wrote:
>Sean Allred <allred.sean@gmail.com> writes:
>
>> rather than use magic comments :-) Adapting to your suggestion, this
>> might look like the following:
>>
>>     A. U. Thor <foo@example.com> <ada.example.com> <[ approxidate ]>
>
>You'd probably want a timerange (valid-from and valid-to), instead of one
>single timestamp?
>
>Because at least three valid forms of mailmap entries should be understood
>by the current generation of mailmap readers, i.e.
>
>    Human Readable Name <e-mail@add.re.ss>
>    Right Name <right@add.re.ss> <wrong@add.re.ss>
>    Right Name <right@add.re.ss> Wrong Name <wrong@add.re.ss>
>
>the extended entry format to record the validity timerange should be chosen
>to cause parsers that are prepared to take these three kinds of lines to
>barf and ignore.

Could we not use SSH's ssh-keygen -V for this purpose when establishing
persistent identities independent of user/email? We already do this for
signed commits.



* Re: Dealing with corporate email recycling
  2022-03-13 23:23                     ` rsbecker
@ 2022-03-14  0:19                       ` Junio C Hamano
  0 siblings, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2022-03-14  0:19 UTC (permalink / raw)
  To: rsbecker
  Cc: 'Sean Allred', 'Philip Oakley',
	git, sallred, grmason, sconrad

<rsbecker@nexbridge.com> writes:

> Could we not use SSH's ssh-keygen -V for this purpose when establishing
> persistent identities independent of user/email? We already do this for
> signed commits.

Fingerprint of cryptographic key would be easy to use as an
identity, for which the person who claims ownership can easily
produce proof of ownership.  Various other "identifying strings"
like human readable name and e-mail addresses from different
validity periods can be all tied to such an identity.  Taking key
revocation into account, keys from different validity periods may
have to be tied together in the same way.  "The person who used to
sign the commits with key A and the person who signs the commits
with key B are the same, and in real life, they are known as
A. U. Thor"

But proving such a mapping in a meaningful way is much
harder, I would imagine, but perhaps addresses and human readable
names do not matter as much.  Or continuity of identity, for that
matter.  I dunno.





* Re: Dealing with corporate email recycling
  2022-03-13 23:16                   ` Junio C Hamano
  2022-03-13 23:23                     ` rsbecker
@ 2022-03-14 11:56                     ` Philip Oakley
  2022-03-14 21:24                       ` Junio C Hamano
  2022-03-15  1:23                       ` Sean Allred
  1 sibling, 2 replies; 28+ messages in thread
From: Philip Oakley @ 2022-03-14 11:56 UTC (permalink / raw)
  To: Junio C Hamano, Sean Allred; +Cc: rsbecker, git, sallred, grmason, sconrad

On 13/03/2022 23:16, Junio C Hamano wrote:
> Sean Allred <allred.sean@gmail.com> writes:
>
>> rather than use magic comments :-) Adapting to your suggestion, this
>> might look like the following:
>>
>>     A. U. Thor <foo@example.com> <ada.example.com> <[ approxidate ]>
> You'd probably want a timerange (valid-from and valid-to), instead
> of one single timestamp?
I'm not so sure that the date range approach won't bring its own
problems. What happens outside the date range? i.e. Do we then have
three identities: Before, During, and After, with only 'During' being
defined?

I more see a single date being used as a termination point for an
existing email sequence that defines a retrospective end point for the
mapping of the old email addresses to a single person. Future emails for
the same mailbox will be for a different 'current' person. This would
match the single linked list commit history view using the chronology
heuristic.

The key here is to have a final identity system in place so that you
can uniquely distinguish the old John Doe from the newer John Does at the
relevant time point in the mailmap.

>
> Because at least three valid forms of mailmap entries should be
> understood by the current generation of mailmap readers, i.e.
>
>     Human Readable Name <e-mail@add.re.ss>
>     Right Name <right@add.re.ss> <wrong@add.re.ss>
>     Right Name <right@add.re.ss> Wrong Name <wrong@add.re.ss>
>
> the extended entry format to record the validity timerange should
> be chosen to cause parsers that are prepared to take these three
> kinds of lines to barf and ignore.
The presence of a _sequence_ of name/email changes isn't well defined.
As I remember it we take the name/email updates in sequence and then
apply a last one wins approach. It's not clear what would be done when
we have two, or three different John Doe sequences all mixed in.


A broader issue for the corporate email mailbox systems is those that
are allocated to roles. So you may have Training1@corp.com thru
Training9@corp.com (we had) and if that training includes practical low
hanging fruit examples from a project, it's difficult to disambiguate
those commits. More likely is say, having TestPC1 - TestPC9 that
included debug commits, perhaps even with pair programming test & debug
sessions, so allocation to individuals (rather than mailbox) becomes a
real problem. Hopefully that's rare in Sean's case.

Philip



 


* Re: Dealing with corporate email recycling
  2022-03-13 13:35   ` Sean Allred
@ 2022-03-14 11:59     ` Philip Oakley
  0 siblings, 0 replies; 28+ messages in thread
From: Philip Oakley @ 2022-03-14 11:59 UTC (permalink / raw)
  To: Sean Allred; +Cc: git, sallred, grmason, sconrad

On 13/03/2022 13:35, Sean Allred wrote:
> It's worth noting
> that, if we do have to use this generated mapping, a mapping of over 10k
> entries is sure to have the odd mistake every now and then.  So it's
> not *totally* trustworthy (and never could be).
I'd agree there. Adding role based mailboxes will further complicate
such mappings.

Philip


* Re: Dealing with corporate email recycling
  2022-03-14 11:56                     ` Philip Oakley
@ 2022-03-14 21:24                       ` Junio C Hamano
  2022-03-14 22:25                         ` Philip Oakley
  2022-03-15  1:23                       ` Sean Allred
  1 sibling, 1 reply; 28+ messages in thread
From: Junio C Hamano @ 2022-03-14 21:24 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Sean Allred, rsbecker, git, sallred, grmason, sconrad

Philip Oakley <philipoakley@iee.email> writes:

> On 13/03/2022 23:16, Junio C Hamano wrote:
>> Sean Allred <allred.sean@gmail.com> writes:
>>
>>> rather than use magic comments :-) Adapting to your suggestion, this
>>> might look like the following:
>>>
>>>     A. U. Thor <foo@example.com> <ada.example.com> <[ approxidate ]>
>> You'd probably want a timerange (valid-from and valid-to), instead
>> of one single timestamp?
> I'm not so sure that the date range approach won't bring its own
> problems. What happens outside the date range? i.e. Do we then have
> three identities: Before, During, and After, with only 'During' being
> defined?

I have been assuming that the default is "what the commit has is
correct".

> I more see a single date being used as a termination point for an
> existing email sequence that defines a retrospective end point for the
> mapping of the old email addresses to a single person.

Implicitly specifying the valid-from date (which is either the
beginning of time, or the newest of valid-until time for the same
identifying string that is older than the valid-until date for the
entry in question) is fine.  I do not see fundamental difference
between the approach you suggest and having an explicit valid-from
date.
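
To make the implicit valid-from idea concrete, here is a minimal Python
sketch. It assumes a hypothetical table of (valid-until, person) pairs
per recycled address; the table, the dates, and the function name are
illustrative only and are not an existing mailmap feature. Each entry's
valid-from is implied by the previous entry's valid-until (or the
beginning of time), and anything after the last cutoff falls back to
"what the commit has is correct".

```python
from bisect import bisect_left
from datetime import date

# Hypothetical data: for each recycled address, (valid_until, person)
# pairs sorted by date. The dates are made up for illustration.
HANDOVERS = {
    "sean@corp.net": [
        (date(2015, 6, 30), "Sean Allred"),
        (date(2021, 1, 31), "Sean Allblue"),
    ],
}


def resolve(email, commit_date, default_name):
    """Pick the person who held `email` on `commit_date`.

    The first entry whose valid-until is on or after the commit date wins,
    so its valid-from is implicitly the day after the previous cutoff.
    """
    entries = HANDOVERS.get(email.lower(), [])
    cutoffs = [until for until, _ in entries]
    i = bisect_left(cutoffs, commit_date)
    if i < len(entries):
        return entries[i][1]
    # Past the last cutoff (or no entry at all): trust the commit itself.
    return default_name
```

With this shape there is no way to express a gap or an overlap for a
given address, which is the property discussed above.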


* Re: Dealing with corporate email recycling
  2022-03-14 21:24                       ` Junio C Hamano
@ 2022-03-14 22:25                         ` Philip Oakley
  0 siblings, 0 replies; 28+ messages in thread
From: Philip Oakley @ 2022-03-14 22:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Sean Allred, rsbecker, git, sallred, grmason, sconrad

On 14/03/2022 21:24, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>> On 13/03/2022 23:16, Junio C Hamano wrote:
>>> Sean Allred <allred.sean@gmail.com> writes:
>>>
>>>> rather than use magic comments :-) Adapting to your suggestion, this
>>>> might look like the following:
>>>>
>>>>     A. U. Thor <foo@example.com> <ada.example.com> <[ approxidate ]>
>>> You'd probably want a timerange (valid-from and valid-to), instead
>>> of one single timestamp?
>> I'm not so sure that the date range approach won't bring its own
>> problems. What happens outside the date range? i.e. Do we then have
>> three identities: Before, During, and After, with only 'During' being
>> defined?
> I have been assuming that the default is "what the commit has is
> correct".
That default is only true when there are no date limitations because of
email re-use. i.e. singleton persons with unique emails do fit that
default, which should be the majority.

If an old email has been reused, then that default becomes false, which
was Sean's starting point. In the corporate case, two (or more) distinct
individuals have used the same committer/author email address, and the
hope is for a way to disambiguate those persons based on their email and
the commit date.

>
>> I more see a single date being used as a termination point for an
>> existing email sequence that defines a retrospective end point for the
>> mapping of the old email addresses to a single person.
> Implicitly specifying the valid-from date (which is either the
> beginning of time, or the newest of valid-until time for the same
> identifying string that is older than the valid-until date for the
> entry in question) is fine.  I do not see fundamental difference
> between the approach you suggest and having an explicit valid-from
> date.
With the single-date approach, each date bisects the timeline, so we
guarantee that every point in the chronology is covered by a named
person. An explicit from/to range trisects it and can leave gaps with
no person allocated, or possibly overlaps.
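
As a rough illustration of that risk, the following Python sketch audits
a set of explicit (valid-from, valid-to, person) ranges for one address
and reports gaps and overlaps. The tuple layout, the inclusive-date
convention, and the function name are assumptions made for this example.

```python
from datetime import date, timedelta


def audit(ranges):
    """Report gaps and overlaps between explicit date ranges.

    `ranges` is a list of (valid_from, valid_to, person) tuples with
    inclusive dates, all for the same email address.
    """
    problems = []
    ordered = sorted(ranges)
    for (f1, t1, p1), (f2, t2, p2) in zip(ordered, ordered[1:]):
        if f2 <= t1:
            # Next range starts before the previous one has ended.
            problems.append(("overlap", p1, p2))
        elif f2 > t1 + timedelta(days=1):
            # At least one day belongs to nobody.
            problems.append(("gap", p1, p2))
    return problems
```

A table keyed only on termination dates never needs this check, because
each cutoff hands the address directly to the next person.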

A more convoluted case would be where three persons share the same
emails in a rollover fashion, so the mailmap's simple name/email
handover becomes knotted and intertwined in the handovers
(Joe3->Joe2->Joe1).

P.


* Re: Dealing with corporate email recycling
  2022-03-14 11:56                     ` Philip Oakley
  2022-03-14 21:24                       ` Junio C Hamano
@ 2022-03-15  1:23                       ` Sean Allred
  2022-03-15 11:15                         ` Philip Oakley
  1 sibling, 1 reply; 28+ messages in thread
From: Sean Allred @ 2022-03-15  1:23 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Junio C Hamano, rsbecker, git, sallred, grmason, sconrad


Philip Oakley <philipoakley@iee.email> writes:
> A broader issue for the corporate email mailbox systems is those that
> are allocated to roles. So you may have Training1@corp.com thru
> Training9@corp.com (we had) and if that training includes practical low
> hanging fruit examples from a project, it's difficult to disambiguate
> those commits. More likely is say, having TestPC1 - TestPC9 that
> included debug commits, perhaps even with pair programming test & debug
> sessions, so allocation to individuals (rather than mailbox) becomes a
> real problem. Hopefully that's rare in Sean's case.

Yep, this wouldn't happen for us.  Lots of other processes depend on
there being an individual making the commit.

I'd also be surprised if this didn't cause process problems for other
folks, too.

--
Sean Allred


* Re: Dealing with corporate email recycling
  2022-03-12 22:38 Dealing with corporate email recycling Sean Allred
                   ` (3 preceding siblings ...)
  2022-03-13 17:22 ` brian m. carlson
@ 2022-03-15  1:27 ` Sean Allred
  2022-03-18 21:22 ` Peter Krefting
  5 siblings, 0 replies; 28+ messages in thread
From: Sean Allred @ 2022-03-15  1:27 UTC (permalink / raw)
  To: git; +Cc: sallred, grmason, sconrad, Philip Oakley, rsbecker, Junio C Hamano


(CC others who have been involved in the conversation; I started a
separate thread off the original post since this takes an entirely
separate direction.)

I wanted to provide a possible approach for other organizations that may
run into this issue in the future.  If you have the opportunity that we
do in rewriting history, we found a nice workaround by way of
'unique-ifying' the email address using the '+' syntax supported by most
email providers -- notably for us, we tested that it does work with
Exchange.

As I went through previously, the problem is that our organization
recycles emails like <Sean@corp.net>.  If the owner of <Sean@corp.net>
leaves, I have the opportunity to take ownership of <Sean@corp.net>
after a waiting period.  This presents the core problem: <Sean@corp.net>
could be two different people at two different points in time.

Our first approach to solving this problem was to collaborate with our
internal IT folks to see if we could additionally generate a totally
unique email address based on that employee's ID.  For various reasons
with which I'm not familiar enough to elaborate, this wasn't really
possible -- we could not be provided a totally static email address for
any given person.  After bouncing the problem around internally for a
while, we reached out to this list :-) and I think a lot of cool
conversation is happening that I want to continue!  It seems there is
still value in finding a generalizable solution.

---

As for solutions that are *not* necessarily generalizable, what we were
able to come up with is to use a 'unique-ified' email of the form
<Sean+314159@corp.net>, where 314159 might be my employee ID.  That way,
even if another Sean might inherit <Sean@corp.net>, they can never
inherit <Sean+314159@corp.net>.  The use of this unique-ified email
address will be enforced via pre-receive -- possibly checking for its
existence in the mailmap as well.
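
A minimal Python sketch of the check such a hook might perform, assuming
the policy is "local part, '+', numeric employee ID, at corp.net". The
pattern and the function names are hypothetical and only illustrate the
idea; they are not the hook we actually run.

```python
import re

# Assumed policy: <local>+<numeric employee ID>@corp.net
UNIQUE_EMAIL = re.compile(r"[A-Za-z0-9._-]+\+\d+@corp\.net", re.IGNORECASE)


def is_unique_ified(email):
    """True if `email` has the unique-ified form described above."""
    return UNIQUE_EMAIL.fullmatch(email) is not None


def rejected_emails(emails, mailmap_emails=None):
    """Return the emails a hook should reject, in input order.

    `mailmap_emails` is an optional set of lowercased commit emails known
    to the mailmap; when given, unknown addresses are rejected as well.
    """
    bad = []
    for email in emails:
        if not is_unique_ified(email):
            bad.append(email)
        elif mailmap_emails is not None and email.lower() not in mailmap_emails:
            bad.append(email)
    return bad
```

A real pre-receive hook would read the "<old> <new> <ref>" lines from
stdin, collect committer emails for the pushed range (for example with
`git log --format=%ce`), and exit non-zero if this list is non-empty.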

Using this unique-ified email address, we can construct a mailmap like
the following:

    Sean Allred <Sean@corp.net> <Sean+314159@corp.net>

This uses the already-built functionality of gitmailmap to map the
committer email <Sean+314159@corp.net> to the friendlier
<Sean@corp.net> while still maintaining the requirement (for us) that
the email on the commit be a real, usable email.

When I eventually get hit by a bus and give up <Sean@corp.net>, the
mailmap can be updated to

    Sean Allred <devnull@corp.net> <Sean+314159@corp.net>
    Sean Allblue <Sean@corp.net> <Sean+271828@corp.net>

(possibly using some support list instead of <devnull@corp.net>).
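
To show the effect of that update, here is a small Python sketch that
mimics the lookup for this one mailmap line shape only
("Proper Name <proper@email> <commit@email>"). It is a simplified
stand-in written for illustration, not Git's mailmap parser, and the
function names are hypothetical.

```python
import re

MAILMAP = """\
Sean Allred <devnull@corp.net> <Sean+314159@corp.net>
Sean Allblue <Sean@corp.net> <Sean+271828@corp.net>
"""

# Only the "Proper Name <proper@email> <commit@email>" form is handled.
LINE = re.compile(
    r"^(?P<name>[^<]+?)\s*<(?P<proper>[^>]+)>\s*<(?P<commit>[^>]+)>\s*$"
)


def load(text):
    """Build a lookup table keyed on the lowercased commit email."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m:
            table[m["commit"].lower()] = (m["name"], m["proper"])
    return table


def display(table, name, email):
    """Return the (name, email) to show for a commit's recorded identity."""
    return table.get(email.lower(), (name, email))
```

Because the key is the unique-ified address, the old commits keep
pointing at the right person even after <Sean@corp.net> changes hands.
On a real repository, `git check-mailmap` is the way to see what Git
itself resolves.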

This has proven itself in a few informal design chats already and is
going to be the approach we take for our replay.  I look forward to a
future situation where we might be able to use SSH key validity periods
instead :-) but that functionality will surely only be available after
our migration in a few months' time.

I hope this helps others!

--
Sean Allred


* Re: Dealing with corporate email recycling
  2022-03-15  1:23                       ` Sean Allred
@ 2022-03-15 11:15                         ` Philip Oakley
  0 siblings, 0 replies; 28+ messages in thread
From: Philip Oakley @ 2022-03-15 11:15 UTC (permalink / raw)
  To: Sean Allred; +Cc: Junio C Hamano, rsbecker, git, sallred, grmason, sconrad

On 15/03/2022 01:23, Sean Allred wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>> A broader issue for the corporate email mailbox systems is those that
>> are allocated to roles. So you may have Training1@corp.com thru
>> Training9@corp.com (we had) and if that training includes practical low
>> hanging fruit examples from a project, it's difficult to disambiguate
>> those commits. More likely is say, having TestPC1 - TestPC9 that
>> included debug commits, perhaps even with pair programming test & debug
>> sessions, so allocation to individuals (rather than mailbox) becomes a
>> real problem. Hopefully that's rare in Sean's case.
> Yep, this wouldn't happen for us.  Lots of other processes depend on
> there being an individual making the commit.
>
> I'd also be surprised if this didn't cause process problems for other
> folks, too.
I was in equipment engineering, where independent dedicated test
equipment was more common than pure software, so role-based commits
from TestPC1 - TestPC9 were far more likely (for cases where it's the
hardware that needs to be understood, not the engineers' choice of
code) ;-)


* Re: Dealing with corporate email recycling
  2022-03-12 22:38 Dealing with corporate email recycling Sean Allred
                   ` (4 preceding siblings ...)
  2022-03-15  1:27 ` Sean Allred
@ 2022-03-18 21:22 ` Peter Krefting
  5 siblings, 0 replies; 28+ messages in thread
From: Peter Krefting @ 2022-03-18 21:22 UTC (permalink / raw)
  To: Sean Allred; +Cc: git, sallred, grmason, sconrad

On Sat, 12 Mar 2022, Sean Allred wrote:

> We are currently replaying a 15-year SVN history into Git -- with
> contributions from thousands of developers -- and are faced with the
> challenge of corporate email recycling, departures, re-hires, and name
> changes causing identity issues.

I have performed a couple of imports of old version history into Git, from
various version control systems, some of them with history dating to before
the corporation even had e-mail addresses for employees. In those cases I
found that the easiest option was just to use whatever user identification
was available in the old version control system -- Git does not explicitly
require a valid e-mail address in the author and committer header.

For Subversion import, for instance, I used "Name <login>" where "login" was
the Subversion committer ID, and "Name" was from a mapping file I created
for the repository. Where records were sketchy and Name information was not
available, I would just use "<login>".
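
As a minimal sketch of that scheme in Python (the mapping contents and
the function name are hypothetical, not the scripts I actually used):

```python
# Hypothetical mapping file contents: Subversion login -> real name.
NAMES = {"jdoe": "Jane Doe"}


def ident(login, names):
    """Build a 'Name <login>' identity, or just '<login>' if no name is known."""
    name = names.get(login)
    return f"{name} <{login}>" if name else f"<{login}>"
```

The resulting string goes where the importer expects the author or
committer identity.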

When it comes to name changes, I have had scripts map login + date to name.
For instance, I changed my last name when I married, so I would have my old
(I don't know what the masculine equivalent of "maiden name" is in English)
mapped up until a specific date, and my current name afterwards.

-- 
\\// Peter


end of thread, other threads:[~2022-03-18 21:29 UTC | newest]

Thread overview: 28+ messages
2022-03-12 22:38 Dealing with corporate email recycling Sean Allred
2022-03-13  0:03 ` Junio C Hamano
2022-03-13  0:26   ` rsbecker
2022-03-13 14:01     ` Sean Allred
2022-03-13 14:20       ` rsbecker
2022-03-13 14:41         ` Sean Allred
2022-03-13 15:02           ` rsbecker
2022-03-13 15:21             ` Sean Allred
2022-03-13 19:57               ` Philip Oakley
2022-03-13 22:40                 ` Sean Allred
2022-03-13 23:16                   ` Junio C Hamano
2022-03-13 23:23                     ` rsbecker
2022-03-14  0:19                       ` Junio C Hamano
2022-03-14 11:56                     ` Philip Oakley
2022-03-14 21:24                       ` Junio C Hamano
2022-03-14 22:25                         ` Philip Oakley
2022-03-15  1:23                       ` Sean Allred
2022-03-15 11:15                         ` Philip Oakley
2022-03-13 12:20 ` Philip Oakley
2022-03-13 13:35   ` Sean Allred
2022-03-14 11:59     ` Philip Oakley
2022-03-13 15:51 ` Ævar Arnfjörð Bjarmason
2022-03-13 17:22 ` brian m. carlson
2022-03-13 17:52   ` rsbecker
2022-03-13 19:47     ` rsbecker
2022-03-13 22:23       ` Sean Allred
2022-03-15  1:27 ` Sean Allred
2022-03-18 21:22 ` Peter Krefting
