All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] mailsplit: Remove any '>' characters used to escape From_ lines in mbox.
       [not found] <87hbldjo0s.fsf@yoom.home.cworth.org>
@ 2010-06-08 20:02 ` Carl Worth
  2010-06-08 20:02   ` [PATCH 2/2] Add test from From_-line escaping Carl Worth
  2010-06-08 20:47 ` Make "git am" properly unescape lines matching ">>*From " Carl Worth
  2010-06-08 20:50 ` H. Peter Anvin
  2 siblings, 1 reply; 8+ messages in thread
From: Carl Worth @ 2010-06-08 20:02 UTC (permalink / raw)
  To: git; +Cc: Carl Worth

In order to encode an email message in an mbox, a client must notice any
lines in the email body that look like so-called From_ lines, (that is
lines begin with "From "), and add a preceding '>' character.

>From Jonathan de Boyne Pollard[*] we learn of two long-standing (since 1995
at least) conventions used for this escaping. The original "mboxo" format
does only the escaping described above, which leads to unavoidable
corruption of some messages. The newer "mboxrd" format also adds a '>' to
any line originally beginning with one or more '>' characters followed by
"From ". This ensures that the original email can be extracted without
corruption.

Git wasn't formerly un-escaping these lines in any case, so invocations of
"git am" would lead to errant '>' characters in the commit message. Here,
we now fix git-mailsplit to perform the necessary un-escaping. We assume
mboxrd format, since designing for the original mboxo format would
guarantee corruption in at least some cases.

[*] http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/mail-mbox-formats.html

Signed-off-by: Carl Worth <cworth@cworth.org>
---
 builtin/mailsplit.c |   26 +++++++++++++++++++++++++-
 1 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/builtin/mailsplit.c b/builtin/mailsplit.c
index cdfc1b7..a3fb9f7 100644
--- a/builtin/mailsplit.c
+++ b/builtin/mailsplit.c
@@ -46,6 +46,30 @@ static int is_from_line(const char *line, int len)
 static struct strbuf buf = STRBUF_INIT;
 static int keep_cr;
 
+/* Write the line in 'buf' to 'output', but if we are splitting an mbox,
+ * then remove the first '>' from any line that begins with one or more
+ * '>' characters followed by "From ".
+ *
+ * Return 0 if successful, 1 for any write error.
+ */
+static int write_buf_unescaping(FILE *output, int is_mbox)
+{
+	const char *line = buf.buf;
+	size_t len = buf.len;
+
+	if (is_mbox && *line == '>') {
+		const char *s = line;
+		while (*s == '>')
+			s++;
+		if (strncmp (s, "From ", 5) == 0) {
+			line = line + 1;
+			len = len - 1;
+		}
+	}
+
+	return fwrite(line, 1, len, output) != len;
+}
+
 /* Called with the first line (potentially partial)
  * already in buf[] -- normally that should begin with
  * the Unix "From " line.  Write it into the specified
@@ -76,7 +100,7 @@ static int split_one(FILE *mbox, const char *name, int allow_bare)
 			strbuf_addch(&buf, '\n');
 		}
 
-		if (fwrite(buf.buf, 1, buf.len, output) != buf.len)
+		if (write_buf_unescaping(output, !is_bare))
 			die_errno("cannot write output");
 
 		if (strbuf_getwholeline(&buf, mbox, '\n')) {
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] Add test from From_-line escaping.
  2010-06-08 20:02 ` [PATCH 1/2] mailsplit: Remove any '>' characters used to escape From_ lines in mbox Carl Worth
@ 2010-06-08 20:02   ` Carl Worth
  0 siblings, 0 replies; 8+ messages in thread
From: Carl Worth @ 2010-06-08 20:02 UTC (permalink / raw)
  To: git; +Cc: Carl Worth

As implemented in the previous commit. We test that when applying from an
mbox that all escaped From_ lines are properly unescaped. We also test that
when applying from an email message the unescaping does not occur.

Signed-off-by: Carl Worth <cworth@cworth.org>
---
 t/t4152-am-From_.sh |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 64 insertions(+), 0 deletions(-)
 create mode 100755 t/t4152-am-From_.sh

diff --git a/t/t4152-am-From_.sh b/t/t4152-am-From_.sh
new file mode 100755
index 0000000..02821ee
--- /dev/null
+++ b/t/t4152-am-From_.sh
@@ -0,0 +1,64 @@
+#!/bin/sh
+
+test_description='git am properly unescaping From_ lines'
+
+. ./test-lib.sh
+
+cat >msg <<EOF
+From_ lines
+
+This is a commit message that contains a From_ line, which is line
+that begins with the characters "From ". Get ready for it, now...
+From this time forward, we'll have no From_-line bugs.
+
+Additionally, we'll also test lines that are escaped versions of From_
+lines. These are lines that begin with one or more '>' characters that
+are then followed by the characters "From ". We want to ensure that
+none of these intentional '>' characters get swallowed. Let's try that
+with three variations, (with 1, 2, and 3 leading '>' characters):
+
+>From now on (with one leading '>')
+>>From there to here (with two leading '>' characters)
+>>>From Here to Eternity (with three leading '>' characters)
+
+EOF
+
+test_expect_success setup '
+	echo hello >file &&
+	git add file &&
+	test_tick &&
+	git commit -m first &&
+	git tag first &&
+	echo world >>file &&
+	git add file &&
+	test_tick &&
+	git commit -s -F msg &&
+	git tag second &&
+	git format-patch --stdout first | sed -e "1{p;d};s/^\(>*From \)/>\1/" > From_ &&
+	{
+		echo "X-Fake-Field: Line One" &&
+		echo "X-Fake-Field: Line Two" &&
+		echo "X-Fake-Field: Line Three" &&
+		git format-patch --stdout first | sed -e "1d"
+	} > From_.eml
+'
+
+test_expect_success 'am unescapes From_ lines from mbox' '
+	git checkout first &&
+	git am From_ &&
+	! test -d .git/rebase-apply &&
+	test -z "$(git diff second)" &&
+	test "$(git rev-parse second)" = "$(git rev-parse HEAD)" &&
+	test "$(git rev-parse second^)" = "$(git rev-parse HEAD^)"
+'
+
+test_expect_success 'am does not unescape From_ lines from email' '
+	git checkout first &&
+	git am From_.eml &&
+	! test -d .git/rebase-apply &&
+	test -z "$(git diff second)" &&
+	test "$(git rev-parse second)" = "$(git rev-parse HEAD)" &&
+	test "$(git rev-parse second^)" = "$(git rev-parse HEAD^)"
+'
+
+test_done
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Make "git am" properly unescape lines matching ">>*From "
       [not found] <87hbldjo0s.fsf@yoom.home.cworth.org>
  2010-06-08 20:02 ` [PATCH 1/2] mailsplit: Remove any '>' characters used to escape From_ lines in mbox Carl Worth
@ 2010-06-08 20:47 ` Carl Worth
  2010-06-08 20:54   ` H. Peter Anvin
  2010-06-08 20:50 ` H. Peter Anvin
  2 siblings, 1 reply; 8+ messages in thread
From: Carl Worth @ 2010-06-08 20:47 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 2620 bytes --]

On Tue, 08 Jun 2010 12:57:23 -0700, Carl Worth <cworth@cworth.org> wrote:
> I'm adding support to notmuch[1] to more easily pipe a thread full of
> But I noticed that "git am" wasn't removing any of these added '>'
> characters, so I was getting corrupted commit messages.

I've also noticed that format-patch is generating bogus mbox files
without any escaping. (The only way it gets away with this is that
mailsplit only treats "From " lines as separators if they end with
something that looks quite a bit like the output of asctime.)

This does mean that without changing format-patch, the patched "git am"
could corrupt a commit message. This could happen if the commit message
originally contained a line matching "^From " which would previously be
passed through directly but will now be un-escaped to "From ".

This does seem less likely than a message containing a line matching
"^From " (which is the case that gets corrupted with an unpatched "git
am") so one option would be to ignore this, and apply my patch. That's
what I recommend for now.

Alternately, we could fix format-patch to add the correct, (and
reversible), escaping that is now expected by git-am.

Any attempt to add escaping to format-patch should recognize that many
users use the output of format-patch directly as content handed to their
MUA. Such users will *not* want escaping, (they are effectively treating
the format-patch output as a bare email message, not an mbox).

So if someone were to attempt this, I'd suggest first changing
format-patch to actually generate bare email messages when generating
files containing only a single message. This is instead of the invalid
mbox files it is generating now. This would be as simple as not emitting
the initial "From " line.

Then, when generating an actual mbox with multiple files, format-patch
should do the correct escaping, (which is now expected by "git am"), and
all of these cases of potential commit-message corruption should be
eliminated.

The other thing that would need to be fixed in this approach is to fix
"git send-email" to do the right thing with a bare email message. From a
quick glance at the code, it appears to be looking for an initial "From
" line, even though it doesn't appear to handle an mbox with multiple
messages. It looks for this line to distinguish an email message from
some custom "send lots of email" format. It should be simple to instead
distinguish a bare email message from the "send lots of email" format by
a first line which looks like an email header.

-Carl

-- 
carl.d.worth@intel.com

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Make "git am" properly unescape lines matching ">>*From "
       [not found] <87hbldjo0s.fsf@yoom.home.cworth.org>
  2010-06-08 20:02 ` [PATCH 1/2] mailsplit: Remove any '>' characters used to escape From_ lines in mbox Carl Worth
  2010-06-08 20:47 ` Make "git am" properly unescape lines matching ">>*From " Carl Worth
@ 2010-06-08 20:50 ` H. Peter Anvin
  2010-06-08 21:52   ` Carl Worth
  2 siblings, 1 reply; 8+ messages in thread
From: H. Peter Anvin @ 2010-06-08 20:50 UTC (permalink / raw)
  To: Carl Worth; +Cc: git, Junio C Hamano

On 06/08/2010 12:57 PM, Carl Worth wrote:
> I'm adding support to notmuch[1] to more easily pipe a thread full of
> patches to "git am". So I added support for notmuch to format a thread
> (or any search) as an mbox.
> 
> When I did that, I was careful to escape lines from the bodies of email
> messages that begin with zero or more '>' characters followed
> immediately by "From " (From_ lines) by adding an initial '>'. [2]
> 
> But I noticed that "git am" wasn't removing any of these added '>'
> characters, so I was getting corrupted commit messages.
> 
> I'll follow up this message with a patch that fixes that by making
> git-mailsplit un-escape these lines. It's careful to do this only when
> processing an actual mbox, using the existing detection of a bare email
> message and not doing any un-escaping in that case.
> 
> I'll also follow up with a new test for both cases, (using "git am" with
> both an mbox with escaped From_ lines and an email message without
> escaped From_ lines).
> 

The problem with that is that it is not universally applied.  For what
I've seen, some mbox-based programs simply rely on there being a
Content-Length: header and don't need From lines to be escaped at all
(and don't do anything useful if they are), some do the leading > trick
(usually not reversably at all).

As far as I can tell, the Content-Length: is the most reliably handled
format and probably is what we should use.  This is the "mboxcl2" format
in your list.[*]  Unfortunately "mboxcl2" and "mboxrd" cannot be
distinguished from each other by inspection, which is a major defect of
both formats.

The statement that "the entire "mbox" family of mailbox formats is
gradually becoming irrelevant, and of only historical interest" is also
pretty silly -- mbox is still the preferred format for moving groups of
email from MUA to MUA, even if it is no longer used for active live
spool storage.  But, of course, you knew that already.

	-hpa

[*] There are apparently some MTA/MUAs which simply bypass the entire
problem by base64-encoding any email that contains /^From /, just as if
it contained NUL bytes.  It's a heavyweight, but thoroughly unambiguous
way of dealing with the problem.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Make "git am" properly unescape lines matching ">>*From "
  2010-06-08 20:47 ` Make "git am" properly unescape lines matching ">>*From " Carl Worth
@ 2010-06-08 20:54   ` H. Peter Anvin
  2010-06-08 21:30     ` Carl Worth
  0 siblings, 1 reply; 8+ messages in thread
From: H. Peter Anvin @ 2010-06-08 20:54 UTC (permalink / raw)
  To: Carl Worth; +Cc: git, Junio C Hamano

On 06/08/2010 01:47 PM, Carl Worth wrote:
> On Tue, 08 Jun 2010 12:57:23 -0700, Carl Worth <cworth@cworth.org> wrote:
>> I'm adding support to notmuch[1] to more easily pipe a thread full of
>> But I noticed that "git am" wasn't removing any of these added '>'
>> characters, so I was getting corrupted commit messages.
> 
> I've also noticed that format-patch is generating bogus mbox files
> without any escaping. (The only way it gets away with this is that
> mailsplit only treats "From " lines as separators if they end with
> something that looks quite a bit like the output of asctime.)
> 

At the same time, it would be a fairly major lose to not be able to
generate individual messages easily.  I have personally considered the
fact that git format-patch produces something-vaguely-like mboxes rather
than individual plain RFC 2822 messages to be a bug; fixable by "tail"
but annoying.

	-hpa

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Make "git am" properly unescape lines matching ">>*From "
  2010-06-08 20:54   ` H. Peter Anvin
@ 2010-06-08 21:30     ` Carl Worth
  0 siblings, 0 replies; 8+ messages in thread
From: Carl Worth @ 2010-06-08 21:30 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: git, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1144 bytes --]

On Tue, 08 Jun 2010 13:54:32 -0700, "H. Peter Anvin" <hpa@zytor.com> wrote:
> On 06/08/2010 01:47 PM, Carl Worth wrote:
> > I've also noticed that format-patch is generating bogus mbox files
> > without any escaping. (The only way it gets away with this is that
> > mailsplit only treats "From " lines as separators if they end with
> > something that looks quite a bit like the output of asctime.)
> 
> At the same time, it would be a fairly major lose to not be able to
> generate individual messages easily.  I have personally considered the
> fact that git format-patch produces something-vaguely-like mboxes rather
> than individual plain RFC 2822 messages to be a bug; fixable by "tail"
> but annoying.

I totally agree. I said as much later on in the message. We should fix
format-patch to not emit the "From " line when generating files for
individual messages, (and we should fix send-email to accept such a bare
file).

That much is easy to agree on since it involves using mbox less, so the
whole "which mbox format to use?" question goes away, (for these uses at
least).

-Carl

-- 
carl.d.worth@intel.com

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Make "git am" properly unescape lines matching ">>*From "
  2010-06-08 20:50 ` H. Peter Anvin
@ 2010-06-08 21:52   ` Carl Worth
  2010-06-08 22:10     ` H. Peter Anvin
  0 siblings, 1 reply; 8+ messages in thread
From: Carl Worth @ 2010-06-08 21:52 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: git, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 3718 bytes --]

On Tue, 08 Jun 2010 13:50:08 -0700, "H. Peter Anvin" <hpa@zytor.com> wrote:
> On 06/08/2010 12:57 PM, Carl Worth wrote:
> > When I did that, I was careful to escape lines from the bodies of email
> > messages that begin with zero or more '>' characters followed
> > immediately by "From " (From_ lines) by adding an initial '>'. [2]
...
> The problem with that is that it is not universally applied.

Right. And since I can't fix this universe, I'd like to at least start
with getting notmuch and git to use the same thing. Currently, git is
using a non-standard not-quite-safe mbox format while notmuch doesn't
yet emit anything like mbox. So we have a nice opportunity to fix these
two projects to at least work well together, (if we can agree on a
format).

> As far as I can tell, the Content-Length: is the most reliably handled
> format and probably is what we should use.  This is the "mboxcl2" format
> in your list.[*]  Unfortunately "mboxcl2" and "mboxrd" cannot be
> distinguished from each other by inspection, which is a major defect of
> both formats.

What do you mean by "most reliably handled format"?

Of the four mbox formats listed on the page I cited[*], "mboxo" and
"mboxcl" are easy to discard as they both irreversibly corrupt messages.

That leaves both "mboxrd" and "mboxcl2" as candidates. Either of these
formats is reliable if both the reader and writer use the same
format. When the reader and writer don't agree, then there are problems
as follows ("W:" indicates writing, "R:" indicates reading expecting a
particular format):

W:mboxrd  then R:mboxcl2 -> Reader may corrupt by failing to remove '>'
			    Reader must give up/guess without CL headers
			    Guessing is at least unlikely to mis-split messages

W:mboxcl2 then R:mboxrd  -> Reader may corrupt by erroneously removing '>'
			    Reader may mis-split messages on "From " in content

I preferred to implement mboxrd over mboxcl2 for several reasons:

  1. The mboxrd writer implementation is much simpler. This format
     affords a simple streaming implementation where mboxcl2 requires
     knowing the length of the message in advance.

  2. The mboxrd format is robust in the face of file changes that
     invalidate the Content-Length headers, (for example, a person
     can hand-edit an mboxrd file without invalidating it, but cannot do
     the same with an mboxcl2 file).

  3. The mboxrd reader implementation is much simpler. An mboxcl2 reader
     necessarily has special-cases that an mboxrd implementation does
     not. What to do if there is no Content-Length header? What to do if
     the Content-Length header appears wrong? etc. Recovery code for
     these cases might well be to fallback to something like an mboxrd
     implementation, which demonstrates the increased complexity here.

As can be seen in my patch, doing an mboxrd reader in git-mailsplit was
quite simple. An mboxcl2 reader would be quite a bit more complicated,
but with no actual benefit in reliability, (assuming that the reader
matches the writer).

> The statement that "the entire "mbox" family of mailbox formats is
> gradually becoming irrelevant, and of only historical interest" is also
> pretty silly -- mbox is still the preferred format for moving groups of
> email from MUA to MUA, even if it is no longer used for active live
> spool storage.  But, of course, you knew that already.

Indeed. Though I was surprised to recently find that postfix does still
by default deliver to /var/mail/$user in "mboxo" format (ugh).

-Carl

[*] http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/mail-mbox-formats.html

-- 
carl.d.worth@intel.com

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Make "git am" properly unescape lines matching ">>*From "
  2010-06-08 21:52   ` Carl Worth
@ 2010-06-08 22:10     ` H. Peter Anvin
  0 siblings, 0 replies; 8+ messages in thread
From: H. Peter Anvin @ 2010-06-08 22:10 UTC (permalink / raw)
  To: Carl Worth; +Cc: git, Junio C Hamano

On 06/08/2010 02:52 PM, Carl Worth wrote:
> 
> What do you mean by "most reliably handled format"?
> 

I have to say there is definitely part of me that thinks that using
base64 or quoted-unprintable for "^From "-containing mails might really
be the best solution... as much as I normally hate that crap.

	-hpa

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-06-08 22:12 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <87hbldjo0s.fsf@yoom.home.cworth.org>
2010-06-08 20:02 ` [PATCH 1/2] mailsplit: Remove any '>' characters used to escape From_ lines in mbox Carl Worth
2010-06-08 20:02   ` [PATCH 2/2] Add test from From_-line escaping Carl Worth
2010-06-08 20:47 ` Make "git am" properly unescape lines matching ">>*From " Carl Worth
2010-06-08 20:54   ` H. Peter Anvin
2010-06-08 21:30     ` Carl Worth
2010-06-08 20:50 ` H. Peter Anvin
2010-06-08 21:52   ` Carl Worth
2010-06-08 22:10     ` H. Peter Anvin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.