From: Eric Blake <eblake@redhat.com>
To: Markus Armbruster <armbru@redhat.com>, qemu-devel@nongnu.org
Cc: marcandre.lureau@redhat.com, mdroth@linux.vnet.ibm.com
Subject: Re: [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors
Date: Mon, 27 Aug 2018 12:18:42 -0500 [thread overview]
Message-ID: <4550a64e-1218-2a58-bed1-f0a9f0efa30b@redhat.com> (raw)
In-Reply-To: <20180827070021.11931-5-armbru@redhat.com>
On 08/27/2018 02:00 AM, Markus Armbruster wrote:
> When the lexer chokes on an input character, it consumes the
> character, emits a JSON error token, and enters its start state. This
> can lead to suboptimal error recovery. For instance, input
>
> 0123 ,
>
> produces the tokens
>
> JSON_ERROR 01
> JSON_INTEGER 23
> JSON_COMMA ,
>
> Make the lexer skip characters after a lexical error until a
> structural character ('[', ']', '{', '}', ':', ','), an ASCII control
> character, or '\xFE', or '\xFF'.
>
> Note that we must not skip ASCII control characters, '\xFE', '\xFF',
> because those are documented to force the JSON parser into known-good
> state, by docs/interop/qmp-spec.txt.
>
> The lexer now produces
>
> JSON_ERROR 01
> JSON_COMMA ,
So the lexer has now completely skipped the intermediate input, but the
resulting error message need only point at the start of where input went
wrong, and skipping to a sane point results in fewer error tokens to be
reported. Makes sense.
>
> Update qmp-test for the nicer error recovery: QMP now report just one
s/report/reports/
> error for input %p instead of two. Also drop the newline after %p; it
> was needed to tease out the second error.
That's because pre-patch, 'p' is one of the input characters that
requires lookahead to determine if it forms a complete token (and the
newline provides the transition needed to consume it); now post-patch,
the 'p' is consumed as part of the junk after the error is first
detected at the '%'.
And to my earlier complaint about 0x1a resulting in JSON_ERROR then
JSON_INTEGER then JSON_KEYWORD, that sequence is likewise now identified
as a single JSON_ERROR at the 'x', with the rest of the attempted hex
number (invalid in JSON) silently skipped. Nice.
>
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
> qobject/json-lexer.c | 43 +++++++++++++++++++++++++++++--------------
> tests/qmp-test.c | 6 +-----
> 2 files changed, 30 insertions(+), 19 deletions(-)
>
> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
> index 28582e17d9..39c7ce7adc 100644
> --- a/qobject/json-lexer.c
> +++ b/qobject/json-lexer.c
> @@ -101,6 +101,7 @@
>
> enum json_lexer_state {
> IN_ERROR = 0, /* must really be 0, see json_lexer[] */
> + IN_RECOVERY,
> IN_DQ_STRING_ESCAPE,
> IN_DQ_STRING,
> IN_SQ_STRING_ESCAPE,
> @@ -130,6 +131,28 @@ QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
> static const uint8_t json_lexer[][256] = {
> /* Relies on default initialization to IN_ERROR! */
>
> + /* error recovery */
> + [IN_RECOVERY] = {
> + /*
> + * Skip characters until a structural character, an ASCII
> + * control character other than '\t', or impossible UTF-8
> + * bytes '\xFE', '\xFF'. Structural characters and line
> + * endings are promising resynchronization points. Clients
> + * may use the others to force the JSON parser into known-good
> + * state; see docs/interop/qmp-spec.txt.
> + */
> + [0 ... 0x1F] = IN_START | LOOKAHEAD,
And here's where the LOOKAHEAD bit has to be separate, because you are
now assigning semantics to the transition on '\0' that are distinct from
either of the two roles previously enumerated as possible.
> + [0x20 ... 0xFD] = IN_RECOVERY,
> + [0xFE ... 0xFF] = IN_START | LOOKAHEAD,
> + ['\t'] = IN_RECOVERY,
> + ['['] = IN_START | LOOKAHEAD,
> + [']'] = IN_START | LOOKAHEAD,
> + ['{'] = IN_START | LOOKAHEAD,
> + ['}'] = IN_START | LOOKAHEAD,
> + [':'] = IN_START | LOOKAHEAD,
> + [','] = IN_START | LOOKAHEAD,
> + },
It took me a couple of reads before I was satisfied that everything is
initialized as desired (range assignments followed by more-specific
re-assignment works, but isn't common), but this looks right.
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
next prev parent reply other threads:[~2018-08-27 17:22 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-08-27 7:00 [Qemu-devel] [PATCH 0/6] json: More fixes, error reporting improvements, cleanups Markus Armbruster
2018-08-27 7:00 ` [Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F' Markus Armbruster
2018-08-27 16:50 ` Eric Blake
2018-08-28 4:28 ` Markus Armbruster
2018-08-27 7:00 ` [Qemu-devel] [PATCH 2/6] json: Clean up how lexer consumes "end of input" Markus Armbruster
2018-08-27 16:58 ` Eric Blake
2018-08-28 4:28 ` Markus Armbruster
2018-08-27 7:00 ` [Qemu-devel] [PATCH 3/6] json: Make lexer's "character consumed" logic less confusing Markus Armbruster
2018-08-27 17:04 ` Eric Blake
2018-08-27 7:00 ` [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors Markus Armbruster
2018-08-27 17:18 ` Eric Blake [this message]
2018-08-28 4:35 ` Markus Armbruster
2018-08-27 7:00 ` [Qemu-devel] [PATCH 5/6] json: Eliminate lexer state IN_ERROR Markus Armbruster
2018-08-27 17:20 ` Eric Blake
2018-08-27 17:29 ` Eric Blake
2018-08-28 4:40 ` Markus Armbruster
2018-08-28 15:01 ` Eric Blake
2018-08-28 15:04 ` Eric Blake
2018-08-31 7:08 ` Markus Armbruster
2018-08-31 7:06 ` Markus Armbruster
2018-08-27 7:00 ` [Qemu-devel] [PATCH 6/6] json: Eliminate lexer state IN_WHITESPACE, pseudo-token JSON_SKIP Markus Armbruster
2018-08-27 17:25 ` Eric Blake
2018-08-28 4:41 ` Markus Armbruster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4550a64e-1218-2a58-bed1-f0a9f0efa30b@redhat.com \
--to=eblake@redhat.com \
--cc=armbru@redhat.com \
--cc=marcandre.lureau@redhat.com \
--cc=mdroth@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.