Iterative parsing by kddnewton · Pull Request #989 · ruby/json

kddnewton · 2026-06-01T17:47:17Z

Switch the recursive descent to an iterative algorithm by keeping a stack of "frames". Each frame represents parsing a container (the points at which you would recurse). Within frames there are "phases"; each phase effectively maps to the token that the frame is looking for next.

This does not get all of the way to streaming parsing (that would require partial-token support), but it gets most of the way there, and is all required pre-work.

On my machine there's no noticeable slow-down. (There's also no noticeable speed-up.) The only really noticeable side-effect is that if you pass max_nesting: false, this will not crash anymore from running out of stack space.
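For readers who want to see the shape of the idea, here is a minimal sketch of a frame stack with phases. It is not the code from this PR: every `sk_` name is hypothetical, it only measures nesting depth instead of building Ruby values, and it assumes a JSON subset (arrays, objects, integers, and strings without escapes). The phase names follow the VALUE/KEY/COLON/COMMA/DONE list from the review overview.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef enum { SK_ROOT, SK_ARRAY, SK_OBJECT } sk_type;
typedef enum { SK_VALUE, SK_KEY, SK_COLON, SK_COMMA, SK_DONE } sk_phase;
typedef struct { sk_type type; sk_phase phase; } sk_frame;

static const char *sk_skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') p++;
    return p;
}

/* p points at an opening quote; returns the byte after the closing quote,
 * or NULL if the string is unterminated. Escapes are not handled. */
static const char *sk_scan_string(const char *p)
{
    for (p++; *p != '\0' && *p != '"'; p++) {}
    return *p == '"' ? p + 1 : NULL;
}

/* Returns the maximum container depth of the input, or -1 on a syntax error. */
long sk_max_depth(const char *p)
{
    size_t cap = 8, len = 0;
    long max_depth = 0;
    sk_frame *stack = malloc(cap * sizeof *stack);
    if (stack == NULL) return -1;
    stack[len++] = (sk_frame){ SK_ROOT, SK_VALUE };

    for (;;) {
        sk_frame *top = &stack[len - 1];
        p = sk_skip_ws(p);
        switch (top->phase) {
        case SK_VALUE:
            /* Whatever value comes next, this frame then expects a
             * separator (or, for the root frame, the end of input). */
            top->phase = top->type == SK_ROOT ? SK_DONE : SK_COMMA;
            if (*p == '[' || *p == '{') {
                sk_type type = *p == '[' ? SK_ARRAY : SK_OBJECT;
                if ((long)len > max_depth) max_depth = (long)len;
                p = sk_skip_ws(p + 1);
                if (*p == (type == SK_ARRAY ? ']' : '}')) { p++; break; } /* empty */
                if (len == cap) {
                    sk_frame *grown = realloc(stack, (cap *= 2) * sizeof *stack);
                    if (grown == NULL) goto fail;
                    stack = grown;
                }
                /* This push is where a recursive parser would call itself. */
                stack[len++] = (sk_frame){ type, type == SK_ARRAY ? SK_VALUE : SK_KEY };
            } else if (*p == '"') {
                if ((p = sk_scan_string(p)) == NULL) goto fail;
            } else {
                if (*p == '-') p++;
                if (!isdigit((unsigned char)*p)) goto fail;
                while (isdigit((unsigned char)*p)) p++;
            }
            break;
        case SK_KEY:
            if (*p != '"' || (p = sk_scan_string(p)) == NULL) goto fail;
            top->phase = SK_COLON;
            break;
        case SK_COLON:
            if (*p++ != ':') goto fail;
            top->phase = SK_VALUE;
            break;
        case SK_COMMA:
            if (*p == ',') {
                top->phase = top->type == SK_ARRAY ? SK_VALUE : SK_KEY;
            } else if (*p == (top->type == SK_ARRAY ? ']' : '}')) {
                len--; /* pop: the parent resumes in the phase it stored */
            } else {
                goto fail;
            }
            p++;
            break;
        case SK_DONE:
            if (*p != '\0') goto fail;
            free(stack);
            return max_depth;
        }
    }

fail:
    free(stack);
    return -1;
}

/* Builds n nested arrays ("[[[...]]]") on the heap and measures them. */
long sk_depth_of_nested_arrays(size_t n)
{
    char *src = malloc(2 * n + 1);
    if (src == NULL) return -1;
    memset(src, '[', n);
    memset(src + n, ']', n);
    src[2 * n] = '\0';
    long depth = sk_max_depth(src);
    free(src);
    return depth;
}
```

With this sketch, `sk_max_depth("[1, [2, {\"a\": [3]}]]")` returns 4 and `sk_depth_of_nested_arrays(1000000)` returns 1000000, because nesting only grows the heap-allocated frame array and never the C call stack.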


byroot · 2026-06-01T19:44:36Z

> is that if you pass max_nesting: false, this will not crash anymore from running out of stack space.

Yep. That's why I toyed with this idea.

Your implementation is extremely close to what I had in mind. Looks pretty good at first glance, but I'll need to find some block of time to review this carefully.

Thanks a lot!

kou · 2026-06-02T01:05:19Z

Great! Could you share benchmark results with/without this? We can use https://github.com/ruby/json/tree/master/benchmark .

Copilot

Pull request overview

This PR refactors the C extension JSON parser from a recursive-descent implementation to an iterative state machine driven by an explicit “frame stack” of container parse states. This is foundational work toward streaming-style parsing and also avoids C call-stack exhaustion when parsing very deeply nested JSON with max_nesting disabled.

Changes:

- Introduces a new json_frame_stack (with spill-to-heap and Ruby TypedData lifecycle management) to track container parsing state without recursion.
- Replaces the recursive json_parse_any implementation with an iterative loop using per-frame “phases” (VALUE/KEY/COLON/COMMA/DONE) to drive parsing.
- Wires the frame stack into JSON_ParserState and initializes/cleans it up in cParser_parse.
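The "spill-to-heap" behavior in the first item can be pictured with a sketch like the one below. This is an illustration under assumptions, not the PR's json_frame_stack: the `sketch_` names, the frame fields, and the inline capacity of 4 are all made up, and the Ruby TypedData lifecycle part is left out entirely. The idea shown is that frames live in a small inline array until it fills up, and only then move to a heap allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_INLINE_FRAMES 4 /* hypothetical inline capacity */

typedef struct { int type; int phase; } sketch_frame;

/* ptr points at inline_frames until the first spill, so the struct must not
 * be copied or moved after sketch_stack_init. */
typedef struct {
    sketch_frame *ptr;
    size_t len, cap;
    sketch_frame inline_frames[SKETCH_INLINE_FRAMES];
} sketch_frame_stack;

void sketch_stack_init(sketch_frame_stack *s)
{
    s->ptr = s->inline_frames;
    s->len = 0;
    s->cap = SKETCH_INLINE_FRAMES;
}

bool sketch_stack_on_heap(const sketch_frame_stack *s)
{
    return s->ptr != s->inline_frames;
}

/* Returns false only if memory for the spill could not be allocated. */
bool sketch_stack_push(sketch_frame_stack *s, sketch_frame frame)
{
    if (s->len == s->cap) {
        size_t new_cap = s->cap * 2;
        sketch_frame *grown;
        if (sketch_stack_on_heap(s)) {
            grown = realloc(s->ptr, new_cap * sizeof *grown);
        } else {
            /* First spill: copy the inline frames into a fresh heap block. */
            grown = malloc(new_cap * sizeof *grown);
            if (grown != NULL) memcpy(grown, s->inline_frames, s->len * sizeof *grown);
        }
        if (grown == NULL) return false;
        s->ptr = grown;
        s->cap = new_cap;
    }
    s->ptr[s->len++] = frame;
    return true;
}

sketch_frame sketch_stack_pop(sketch_frame_stack *s)
{
    assert(s->len > 0);
    return s->ptr[--s->len];
}

/* Releases the heap block, if any, and returns to the inline storage. */
void sketch_stack_free(sketch_frame_stack *s)
{
    if (sketch_stack_on_heap(s)) free(s->ptr);
    sketch_stack_init(s);
}
```

In a layout like this, shallow documents never allocate for the frame stack, while deeply nested ones pay for a heap block that has to be released on every exit path, which is presumably why the real stack is tied to a Ruby-managed object.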


+// resume purely from the frame stack. A JSON_FRAME_ROOT frame sits at the
+// bottom of the stack, so the stack is never empty mid-parse and the document
+// itself is just another frame whose value, once parsed, leaves its phase DONE.
+static VALUE json_parse_any(JSON_ParserState *state, JSON_ParserConfig *config)


byroot · 2026-06-02T06:37:13Z

At least on my machine, there is a 5-15% regression on most benchmarks.

But we might be able to reclaim some of that.

== Parsing mixed utf8 (5002001 bytes)
before:     3141.5 i/s
 after:     2803.9 i/s - 1.12x  slower

== Parsing mostly utf8 (1668001 bytes)
before:     2956.7 i/s
 after:     2870.0 i/s - same-ish: difference falls within error

== Parsing lots_unescape (40301 bytes)
before:     5971.8 i/s
 after:     5690.9 i/s - 1.05x  slower

== Parsing some_unescape (153301 bytes)
before:    37239.2 i/s
 after:    37003.2 i/s - same-ish: difference falls within error

== Parsing more_unescape (306301 bytes)
before:    16759.3 i/s
 after:    15419.9 i/s - 1.09x  slower

== Parsing small nested array (121 bytes)
before:  1414439.9 i/s
 after:  1243099.9 i/s - 1.14x  slower

== Parsing small hash (65 bytes)
before:  3654066.0 i/s
 after:  3286767.3 i/s - 1.11x  slower

== Parsing test from oj (258 bytes)
before:   639163.7 i/s
 after:   572470.2 i/s - 1.12x  slower

== Parsing integers (10001 bytes)
before:   119149.1 i/s
 after:   103511.2 i/s - 1.15x  slower

== Parsing twitter_escaped.json (562408 bytes)
before:      946.4 i/s
 after:      891.9 i/s - 1.06x  slower

== Parsing activitypub.json (58160 bytes)
before:    14002.3 i/s
 after:    13166.3 i/s - 1.06x  slower

== Parsing twitter.json (567916 bytes)
before:     1492.6 i/s
 after:     1364.4 i/s - 1.09x  slower

== Parsing citm_catalog.json (1727030 bytes)
before:      753.9 i/s
 after:      701.1 i/s - 1.08x  slower

== Parsing float parsing (2251051 bytes)
before:      290.9 i/s
 after:      275.4 i/s - 1.06x  slower

byroot · 2026-06-02T07:00:58Z

By introducing just a couple "computed gotos," the one benchmark that got hit the most can be made 15% faster: 3f15d4d

== Parsing integers (10001 bytes)
ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +PRISM [arm64-darwin25]
Warming up --------------------------------------
               after    11.867k i/100ms
Calculating -------------------------------------
               after    118.988k (± 0.5%) i/s    (8.40 μs/i) -    605.217k in   5.086353s

Comparison:
before:   103475.2 i/s
 after:   118988.4 i/s - 1.15x  faster

And it's now on par with master:

== Parsing integers (10001 bytes)
ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +PRISM [arm64-darwin25]
Warming up --------------------------------------
               after    11.981k i/100ms
Calculating -------------------------------------
               after    119.833k (± 0.5%) i/s    (8.34 μs/i) -    611.031k in   5.099014s

Comparison:
before:   119626.1 i/s
 after:   119833.2 i/s - same-ish: difference falls within error

And that's just a quick hack; I couldn't add a computed goto for the main issue in that benchmark, which is JSON_PHASE_COMMA.
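For anyone unfamiliar with the technique, here is a rough sketch of what a computed goto looks like. It is not the code from the commit above: the `sketch_` names are hypothetical, the input is limited to nested arrays of unsigned integers without whitespace, and the stack has a fixed size of 64. It relies on the "labels as values" extension (`&&label` and `goto *ptr`), so it assumes GCC or Clang rather than standard C.

```c
#include <stddef.h>

/* Phases double as indexes into the handler table below. */
enum { SKETCH_VALUE, SKETCH_COMMA, SKETCH_DONE };

/* Counts the integers in input such as "[1,[2,3],[]]". Returns -1 on a
 * syntax error or when nesting exceeds the fixed-size stack. */
long sketch_count_ints(const char *p)
{
    /* One label address per phase (GNU C extension). */
    static void *const handlers[] = { &&phase_value, &&phase_comma, &&phase_done };
    unsigned char phases[64]; /* stored phase of each open frame; [0] is the root */
    size_t depth = 0;
    long count = 0;

    phases[0] = SKETCH_VALUE;
    goto *handlers[phases[depth]];

phase_value:
    if (*p == '[') {
        p++;
        /* Record where the current frame resumes once the child is done. */
        phases[depth] = depth == 0 ? SKETCH_DONE : SKETCH_COMMA;
        if (*p == ']') { /* empty array: no frame needed */
            p++;
            goto *handlers[phases[depth]];
        }
        if (++depth == 64) return -1;
        phases[depth] = SKETCH_VALUE;
        goto phase_value; /* next phase is known statically: plain goto */
    }
    if (*p < '0' || *p > '9') return -1;
    while (*p >= '0' && *p <= '9') p++;
    count++;
    phases[depth] = depth == 0 ? SKETCH_DONE : SKETCH_COMMA;
    /* Jump straight to the next handler instead of returning to a
     * loop-and-switch dispatcher. */
    goto *handlers[phases[depth]];

phase_comma:
    if (*p == ',') {
        p++;
        phases[depth] = SKETCH_VALUE;
        goto phase_value;
    }
    if (*p == ']') {
        p++;
        depth--; /* pop, then resume the parent in whatever phase it stored */
        goto *handlers[phases[depth]];
    }
    return -1;

phase_done:
    return *p == '\0' ? count : -1;
}
```

The indirect jump matters most after a pop, where the next phase is only known at run time. Whether it is actually faster than a `switch` inside a loop depends on the compiler and CPU, so it needs benchmarking rather than assuming.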

I think if we split the COMMA phase into OBJECT_COMMA and ARRAY_COMMA, we can save more.
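To make the proposed split concrete, here is a small sketch comparing the two shapes. The names are hypothetical (`SKETCH_POP` and `SKETCH_ERROR` are pseudo-phases invented for this illustration, not part of the parser), and it only models the decision made on a single byte after a value.

```c
#include <stdbool.h>

typedef enum {
    SKETCH_VALUE, SKETCH_KEY,
    SKETCH_ARRAY_COMMA, SKETCH_OBJECT_COMMA,
    SKETCH_POP,   /* hypothetical: the container was closed */
    SKETCH_ERROR  /* hypothetical: unexpected byte */
} sketch_phase;

/* Single COMMA phase: every step has to ask which container it is in. */
sketch_phase sketch_step_combined(bool is_array, char c)
{
    if (c == ',') return is_array ? SKETCH_VALUE : SKETCH_KEY;
    if (c == (is_array ? ']' : '}')) return SKETCH_POP;
    return SKETCH_ERROR;
}

/* Split phases: the phase already encodes the container kind, so each case
 * knows its closing byte and its follow-up phase without a type check. */
sketch_phase sketch_step_split(sketch_phase phase, char c)
{
    switch (phase) {
    case SKETCH_ARRAY_COMMA:
        return c == ',' ? SKETCH_VALUE : c == ']' ? SKETCH_POP : SKETCH_ERROR;
    case SKETCH_OBJECT_COMMA:
        return c == ',' ? SKETCH_KEY : c == '}' ? SKETCH_POP : SKETCH_ERROR;
    default:
        return SKETCH_ERROR;
    }
}
```

Both functions give the same answers; the split version just moves the array-or-object question to the moment the phase is stored, which happens once per value rather than being re-checked inside the hot separator handling.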

This also brings twitter.json back to no more than about 3% slower:

== Parsing twitter.json (567916 bytes)
ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +PRISM [arm64-darwin25]
Warming up --------------------------------------
               after   146.000 i/100ms
Calculating -------------------------------------
               after      1.394k (± 1.1%) i/s  (717.47 μs/i) -      7.008k in   5.028029s

Comparison:
before:     1436.5 i/s
 after:     1393.8 i/s - 1.03x  slower

# second try

== Parsing twitter.json (567916 bytes)
ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +PRISM [arm64-darwin25]
Warming up --------------------------------------
               after   145.000 i/100ms
Calculating -------------------------------------
               after      1.402k (± 1.7%) i/s  (713.33 μs/i) -      7.105k in   5.068227s

Comparison:
before:     1433.5 i/s
 after:     1401.9 i/s - same-ish: difference falls within error


byroot · 2026-06-02T08:08:44Z

After the last change, most benchmarks are now on par, if not a little bit faster. A few are still a little bit slower, but I'll see what I can do. Also, there's always a bit of variance, so 1.04x might not be very significant:

== Parsing mixed utf8 (5002001 bytes)
before:     3151.8 i/s
 after:     2795.7 i/s - 1.13x  slower

== Parsing mostly utf8 (1668001 bytes)
before:     3027.4 i/s
 after:     2869.1 i/s - 1.06x  slower

== Parsing lots_unescape (40301 bytes)
before:     5910.0 i/s
 after:     5934.9 i/s - same-ish: difference falls within error

== Parsing some_unescape (153301 bytes)
before:    35750.6 i/s
 after:    37184.6 i/s - same-ish: difference falls within error

== Parsing more_unescape (306301 bytes)
before:    16661.9 i/s
 after:    16800.5 i/s - same-ish: difference falls within error

== Parsing small nested array (121 bytes)
before:  1423876.0 i/s
 after:  1478526.9 i/s - 1.04x  faster

== Parsing small hash (65 bytes)
before:  3649885.1 i/s
 after:  3508125.7 i/s - 1.04x  slower

== Parsing test from oj (258 bytes)
before:   643706.6 i/s
 after:   609923.1 i/s - 1.06x  slower

== Parsing integers (10001 bytes)
before:   119193.6 i/s
 after:   132792.5 i/s - 1.11x  faster

== Parsing twitter_escaped.json (562408 bytes)
before:      942.2 i/s
 after:      931.2 i/s - same-ish: difference falls within error

== Parsing activitypub.json (58160 bytes)
before:    13917.0 i/s
 after:    13262.0 i/s - 1.05x  slower

== Parsing twitter.json (567916 bytes)
before:     1481.9 i/s
 after:     1485.0 i/s - same-ish: difference falls within error

== Parsing citm_catalog.json (1727030 bytes)
before:      755.5 i/s
 after:      744.3 i/s - 1.02x  slower

== Parsing float parsing (2251051 bytes)
before:      293.6 i/s
 after:      305.1 i/s - 1.04x  faster

byroot · 2026-06-02T08:14:52Z

I just realized I made a mistake when benchmarking the initial version: the regression was 5-15% across the board. I updated my previous comment.

kddnewton · 2026-06-02T13:42:48Z

Well that's the last time I try to benchmark things on my machine, lol. I genuinely thought they were about equal, sorry about that.

byroot · 2026-06-02T15:45:38Z

😂 no worries. To be fair, I never took the time to clean up and commit the benchmark harness I'm using.

kddnewton added 2 commits June 1, 2026 12:47

Make the JSON parse loop iterative

0efe6c1

As opposed to a recursive loop. We do this by keeping a stack of frames (very similar to how the stack of values was already stored). Each frame represents the state of a container. Since there are only 2 in JSON, it doesn't have to get too complex.

JSON iterative parsing phases

05d56b9

Each frame in the iterative parser now holds an enum describing its "phase", in order to support suspending parsing.

kddnewton mentioned this pull request Jun 1, 2026

Add support for parsing chunked data #983

Open

kou requested a review from Copilot June 2, 2026 01:05

Copilot started reviewing on behalf of kou June 2, 2026 01:05

Copilot AI reviewed Jun 2, 2026

byroot added 3 commits June 2, 2026 09:21

parser.c: Split JSON_PHASE_COMMA

cc5bedd

JSON_PHASE_ARRAY_COMMA and JSON_PHASE_OBJECT_COMMA Allows to remove one conditional.

json_parse_any: introduce computed gotos

78c2585

Saves having to go through the dispatch loop again.

parser.c: Refactor json_frame_stack_push

8748a7d

Take less arguments so it's easier to read.

kddnewton wants to merge 5 commits into master from flatten