I’m implementing a function that should parse a large amount (multiple terabytes) of JSON data whose schema is known before processing:
Iterator<JsonObject> parseJson(Schema schema, Iterator<String> data) {
    var parser = createParser(schema);
    return new Iterator<JsonObject>() {
        @Override
        public boolean hasNext() {
            return data.hasNext();
        }

        @Override
        public JsonObject next() {
            return parser.parse(data.next());
        }
    };
}
I want to leverage the fact that the schema is known in order to parse as fast as possible. Are there any existing solutions, like using Jackson Blackbird but constructing it from my custom schema object instead of a class? Will it be worth it even though the result is not a class but a generic JSON object?
3 Answers
I don’t know about existing solutions, but here is something similar to what you’ve tried, using streams; maybe it helps:
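A minimal sketch of that idea, reusing the Schema, JsonObject and createParser names from the question (they are not defined here):

import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Schema, JsonObject and createParser are the question's own types/helpers, reused here.
Stream<JsonObject> parseJson(Schema schema, Iterator<String> data) {
    var parser = createParser(schema);
    // Wrap the line iterator in a lazy, ordered, sequential Stream ...
    Stream<String> lines = StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(data, Spliterator.ORDERED | Spliterator.NONNULL),
            false);
    // ... and parse each line on demand.
    return lines.map(parser::parse);
}

The Stream stays lazy, so lines are still parsed one at a time as the consumer pulls them.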
I’m not aware of any existing solutions.
It is worth noting that the Blackbird README itself only claims performance that is "up to 20% better", and only "in some cases". That is not a huge improvement, especially with that qualification.
I don’t think that an improvement of "up to 20%" is going to be worth the effort of developing your own solution. (However, that’s just my opinion. For all I know, a 20% improvement could be a deal-maker … for you.)
Blackbird and the older Afterburner got much of their performance improvement by avoiding reflection when operating on (reading or constructing) POJOs. If you are using a JSONObject representation or similar instead of POJOs, you no longer have that aspect to (potentially) optimize.
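For context, this is roughly how Blackbird is normally enabled for POJO binding; the Measurement class and the sample JSON are made up, and the jackson-module-blackbird dependency is assumed:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;
import com.fasterxml.jackson.module.blackbird.BlackbirdModule;

public class BlackbirdPojoExample {
    // Hypothetical POJO matching the schema; Blackbird's gains come from binding to
    // classes like this, not from building a generic JsonObject/tree representation.
    public static class Measurement {
        public String sensor;
        public double value;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = JsonMapper.builder()
                .addModule(new BlackbirdModule()) // avoids reflection-based property access for POJOs
                .build();

        Measurement m = mapper.readValue("{\"sensor\":\"s1\",\"value\":4.2}", Measurement.class);
        System.out.println(m.sensor + " = " + m.value); // prints: s1 = 4.2
    }
}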
When I first read this question, I thought "Hey, why not translate your custom schemas into grammars suitable for ANTLR or JavaCC or similar, and then generate fast parsers?"
But it won’t work. The problem is tokenization. Unfortunately, in the JSON language string literals are used both as primitive values and as attribute names in objects, and a conventional grammar notation cannot express something like "a string token whose value is 'xyz'".
I guess you might be able to fudge it with a context-sensitive lexer, but my guess is that this would have performance downsides that would tend to negate the benefits of an efficient parser produced by a parser generator system (PGS).
Another approach might be to treat individual characters as tokens. That would be horrible if you were writing the grammar by hand, but for a generated grammar (that nobody needs to read) it might be practical. (Or not. For example, the generated grammars could turn out to be unsuitable for a PGS. It gets technical …)
However … it might work. But you won’t find out until someone tries it and then measures how much performance improvement they actually got for all of the developer effort expended.
Instead of implementing a JSON parser manually, I prefer to parse the JSON with a library like Jackson.
mapper.readTree would be useful for parsing a JSON String directly into a generic tree model (JsonNode), even when the String is large.
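A minimal sketch of that approach, assuming Jackson’s ObjectMapper and a made-up sample line:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReadTreeExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // One JSON line of the kind the question's Iterator<String> would supply (sample data).
        String line = "{\"sensor\":\"s1\",\"value\":4.2}";

        // readTree parses the String into a generic JsonNode tree, no target class required.
        JsonNode node = mapper.readTree(line);

        System.out.println(node.get("sensor").asText());  // prints: s1
        System.out.println(node.get("value").asDouble()); // prints: 4.2
    }
}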