JSON Lines format: Why JSONL is better than regular JSON for web scraping

CSV and JSON formats introduction

Comma-Separated Values (CSV) is a common data exchange format, widely used to represent sets of records that share an identical list of fields.

JavaScript Object Notation (JSON) has become the de facto standard data exchange format, replacing XML, which was a huge buzzword in the early 2000s. It is not only self-describing but also human-readable.

Let’s look at examples of both formats.

Here is a list of families represented as CSV data:

id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2

CSV looks a lot simpler than the equivalent JSON array shown below:

[
{"id":1,"father":"Mark","mother":"Charlotte","children":1},
{"id":2,"father":"John","mother":"Ann","children":3},
{"id":3,"father":"Bob","mother":"Monika","children":2}
]

But CSV is limited to storing two-dimensional, untyped data. There is no way to store nested structures or typed values, such as the names of the children, in plain CSV.

[
{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]},
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]},
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
]

Representing nested structures in JSON files is easy, though, as the array above shows.
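To make the difference concrete, here is a minimal Python sketch that loads the two samples above with the standard `csv` and `json` modules. The variable names (`CSV_TEXT`, `JSON_TEXT`, `csv_rows`, `json_rows`) are chosen for illustration only.

```python
import csv
import io
import json

CSV_TEXT = """id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2
"""

JSON_TEXT = """[
{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]},
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]},
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
]"""

# csv.DictReader returns every field as a string: CSV carries no type information.
csv_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
print(type(csv_rows[0]["id"]).__name__, repr(csv_rows[1]["children"]))

# json.loads restores numbers and nested lists without any extra work.
json_rows = json.loads(JSON_TEXT)
print(type(json_rows[0]["id"]).__name__, json_rows[1]["children"])
```

With CSV, the caller has to convert `id` to a number and invent its own convention for the list of names; with JSON, both come back ready to use.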

Why not just wrap all the data in a regular JSON array so that the file itself is valid JSON?

To insert or read a record in a JSON array you have to parse the whole file, which is far from ideal.
Since every entry in JSON Lines is valid JSON, it can be parsed/unmarshaled as a standalone JSON document. For example, you can seek within the file or split a 10 GB file into smaller files without parsing the entire thing.
1. There is no need to read the whole file into memory before parsing it.
2. You can add further records by simply appending lines to the file. If the entire file were a JSON array, you would have to parse it, add the new record, and then convert it back to JSON.
So it is not practical to keep a multi-gigabyte file as a single JSON array. Because Dataflow kit users need to store and parse large volumes of data, we’ve implemented export to the JSONL format.
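The difference between the two append strategies can be sketched in Python as follows. The function names `append_record` and `append_to_json_array` and the file name `families.jsonl` are hypothetical; this is an illustration, not how Dataflow kit implements its export.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one record to a JSON Lines file without reading what is already there."""
    with open(path, "a", encoding="utf-8") as f:
        # json.dumps emits no raw newlines by default, so one record stays on one line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def append_to_json_array(path, record):
    """Add one record to a JSON array file: parse everything, append, rewrite everything."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    records.append(record)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "families.jsonl")
    append_record(path, {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]})
    append_record(path, {"id": 2, "father": "John", "mother": "Ann", "children": ["Jessika", "Antony", "Jack"]})
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(len(lines))
```

The cost of `append_record` does not depend on the size of the file, whereas `append_to_json_array` reads and rewrites the whole file on every call.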

JSON Lines (jsonl), newline-delimited JSON (ndjson), and line-delimited JSON (ldjson) are three terms for the same format, which is primarily intended for JSON streaming.

Let’s look into what JSON Lines is, and how it compares to other JSON streaming formats.

JSON Lines vs. JSON

The same list of families expressed in JSON Lines format looks like this:

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

A JSON Lines file consists of several lines, each of which is a valid JSON value (most commonly an object), separated by the newline character `\n`.

It doesn’t require custom parsers. Just read a line, parse as JSON, read a line, parse as JSON… and so on.
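A minimal Python sketch of that loop, using only the standard `json` module, might look like this. The helper name `read_jsonl` is hypothetical, and skipping blank lines is a lenient assumption rather than something the format requires.

```python
import io
import json

def read_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # assumption: tolerate blank lines instead of failing on them
            yield json.loads(line)

# An in-memory stand-in for an open file; any iterable of lines works the same way.
sample = io.StringIO(
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

for family in read_jsonl(sample):
    print(family["father"], len(family["children"]))
```

Because `read_jsonl` is a generator, only one line is held in memory at a time, regardless of the size of the file.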

In fact, JSONL is already very common in industry.

See the JSON Lines specification for more details.

JSON Lines vs. JSON text sequences

Let’s compare NDJSON with the JSON text sequence format (RFC 7464) and its associated media type “application/json-seq”. A JSON text sequence consists of any number of JSON texts, all encoded in UTF-8, each prefixed by an ASCII Record Separator (0x1E) and each ending with an ASCII Line Feed character (0x0A).

Let’s look at the list of families shown above expressed as a JSON text sequence file:

<RS>{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}<LF>
<RS>{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}<LF>
<RS>{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}<LF>

<RS> here is a placeholder for the non-printable ASCII Record Separator (0x1E). <LF> represents the line feed character.
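The framing described above can be sketched in a few lines of Python. The function names `encode_json_seq` and `decode_json_seq` are hypothetical, and the decoder is deliberately simplified: it assumes every record is complete and does not implement any recovery from truncated or corrupted records.

```python
import json

RS = "\x1e"  # ASCII Record Separator (0x1E)
LF = "\n"    # ASCII Line Feed (0x0A)

def encode_json_seq(records):
    """Frame each record as <RS> + JSON text + <LF>."""
    return "".join(RS + json.dumps(record) + LF for record in records)

def decode_json_seq(text):
    """Split on <RS> and parse each chunk; assumes well-formed, complete records."""
    for chunk in text.split(RS):
        chunk = chunk.strip()
        if chunk:
            yield json.loads(chunk)

families = [
    {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]},
    {"id": 2, "father": "John", "mother": "Ann", "children": ["Jessika", "Antony", "Jack"]},
]

encoded = encode_json_seq(families)
print(repr(encoded[:12]))  # repr makes the non-printable separator visible
print(list(decode_json_seq(encoded)) == families)
```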

The format looks almost identical to JSON Lines, except for the special character at the beginning of each record.

Since these two formats are so similar, you may wonder why they both exist.

The JSON text sequence format is intended for streaming contexts, so it does not define a corresponding file extension, although its specification does register the new MIME media type application/json-seq. Storing and editing this format in a text editor is error-prone, because the non-printable (0x1E) character may be garbled.

Consider using JSON Lines consistently as the alternative.

JSON Lines vs. Concatenated JSON

Another alternative to JSON Lines is concatenated JSON. In this format the JSON texts are not separated from each other at all.

Here is the concatenated JSON representation of the example above:

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

Concatenated JSON isn’t a new format; it’s simply a name for streaming multiple JSON objects without any delimiters.

While generating concatenated JSON is not a complex task, parsing it requires significant effort. You need a context-aware parser that detects where each record ends and separates the records from each other correctly.
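In Python, one way to get such a context-aware parser without writing it by hand is `json.JSONDecoder.raw_decode`, which parses a single JSON value starting at a given index and returns the index where it stopped. The sketch below assumes the whole input fits in memory as one string; the function name `iter_concatenated` is hypothetical.

```python
import json

def iter_concatenated(text):
    """Yield each JSON value from a string of back-to-back JSON texts."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not accept leading whitespace, so skip it manually.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Parse one value starting at pos; the decoder reports where that value ended.
        record, pos = decoder.raw_decode(text, pos)
        yield record

data = (
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}'
)

for family in iter_concatenated(data):
    print(family["id"], family["children"])
```

Splitting the input on `}{` would not be a safe shortcut, because that sequence can legitimately appear inside a string value; the decoder tracks that context for you. A truly streaming reader would also have to handle values that are split across read buffers, which this sketch does not attempt.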

Pretty printed JSON formats

If you have large nested structures, reading the JSON Lines text directly isn’t recommended. Use the jq tool to make viewing large structures easier:

grep . families.jsonl | jq .

As a result, you will see pretty-printed JSON:

{
  "id": 1,
  "father": "Mark",
  "mother": "Charlotte",
  "children": [
    "Tom"
  ]
}
{
  "id": 2,
  "father": "John",
  "mother": "Ann",
  "children": [
    "Jessika",
    "Antony",
    "Jack"
  ]
}
{
  "id": 3,
  "father": "Bob",
  "mother": "Monika",
  "children": [
    "Jerry",
    "Karol"
  ]
}

Conclusion

Taken as a whole, a complete JSON Lines file is technically not valid JSON, because it contains multiple JSON texts.
Because every line is a separate entry, a JSON Lines file is streamable: you can read just as many lines as you need and get exactly that many records.
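As a small illustration of that last point, the following Python sketch reads only the first few records of a stream and leaves the rest untouched. The helper name `head` is hypothetical, and the in-memory `io.StringIO` stands in for a real file.

```python
import io
import itertools
import json

def head(lines, n):
    """Parse only the first n lines of a JSON Lines stream."""
    return [json.loads(line) for line in itertools.islice(lines, n)]

stream = io.StringIO(
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

first_two = head(stream, 2)
print([family["father"] for family in first_two])
```

The third record is never parsed; it stays in the stream for whoever reads next.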

More by Dmitry Narizhnykh
