json2tsv: a JSON to TSV converter

Last modification on

json2tsv reads JSON data from stdin. It outputs each JSON type to a TAB-Separated Value format per line.

TAB-Separated Value format

The output format per line is:

nodename<TAB>type<TAB>value<LF>

Control-characters such as a newline, TAB and backslash (\n, \t and \\) are escaped in the nodename and value fields. Other control-characters are removed.

The type field is a single byte and can be:

  • a for array
  • b for bool
  • n for number
  • o for object
  • s for string
  • ? for null

Filtering on the first field "nodename" is easy using awk for example.

Clone

git clone git://git.codemadness.org/json2tsv

Download releases

Releases are available at: https://codemadness.org/releases/json2tsv/.

Build and install

$ make
# make install

Features

  • Accepts all valid JSON.
  • Designed to work well with existing UNIX programs like awk and grep.
  • Straightforward and not much lines of code: about 425 lines of C.
  • Few dependencies: C compiler (C99), libc.
  • No need to learn a new (meta-)language for processing data.
  • The parser supports codepoint decoding (\u2303) and UTF-16 surrogates (\ud83d\ude02) to UTF-8.
  • It does not output control-characters to the terminal for security reasons by default (but it has a -r option if needed).
  • On OpenBSD it supports pledge(2) for syscall restriction, pledge("stdio", NULL).

Cons

  • For the tool there is additional overhead by processing and filtering data from stdin after parsing.
  • The parser does not do complete validation on numbers.
  • The parser accepts some bad input such as invalid UTF-8 (see RFC8259 - 8.1. Character Encoding). json2tsv reads from stdin and does not do assumptions about a "closed ecosystem" as described in the RFC.
  • The parser accepts some bad JSON input and "extensions" (see RFC8259 - 9. Parsers).
  • The parser is not the fastest possible JSON parser (but also not the slowest). For example: for easy-of-use, at the cost of performance all strings are decoded, even though they may be unused.

Why Yet Another JSON parser?

I wanted a tool that makes parsing JSON easier and work well from the shell, similar to jq.

sed and grep often work well enough for matching some value using some regex pattern, but it is not good enough to parse JSON correctly or to extract all information: just like parsing HTML/XML using some regex is not good (enough) or a good idea :P.

I didn't want to learn a new specific meta-language which jq has and wanted something simpler. While it is more efficient to embed this query language for data aggregation, it is also less simple. In my opinion it is simpler to separate this and use pattern-processing by awk or an other filtering/aggregating program.

For the parser, there are many JSON parsers out there, like the efficient jsmn parser, however a few parser behaviours I want to have are:

  • jsmn buffers data as tokens, which is very efficient, but also a bit annoying as an API as it requires another layer of code to interpret the tokens.
  • jsmn does not handle decoding strings by default. Which is very efficient if you don't need parts of the data though.
  • jsmn does not keep context of nested structures by default, so again requires writing custom utility functions for nested data.

I went for a parser design that uses a single callback per "node" type and keeps track of the current nested structure in a single array and emits that.

Examples

An usage example to parse posts of the JSON API of reddit.com and format them to a plain-text list using awk:

#!/bin/sh
curl -s -H 'User-Agent:' 'https://old.reddit.com/.json?raw_json=1&limit=100' | \
json2tsv | \
awk -F '\t' '
function show() {
	if (length(o["title"]) == 0)
		return;
	print n ". " o["title"] " by " o["author"] " in r/" o["subreddit"];
	print o["url"];
	print "";
}
$1 == ".data.children[].data" {
	show();
	n++;
	delete o;
}
$1 ~ /^\.data\.children\[\]\.data\.[a-zA-Z0-9_]*$/ {
	o[substr($1, 23)] = $3;
}
END {
	show();
}'

An example to list the VODS from a channel on the streaming website twitch.tv using the Twitch.tv Helix API (requires an API key):

#!/bin/sh
# Client-ID / API key.
clientid="clientidhere"

# request(url)
request() {
	curl \
		-s \
		-H "Client-ID: ${clientid}" \
		-H "User-Agent:" \
		"$1"
}

# getuseridbyname(username)
getuseridbyname() {
	request "https://api.twitch.tv/helix/users?login=$1" | \
		json2tsv | \
		awk -F "\t" '$1 == ".data[].id" { print $3; exit; }'
}

# getvodsbyuserid(userid) {
getvodsbyuserid() {
	request "https://api.twitch.tv/helix/videos?first=100&user_id=$1" | \
		json2tsv | \
		awk -F "\t" '
$1 == ".data[].title" {
	print "Title:        " $3;
}
$1 == ".data[].published_at" {
	print "Published at: " $3;
}
$1 == ".data[].url" {
	print "Url:          " $3;
}
$1 == ".data[].duration" {
	print "Duration:     " $3;
	print "==="; # assume ordered in this way and last.
}'
}

username="$1"
if test x"${username}" = x""; then
	echo "usage: $0 <username>" >&2
	exit 1
fi

userid=$(getuseridbyname "${username}")
if test x"${userid}" = x""; then
	echo "user not found" >&2
	exit 1
fi

getvodsbyuserid "${userid}"

References