I have hands, how can I mine OONI data?

OONI Explorer and the OONI API provide access to the data OONI gathers. Since these tools currently have some limitations, we are publishing this post to share other options for researchers interested in digging through OONI data. These options are either currently available or can be made available upon request.

There is another important caveat: the data uploaded by OONI Probes is made available in daily batches, so it may take up to 48 hours for measurements to become available in OONI Explorer, the OONI API, and other sources of OONI data.

Raw data

Thanks to the Amazon Open Data program, the whole OONI dataset can be fetched from the ooni-data Amazon S3 bucket.

To access the S3 buckets we recommend you either use the AWS Command Line Interface or the AWS SDK for Python (boto).
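For example, here is a minimal sketch using boto3 (the current AWS SDK for Python; the exact client setup is our suggestion, not something the bucket requires) that lists a few objects without needing AWS credentials:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The ooni-data bucket is public, so unsigned (anonymous) requests work fine.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="ooni-data",
                          Prefix="autoclaved/jsonl/2017-11-23/",
                          MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])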

There are two prefixes available within the bucket, described in the sections below: autoclaved/jsonl/ and autoclaved/jsonl.tar.lz4/.

In each of these prefixes you will find one directory (a “daily bucket”) for every day of the year (example: s3://ooni-data/autoclaved/jsonl/2017-11-23). It’s worth noting that the “daily bucket” dates are not necessarily the dates of the measurements, but the dates when particular sets of measurements were processed by the pipeline.

There is also the legacy sanitised/ prefix in the bucket, but it hasn’t been updated since 2017-10-02 and is scheduled for removal.

You can find information about the base data formats inside of ooni-spec/data-formats.

The data format for every test is also specified inside of ooni-spec/test-specs.

Note: The JSON schema is not entirely enforced on data ingestion, so there may be some slight differences between the schema specification and the actual data.

The command-line lz4 tool supporting the LZ4 format is packaged as liblz4-tool in Debian 9 (stretch) and Ubuntu 16.04 (xenial); older versions may fail with the following error message: Error 64 : Does not support stream size.

jsonl

Each file inside of the jsonl “daily bucket” has the following format:

${UTCTIMESTAMP}-${COUNTRY_CODE}-AS${AS_NUMBER}-${TEST_NAME}-${REPORT_ID}-0.2.0-probe.json

Example:

20171123T012056Z-YE-AS30873-ndt-20171122T162015Z_AS30873_AkAI5sg9XxaVlGlGaGO1fiab5M03iu7ntXEiT2uN5ojtBXIdzr-0.2.0-probe.json
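If you want to pick out files by country, ASN or test name before downloading anything, these fields can be parsed straight out of the key. A small sketch; the regular expression is our own reading of the pattern above, not an official parser:

import re

# Hypothetical parser for the jsonl file naming scheme described above.
FNAME_RE = re.compile(
    r"^(?P<timestamp>\d{8}T\d{6}Z)-(?P<probe_cc>[A-Z]{2})-AS(?P<probe_asn>\d+)-"
    r"(?P<test_name>[a-z0-9_]+)-(?P<report_id>[^-]+)-0\.2\.0-probe\.json$"
)

m = FNAME_RE.match(
    "20171123T012056Z-YE-AS30873-ndt-"
    "20171122T162015Z_AS30873_AkAI5sg9XxaVlGlGaGO1fiab5M03iu7ntXEiT2uN5ojtBXIdzr-"
    "0.2.0-probe.json"
)
print(m.group("probe_cc"), m.group("test_name"))  # YE ndt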

You can list the files related to a particular date using the AWS CLI:

aws s3 ls s3://ooni-data/autoclaved/jsonl/2017-11-23/
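Each of these files holds one JSON-serialised measurement per line, so reading one is straightforward. A rough sketch, again using unsigned boto3 requests and the example file name from above (the fields printed at the end are part of the base data format):

import json

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
key = ("autoclaved/jsonl/2017-11-23/"
       "20171123T012056Z-YE-AS30873-ndt-"
       "20171122T162015Z_AS30873_AkAI5sg9XxaVlGlGaGO1fiab5M03iu7ntXEiT2uN5ojtBXIdzr-"
       "0.2.0-probe.json")
body = s3.get_object(Bucket="ooni-data", Key=key)["Body"].read()

# One measurement per line; print a few top-level base data format fields.
for line in body.splitlines():
    measurement = json.loads(line)
    print(measurement["probe_cc"], measurement["test_name"],
          measurement["measurement_start_time"])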

jsonl.tar.lz4

The jsonl.tar.lz4 “daily buckets” contain one or more LZ4 compressed files for every OONI Probe test type.

For small sets of files, the naming format is: ${TEST_NAME}.${NUM}.tar.lz4

Example:

...
2017-11-24 02:29:10    1485166 tcp_connect.0.tar.lz4
2017-11-24 02:29:01      18463 telegram.0.tar.lz4
2017-11-24 02:29:01      86817 vanilla_tor.0.tar.lz4
2017-11-24 02:29:01   10949785 web_connectivity.00.tar.lz4
2017-11-24 02:29:02   10151077 web_connectivity.01.tar.lz4
2017-11-24 02:29:02   10698714 web_connectivity.02.tar.lz4
2017-11-24 02:29:04   10664260 web_connectivity.03.tar.lz4
...

Larger files (currently those over 64 MB) keep the same naming format as jsonl, with an .lz4 extension appended.

Example:

...
2017-11-24 02:28:57   23481332 20171122T190819Z-RO-AS8708-web_connectivity-20171122T190820Z_AS8708_btrNW56GZOToKz1RAIxqRBuEsjAVeI4lp3Rt0qd4owWqUcYdTY-0.2.0-probe.json.lz4
2017-11-24 02:28:58   27249867 20171122T230746Z-US-AS20001-web_connectivity-20171122T230747Z_AS20001_FvWLlFFUg2K7UCY9BCZsdv3qp2DvhPPl2WHFmvmgUJ7sYaWOrJ-0.2.0-probe.json.lz4
...

You can list the files related to a particular date using the AWS CLI:

aws s3 ls s3://ooni-data/autoclaved/jsonl.tar.lz4/2017-11-23/

You should run the aws command with the --no-sign-request option to disable signing of requests (e.g. aws --no-sign-request s3 ls s3://ooni-data/autoclaved/jsonl.tar.lz4/2017-11-23/).

A gzip-compressed, newline-separated JSON index file (index.json.gz) is also available in the root of every “daily bucket” (example: s3://ooni-data/autoclaved/jsonl.tar.lz4/2017-11-23/index.json.gz). The file stores metadata that makes the tarballs seekable. You don’t have to parse the information on LZ4 frames unless you want to seek() to a specific measurement, as decompressing the tarballs with tar -I lz4 --extract … should work.
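If you prefer to stay in Python rather than shelling out to tar, the following rough sketch downloads one of the smaller tarballs and walks through its members. It assumes the python-lz4 package (lz4.frame) and uses one of the file names from the listing above, assuming it sits in the 2017-11-23 daily bucket; since the tarballs are stored as a series of concatenated LZ4 frames, the sketch decompresses them frame by frame:

import io
import json
import tarfile

import boto3
import lz4.frame  # pip install lz4
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
key = "autoclaved/jsonl.tar.lz4/2017-11-23/telegram.0.tar.lz4"
compressed = s3.get_object(Bucket="ooni-data", Key=key)["Body"].read()

# Decompress every LZ4 frame in the file and glue the results back together.
chunks = []
data = compressed
while data:
    decompressor = lz4.frame.LZ4FrameDecompressor()
    chunks.append(decompressor.decompress(data))
    data = decompressor.unused_data  # whatever belongs to the next frame

# The decompressed payload is a plain tarball; each member is a
# newline-separated JSON file, just like the files under the jsonl/ prefix.
with tarfile.open(fileobj=io.BytesIO(b"".join(chunks))) as tar:
    for member in tar:
        fobj = tar.extractfile(member)
        if fobj is None:
            continue
        for line in fobj:
            measurement = json.loads(line)
            print(member.name, measurement["probe_cc"])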

Every row of the index file contains a JSON document with metadata that is useful to find the measurements you care about inside of a given daily bucket.

The ordering of the rows inside of the index file matters!

Each document has a type key that identifies what kind of record the row describes.
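As a first step you can pull an index file down and look at the documents it contains. A minimal sketch that tallies the type values:

import gzip
import json

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
obj = s3.get_object(
    Bucket="ooni-data",
    Key="autoclaved/jsonl.tar.lz4/2017-11-23/index.json.gz",
)
index = gzip.decompress(obj["Body"].read()).decode("utf-8")

# Count how many documents of each type the index holds.
counts = {}
for line in index.splitlines():
    doc = json.loads(line)
    counts[doc["type"]] = counts.get(doc["type"], 0) + 1
print(counts)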

Metadata as PostgreSQL dump

If you don’t need the whole dataset, you can get a PostgreSQL (9.6) database dump that holds some metadata about every measurement collected, and run SQL queries against it. That may be much faster if you need aggregate statistics, or if you need to know which subset of measurements to download for further processing.

The most sizable data removed from the dump is the response bodies of web pages. The uncompressed database size is ~230 gigabytes (as of 2018-02-03), including some indexes, and it grows at ~0.75 gigabytes per day. The dump compressed for data transfer is around three times smaller. Let us know if that’s useful for you!
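To give a feel for what working with the dump looks like, here is a sketch using psycopg2 against a locally restored copy. The database name and the table and column layout below (measurement, test_name, probe_cc, measurement_start_time) are illustrative guesses only; check the schema of the dump you receive before writing queries:

import psycopg2

# Hypothetical connection and schema; adjust to the database you restored
# the dump into and to its actual table layout.
conn = psycopg2.connect(dbname="ooni_metadata", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT probe_cc, COUNT(*)
        FROM measurement
        WHERE test_name = %s AND measurement_start_time >= %s
        GROUP BY probe_cc
        ORDER BY COUNT(*) DESC
        LIMIT 10
        """,
        ("web_connectivity", "2017-11-01"),
    )
    for probe_cc, n in cur.fetchall():
        print(probe_cc, n)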

Metadata as a service

If the metadata database is enough (or useful) for your analysis and you know how to run SQL queries, but you don’t have the resources to run your own PostgreSQL server instance, you can request access to the sandboxed, read-only OONI metadata DB instance.

OONI API

The OONI API is nice for cursory analysis or for integrations with other systems (e.g. OONI Explorer relies on it), but it’s currently not possible to run a dataset scan that takes more than a minute through the API, so it’s not the best option for queries that do heavy scanning of the metadata. Also, the implementation of pagination via next_url in the OONI API is far from perfect and may fail with non-zero offsets.

Rule of thumb: if the OONI API is slow for you (i.e. your request takes more than half a minute) or you need more than a couple of thousand API requests to achieve your goal, you should consider sending SQL queries directly to some instance of the metadata DB, as you’ll likely achieve your goal significantly faster that way.
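For the lightweight use cases where the API does fit, here is a minimal sketch using the measurements listing endpoint (parameter and response field names reflect the API as we understand it; check the documentation at api.ooni.io for the authoritative interface):

import requests

# Example query: recent web_connectivity measurements reported from Yemen.
resp = requests.get(
    "https://api.ooni.io/api/v1/measurements",
    params={"probe_cc": "YE", "test_name": "web_connectivity", "limit": 50},
)
resp.raise_for_status()
page = resp.json()
for result in page.get("results", []):
    print(result.get("measurement_start_time"), result.get("input"))

# page["metadata"]["next_url"] points at the next page of results; see the
# caveat above about pagination failing with non-zero offsets.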

Outro

This post does not represent the desired state of the OONI API and OONI data availability, but rather highlights current limitations and possible alternative methods for achieving various research goals.