As data leaks move into the terabytes, journalists need tools to search, analyse and collaborate on their investigations. We will cover the technical lessons learnt over two years of development at the Guardian as we built our platform to run both in the cloud and entirely offline in an air-gapped environment.
We will introduce GIANT, the Guardian’s new platform for searching, analysing and collaborating on investigations backed by data leaks.
With the size of leaks increasing (Edward Snowden: 55,000 files; the Paradise Papers: 13.4 million), the Guardian has built its own platform for analysis, which has already seen success on several projects, most notably the Daphne Project, which continues the work of the journalist Daphne Caruana Galizia.
In the talk we will cover how we designed our data model to handle “any” possible file type and scale up to terabytes of stored data. We will discuss how we use Neo4j to reconstruct the threads of conversation between individuals and companies identified in the data, and the surprising limits that come with using a graph database as our storage system of record.
We will also dive into our use of Elasticsearch, in particular how best to support leaks containing multiple languages and how we were able to add full Russian and Arabic language support to an existing dataset whilst the journalists continued their investigation using the tool.
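To make the multi-language idea concrete, here is a minimal sketch of one common approach: index the same text under one sub-field per language, each with its own analyser. It is an illustration under stated assumptions, not a description of GIANT's actual mapping. The field name `text` and the helper functions are hypothetical; `english`, `russian` and `arabic` are Elasticsearch's built-in language analysers.

```python
import json

# Hypothetical field name and language list; not taken from GIANT itself.
TEXT_FIELD = "text"
LANGUAGE_ANALYZERS = {"english": "english", "russian": "russian", "arabic": "arabic"}


def build_text_mapping(languages):
    """Return a mapping body with one analysed sub-field per language."""
    sub_fields = {
        lang: {"type": "text", "analyzer": LANGUAGE_ANALYZERS[lang]}
        for lang in languages
    }
    return {"properties": {TEXT_FIELD: {"type": "text", "fields": sub_fields}}}


def query_fields(languages):
    """Field names to search across, for example in a multi_match query."""
    return [TEXT_FIELD] + [f"{TEXT_FIELD}.{lang}" for lang in languages]


# Extending the language list only adds sub-fields; existing ones are unchanged.
original = build_text_mapping(["english"])
extended = build_text_mapping(["english", "russian", "arabic"])
print(json.dumps(extended, indent=2))
```

Because the extended mapping only adds sub-fields, it could in principle be applied to a live index, with existing documents re-processed in the background so that the new sub-fields are populated. Whether that matches the approach taken in GIANT is an assumption of this sketch.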
We will also discuss our extractors, the system of plugins that process files when we receive them. We will cover the lessons learnt as we moved from calling in-process code in the JVM to Docker and containerisation, which lets us not only take advantage of the wide ecosystem of open source processing tools but also scale out our computation, both in AWS and in our completely offline, air-gapped deployment for more sensitive data.
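As a rough illustration of the plugin idea, the sketch below registers extractors by MIME type and recurses into any files an extractor produces. Every name in it (`Blob`, `Result`, `extractor`, `process`) is hypothetical and the extractors run in-process for simplicity; in a containerised design each registered function might instead hand the file to a container.

```python
import io
import zipfile
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Blob:
    name: str
    mime_type: str
    data: bytes


@dataclass
class Result:
    text: str = ""
    children: List[Blob] = field(default_factory=list)


# Hypothetical registry: MIME type -> extractors that can handle it.
REGISTRY: Dict[str, List[Callable[[Blob], Result]]] = {}


def extractor(*mime_types):
    """Decorator that registers a function for the given MIME types."""
    def register(fn):
        for mime_type in mime_types:
            REGISTRY.setdefault(mime_type, []).append(fn)
        return fn
    return register


@extractor("text/plain")
def plain_text(blob):
    return Result(text=blob.data.decode("utf-8", errors="replace"))


@extractor("application/zip")
def unzip(blob):
    children = []
    with zipfile.ZipFile(io.BytesIO(blob.data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Simplistic type detection by extension, for illustration only.
            mime = "text/plain" if info.filename.endswith(".txt") else "application/octet-stream"
            children.append(Blob(info.filename, mime, archive.read(info)))
    return Result(children=children)


def process(blob, index):
    """Run every matching extractor, then recurse into any files it produced."""
    for fn in REGISTRY.get(blob.mime_type, []):
        result = fn(blob)
        if result.text:
            index[blob.name] = result.text
        for child in result.children:
            process(child, index)
    return index


# Usage: build a small archive in memory and process it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("note.txt", "hello")
index = process(Blob("leak.zip", "application/zip", buffer.getvalue()), {})
print(index)
```

Keeping extractors behind a single small interface like this is what would make it possible to swap an in-process call for a container invocation without changing the dispatch logic.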
Finally, we will discuss the value of direct working relationships between developers and journalists. These led us to change how we develop our tooling, moving towards building a secure platform upon which other, more specialist tools can be written. We will show an example of this with “Laundrette”, a new tool that lets data journalists quickly add structure to hundreds of thousands of documents.