Marine Science IoT: Where Big Data meets Linked Data
Room B - Thu 15:15
Speakers
The Marine Institute is the State agency responsible for marine research, technology development and innovation in Ireland. We carry out environmental, fisheries, and aquaculture surveys and monitoring programmes to meet Ireland's national and international legal requirements.
Curating marine environmental data has traditionally meant countless hours re-formatting raw data outputs from field instruments into standard data formats. That practice is just about sustainable when data is collected in discrete batches. However, we needed a new model for converting these raw outputs when we deployed a sub-sea observatory in Galway Bay, tethered to the shore by fibre-optic cable and returning measurements including temperature, salinity, and current velocity every second.
We treat the incoming raw data from sources such as conductivity-temperature-depth sensors or fluorometers as application logs and pipe it through Logstash, using the Grok filter to parse each raw line into a structured, semi-standard JSON format.
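As a rough illustration of the parsing step, the sketch below uses a Python regular expression with named groups, which play much the same role as Grok's named captures. It is not our Logstash configuration: the raw line layout, the field names and the function parse_ctd_line are all assumptions made up for this example.

```python
import re

# Hypothetical raw line from a CTD sensor (layout assumed for illustration):
# timestamp, temperature, conductivity, pressure, separated by whitespace.
RAW_LINE = "2016-06-02T15:15:01Z 12.3456 3.98765 21.432"

NUMBER = r"-?\d+(?:\.\d+)?"

# Named groups stand in for Grok's %{PATTERN:field} captures.
CTD_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    rf"(?P<temperature>{NUMBER})\s+"
    rf"(?P<conductivity>{NUMBER})\s+"
    rf"(?P<pressure>{NUMBER})$"
)


def parse_ctd_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = CTD_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric captures from strings to floats.
    for key in ("temperature", "conductivity", "pressure"):
        record[key] = float(record[key])
    return record
```

Calling parse_ctd_line(RAW_LINE) returns a dictionary keyed by field name, which can be serialised to JSON as it stands. Lines that do not match return None, much as Grok tags events it fails to parse.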
Once the data is in this semi-standard JSON format, the Mutate filter can be applied to create a fully standard output. This output is pushed to an Apache Kafka message queue, and from there is stored in an Apache Cassandra database, from which the data is made available to the general public at http://erddap.marine.ie. In the geospatial data world, the main standards body is the Open Geospatial Consortium (OGC). One of the OGC standards, Observations & Measurements, is targeted at observations and has recently been given a JSON encoding, which is the format we target from the Logstash Mutate filter.
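The restructuring step can be sketched in the same spirit. The Python below reshapes a parsed record into one observation per parameter and serialises each as a message body ready for a queue. The field names are only loosely modelled on Observations & Measurements and have not been checked against the OGC JSON encoding; the UNITS table, the sensor identifier and both function names are hypothetical.

```python
import json

# Hypothetical mapping from local field names to unit symbols.
UNITS = {"temperature": "Cel", "conductivity": "S/m", "pressure": "dbar"}


def to_observations(record, sensor_id):
    """Split one parsed record into one observation dict per parameter."""
    observations = []
    for name, unit in UNITS.items():
        if name not in record:
            continue  # this instrument does not report the parameter
        observations.append({
            "type": "Measurement",
            "procedure": sensor_id,
            "observedProperty": name,
            "phenomenonTime": record["timestamp"],
            "result": {"value": record[name], "uom": unit},
        })
    return observations


def to_messages(record, sensor_id):
    """Serialise each observation as UTF-8 JSON bytes for a message queue."""
    return [
        json.dumps(obs, sort_keys=True).encode("utf-8")
        for obs in to_observations(record, sensor_id)
    ]
```

Each element returned by to_messages is a self-describing payload, so a consumer reading from the queue needs no knowledge of the original instrument format.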
One of the ideas that has been kicking around in the marine environmental data management community for the last few years is "Born Connected." This takes the Born Digital idea of data collected by computer rather than on a physical record and extends it into the Semantic Web and Linked Data worlds: data arrive from the instrument with web addresses for parameter definitions, unit definitions, and other documents built right into the data files. The Born Connected approach will allow for better integration of observed data directly into weather forecast models. It also allows users to more easily discover data relevant to their needs, and gives researchers the ability to automatically create reports from data. However, making this scale has been an issue, and the question of how to collect together the patterns used to give birth to the connected data hasn't been addressed.
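To make the Born Connected idea concrete, the Python sketch below attaches a definition link and a unit link to every measured value as it is written out. The vocabulary URLs are placeholders on example.org, not real vocabulary entries, and the VOCAB table and the connect function are invented for this illustration.

```python
# Placeholder links (not real vocabulary entries) for each parameter.
VOCAB = {
    "temperature": {
        "definition": "http://vocab.example.org/parameter/sea-water-temperature",
        "unit": "http://vocab.example.org/unit/degree-celsius",
    },
    "pressure": {
        "definition": "http://vocab.example.org/parameter/sea-water-pressure",
        "unit": "http://vocab.example.org/unit/decibar",
    },
}


def connect(record):
    """Return a copy of the record with links built into each measurement."""
    connected = {"timestamp": record["timestamp"], "measurements": []}
    for name, value in record.items():
        links = VOCAB.get(name)
        if links is None:
            continue  # skip the timestamp and any parameter with no links
        connected["measurements"].append({
            "parameter": name,
            "value": value,
            "definition": links["definition"],
            "unit": links["unit"],
        })
    return connected
```

A consumer of the output can follow the definition and unit links without consulting any separate documentation, which is the property that lets discovery and report generation be automated.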
By starting a GitHub repository for our Grok patterns [2], we make them reusable by any other data-collecting organisation that wants to parse its raw instrument data using Logstash. Similarly, GitHub pull requests will allow others to contribute to the register of Grok patterns, so that a comprehensive archive can be built up over time.
We still have some work to do in processing different data types in this way, including 2-D data (such as profiles of current velocity through the ocean) and binary raw data feeds, but we're looking forward to tackling those challenges.