GIOD Project Description (1997-2000)

The GIOD (Globally Interconnected Object Databases) joint project between Caltech, CERN and HP addressed the data storage and access problems posed by the next generation of particle collider experiments which are due to start at CERN in 2008.

The following information is archived material.

 

The data rates from the experiments' online systems will be of order 150 to 1500 MBytes/sec (each event's data is ~1 MByte), giving rise to a yearly accumulation of several PetaBytes. The raw data from the online systems will be reconstructed into particle tracks, energy clusters, etc. in near-real time by large processor farms based on commodity hardware. We expect that farms of ~10^7 MIPS will be required. The reconstructed data (around 100 kBytes per event) will be stored (perhaps with the raw data) in an ODBMS.
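The step from a data rate to a yearly volume can be checked with a short calculation. The sketch below is only illustrative: the figure of roughly 10^7 seconds of data taking per year is an assumption introduced here (it is not stated in the text), and the function names are hypothetical.

```python
# Back-of-envelope check of the yearly raw data volume.
# ASSUMPTION (not from the project text): about 1e7 seconds of
# data taking per calendar year.
LIVE_SECONDS_PER_YEAR = 1.0e7

MBYTE = 1.0e6   # bytes
PBYTE = 1.0e15  # bytes


def yearly_volume_pb(rate_mbytes_per_sec, live_seconds=LIVE_SECONDS_PER_YEAR):
    """Yearly accumulation in PetaBytes for a given online data rate."""
    return rate_mbytes_per_sec * MBYTE * live_seconds / PBYTE


def events_per_second(rate_mbytes_per_sec, event_size_mbytes=1.0):
    """Event rate implied by the data rate and the ~1 MByte event size."""
    return rate_mbytes_per_sec / event_size_mbytes


low_pb = yearly_volume_pb(150)    # lower end of the quoted rate range
high_pb = yearly_volume_pb(1500)  # upper end of the quoted rate range

print(f"{low_pb:.1f} to {high_pb:.1f} PetaBytes per year")
```

Under that assumption the quoted rates correspond to roughly 1.5 to 15 PetaBytes per year, consistent with "several PetaBytes".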

 

Object data from around 10^9 particle collisions will need to be made available each year to collaborating physicists. This will require replication of significant fractions of the ODBMS amongst "regional centres" (which serve outlying collaborating institutes), which are scattered across the globe.
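To give a feel for the volumes involved in replication, the following sketch combines the event count above with the reconstructed event size of around 100 kBytes. The replicated fraction used in the example is purely hypothetical, as is the function name; the project text does not specify what fraction a regional centre would hold.

```python
# Rough size of the reconstructed object data and of a partial replica.
TBYTE = 1.0e12  # bytes


def replica_volume_tb(n_events, bytes_per_event, fraction=1.0):
    """Volume in TeraBytes of a replica holding `fraction` of the events."""
    return n_events * bytes_per_event * fraction / TBYTE


EVENTS_PER_YEAR = 1.0e9        # "around 10^9 particle collisions"
RECO_BYTES_PER_EVENT = 100e3   # "around 100 kBytes per event"

full_tb = replica_volume_tb(EVENTS_PER_YEAR, RECO_BYTES_PER_EVENT)

# HYPOTHETICAL: a regional centre replicating 10% of the object data.
partial_tb = replica_volume_tb(EVENTS_PER_YEAR, RECO_BYTES_PER_EVENT, 0.10)

print(f"full: {full_tb:.0f} TB, 10% replica: {partial_tb:.0f} TB")
```

With these inputs the reconstructed data amount to about 100 TeraBytes per year, so even a modest replicated fraction means moving TeraBytes between centres.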

The project ended in 2000, having investigated the scalability of commercial ODBMS and models for organising the data to optimise access and analysis for the end-user physicist. Some serious challenges were identified in devising a system architecture that allows sufficient flexibility while at the same time preventing inadvertent abuse.

In the project we used several then "leading edge" hardware and software systems, namely the Caltech HP Exemplar (an SMP machine with 256 PA8000 CPUs, of some ~0.1 TIPS), the High Performance Storage System (HPSS) from IBM, the Objectivity/DB Object Database Management System, the Java 3D API from Sun Microsystems, the Versant ODBMS, and various high speed Local Area and Wide Area networks.

Overview

A data thunderstorm is gathering on the horizon with the next generation of particle physics experiments. The amount of data is overwhelming. Even though the prime data from the CERN CMS detector will be reduced by a factor of more than 10^7, it will still amount to over a PetaByte (10^15 bytes) of data per year accumulated for scientific analysis. The task of finding rare events resulting from the decays of massive new particles in a dominating background is even more formidable. Particle physicists have been at the vanguard of data-handling technology, beginning in the 1940s with eye scanning of bubble-chamber photographs and emulsions, through decades of electronic data acquisition systems employing real-time pattern recognition, filtering and formatting, and continuing on to the PetaByte archives generated by modern experiments. In the future, CMS and other experiments now being built to run at CERN’s Large Hadron Collider expect to accumulate of order 100 PetaBytes within the next decade.

The scientific goals and discovery potential of the experiments will only be realized if efficient worldwide access to the data is made possible. Particle physicists are thus engaged in large national and international projects that address this massive data challenge, with special emphasis on distributed data access. There is an acute awareness that the ability to analyze data has not kept up with its increased flow. The traditional approach of extracting data subsets across the Internet, storing them locally, and processing them with home-brewed tools has reached its limits. Something drastically different is required. Indeed, without new modes of data access and of remote collaboration we will not be able to effectively “mine” the intellectual resources represented in our distributed collaborations. Thus the projects we are working on explore and implement new ideas in this area that until now have only been discussed in a theoretical context. These ground-breaking projects include:

To be as realistic as possible, the projects make use of large existing data sets from high energy and nuclear physics experiments. They will help to answer some important questions that include:

  • How are we going to integrate the querying algorithms and other tools to speed up access to the distributed data?
  • How are we going to cluster the data optimally for fast access?
  • How can we optimize the clustering and querying of data distributed across continents?
  • What dynamical re-clustering strategies should be used?
  • How do we compromise between fully ordered (sequential) organization, and totally “anarchic”, random arrangements of the data?

The use of OO languages and object persistency is fundamental in our current thinking: these technologies allow us to define, implement and store the physics objects and inter-relationships that we deal with. We can then express highly complicated queries against the object store in order to extract the events and features of interest.
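As a minimal illustration of this idea, the sketch below defines a few physics objects with their relationships and runs a query over them. It is not the Objectivity/DB or Versant API: the class names, attributes, cut values and the in-memory list standing in for a persistent object store are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Track:
    pt: float      # transverse momentum in GeV (illustrative attribute)
    charge: int    # +1 or -1


@dataclass
class Cluster:
    energy: float  # energy in GeV (illustrative attribute)


@dataclass
class Event:
    event_id: int
    # Relationships: an event owns its tracks and energy clusters.
    tracks: List[Track] = field(default_factory=list)
    clusters: List[Cluster] = field(default_factory=list)


def select(store: Iterable[Event],
           predicate: Callable[[Event], bool]) -> List[Event]:
    """Return the events in the store that satisfy the predicate."""
    return [ev for ev in store if predicate(ev)]


def has_opposite_sign_pair(ev: Event, min_pt: float = 20.0) -> bool:
    """True if the event has a positive and a negative track above min_pt."""
    pos = any(t.charge > 0 and t.pt >= min_pt for t in ev.tracks)
    neg = any(t.charge < 0 and t.pt >= min_pt for t in ev.tracks)
    return pos and neg


# A plain list stands in for the persistent object store (assumption).
store = [
    Event(1, [Track(35.0, +1), Track(28.0, -1)], [Cluster(40.0)]),
    Event(2, [Track(5.0, +1)], [Cluster(3.0)]),
    Event(3, [Track(50.0, -1), Track(22.0, -1)], []),
]

selected = select(store, has_opposite_sign_pair)
print([ev.event_id for ev in selected])  # prints [1]
```

In a real ODBMS the query would navigate persistent objects on disk rather than a Python list, but the shape is the same: the selection is expressed in terms of the physics objects and their relationships rather than in terms of files and record offsets.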

These research directions will very likely be taken up in other branches of science, and in large corporations: the ability to rapidly mine scientific data, and the use of smart query engines will be a fundamental part of daily research and education in the 21st century.

Other material describing GIOD