ModNet: The number of nodes at each institute must be specified, along with which parameter is held fixed: network bandwidth or number of CPUs. A node can be configured to simulate the Exemplar, with high bandwidth among its 256 CPUs.
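A minimal sketch of such a simulation specification, purely as an illustration; the field names, site names, and numbers below are assumptions, not ModNet's actual input format:

    # Hypothetical ModNet-style specification; names and values are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class SiteSpec:
        name: str
        nodes: int               # number of nodes at this institute
        cpus_per_node: int
        bandwidth_mbps: float    # link bandwidth for this site

    fixed_parameter = "bandwidth"    # the study holds either bandwidth or CPU count fixed

    sites = [
        SiteSpec("institute_A", nodes=8, cpus_per_node=2, bandwidth_mbps=155.0),
        # A single node standing in for the Exemplar: 256 CPUs with high
        # internal bandwidth.
        SiteSpec("exemplar", nodes=1, cpus_per_node=256, bandwidth_mbps=10000.0),
    ]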
NILE: Analysis tasks for the next generation. Tasks need to "check in" with a snapshot of their current state. One task is to actually build the executable that will run. Sub-jobs run in a 20-minute time slice. Jobs can be scheduled on an idle machine that might not have any of the data local.
The federation consists of a metadata database and 5000 databases containing the data, with 100 million objects; a collection of databases corresponds to a "cluster". The databases are distributed across 20 disks on 6 nodes. Each database has a container that encapsulates the legacy event records; another contains metadata on each event record. Each metadata tag is about 250 bytes per event, and the event size is 5 kBytes (cf. 100 bytes for 100 kBytes). The event data comprise the number of tracks, thrust, total visible energy, etc.: it is reconstructed data. There is a bit array, used like a mask, with which the Tau group flags features. Each database is 20 MBytes. Each also has run objects, with a one-to-many association to the tag data for the events in that run; there are many run objects in the database. Quantities such as the beam energy and magnetic field are also stored.
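A minimal sketch of the per-database layout just described; the class and attribute names are illustrative assumptions rather than the actual NILE/Objectivity schema:

    # Illustrative data model only; names are assumptions, not the real schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class EventTag:                 # ~250 bytes of metadata per event
        event_number: int
        n_tracks: int
        thrust: float
        visible_energy: float
        feature_mask: int           # bit array used e.g. by the Tau group

    @dataclass
    class EventRecord:              # encapsulated legacy record, ~5 kBytes compressed
        event_number: int
        payload: bytes

    @dataclass
    class RunInfo:                  # one run -> many event tags
        run_number: int
        beam_energy: float
        magnetic_field: float
        tags: List[EventTag] = field(default_factory=list)

    @dataclass
    class Database:                 # one ~20-MByte Objectivity database
        events: List[EventRecord] = field(default_factory=list)  # legacy-record container
        runs: List[RunInfo] = field(default_factory=list)        # run objects + tag container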
A "FileAtom" is an
encapsulation of a database. It gives the location, physics info like the
luminosity, first run number last number etc.. FileAtoms are meta-data in the
database. Within the atom are data-items, which point to individual databases.
There are 5000 databases, with potentially 5000 locks: there is a lockserver
running on one of the nodes, but there has been no problem with locking.
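A sketch of what a FileAtom and its data-items might carry, with field names assumed for illustration:

    # Illustrative FileAtom layout; field names are assumptions.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DataItem:
        database_id: int            # points to an individual database
        host: str                   # where that database lives
        path: str

    @dataclass
    class FileAtom:
        location: str               # location of the encapsulated database
        luminosity: float
        first_run: int
        last_run: int
        items: List[DataItem]       # data-items pointing to databases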
The Cluster object contains a set of DataObjects, with member functions such as "copyto" and "moveto". These are management tools and are in fact rarely used; when they are used, they take care of moving databases to the best places in the system.
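A minimal sketch of such management operations, assuming they simply copy or relocate a database between hosts; the real implementations are not described here:

    # Sketch only; not the actual Cluster/DataObject implementation.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataObject:
        db_id: int
        host: str

        def copyto(self, target_host: str) -> None:
            # Placeholder for replicating the ~20-MByte database file.
            print(f"copy database {self.db_id} from {self.host} to {target_host}")

        def moveto(self, target_host: str) -> None:
            self.copyto(target_host)
            self.host = target_host       # record the new location

    @dataclass
    class Cluster:
        data_objects: List[DataObject] = field(default_factory=list)

    cluster = Cluster([DataObject(7, "node2")])
    cluster.data_objects[0].moveto("node5")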
A query has a form that is resolved into a collection of FileAtoms.
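A sketch of that resolution step, assuming for illustration that a query is just a run range and that FileAtoms carry first and last run numbers:

    # Sketch: resolve a run-range query to the FileAtoms whose run ranges overlap it.
    # The query form (a run range) is an assumption for illustration.
    from typing import List, NamedTuple

    class FileAtom(NamedTuple):
        first_run: int
        last_run: int
        location: str

    def resolve(first: int, last: int, atoms: List[FileAtom]) -> List[FileAtom]:
        return [a for a in atoms if a.first_run <= last and a.last_run >= first]

    atoms = [FileAtom(1000, 1099, "node1:/disk3/db0042"),
             FileAtom(1100, 1199, "node4:/disk7/db0043")]
    print(resolve(1050, 1120, atoms))   # both atoms overlap the requested runs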
It takes three days to populate the databases with the legacy data, at 1 MByte/second. Indexes are created in a second pass, which is very slow; the indexes are on run number and event number. The next version will build indexes on physics quantities and allow user selections on, e.g., energy level.
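A sketch of the two kinds of index: the existing run/event-number index built in the second pass, and the planned index on a physics quantity. The dictionary-based indexes below are illustrative, not Objectivity's indexing mechanism:

    # Illustration of a second-pass index build over the tag data.
    from collections import defaultdict

    tags = [  # (run, event, visible_energy) drawn from the tag container
        (1000, 1, 9.8), (1000, 2, 10.2), (1001, 1, 10.4),
    ]

    # Current indexes: run number and event number.
    by_run_event = {(run, evt): i for i, (run, evt, _) in enumerate(tags)}

    # Planned: an index on a physics quantity, allowing user selections such as
    # "all events with visible energy above 10 GeV".
    by_energy = defaultdict(list)
    for i, (_, _, energy) in enumerate(tags):
        by_energy[round(energy)].append(i)

    selected = [i for i, (_, _, e) in enumerate(tags) if e > 10.0]
    print(selected)                       # indexes of the events passing the cut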
The Objy page size is set to the maximum, 65 kBytes, which gives about a 12% overhead. There are a couple of scalability problems. One is with the data location manager, a replicated object that can sit on more than one node; the replicas all share the same information, which is reported to a master that ensures consistency. This is a bottleneck, partly due to the overhead of the communication system, ISIS. The other scalability problem is with event selection: random versus sequential event access. Despite this, the system has never been seen to bog down; the maximum number of concurrent users is about 6. The scheduling algorithm is primitive: we suggest looking at LSF.
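A rough sketch of the replicated location-manager pattern described above; ISIS messaging is replaced by plain method calls, and the structure is an assumption:

    # Rough sketch: replicas forward updates through one consistency master.
    from typing import Dict

    class Master:
        def __init__(self) -> None:
            self.locations: Dict[int, str] = {}      # database id -> node

        def report(self, db_id: int, node: str) -> Dict[int, str]:
            self.locations[db_id] = node             # master serializes all updates
            return dict(self.locations)              # consistent view sent back

    class LocationManagerReplica:
        def __init__(self, master: Master) -> None:
            self.master = master
            self.view: Dict[int, str] = {}

        def record(self, db_id: int, node: str) -> None:
            # Every replica reports through the one master: the bottleneck noted above.
            self.view = self.master.report(db_id, node)

    master = Master()
    replicas = [LocationManagerReplica(master) for _ in range(3)]
    replicas[0].record(42, "node3")
    print(replicas[0].view)                          # {42: 'node3'}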
The Resource Reporter collects statistics on the loads on the systems. This information is available to the Site Manager, whose job is to allocate jobs to the nodes. Jobs are allocated via proxy processes, one per node, which advertise what resources (e.g. databases, files) are available on their node.
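A minimal sketch of this allocation step, assuming the Site Manager simply picks the least-loaded node that advertises the required database (the actual policy, noted above as primitive, is not specified):

    # Sketch of Site Manager placement; the policy shown is an assumption.
    from dataclasses import dataclass
    from typing import List, Optional, Set

    @dataclass
    class NodeProxy:
        name: str
        load: float                  # load statistics from the Resource Reporter
        databases: Set[int]          # resources advertised by this node's proxy

    def allocate(job_db: int, proxies: List[NodeProxy]) -> Optional[NodeProxy]:
        # Prefer nodes advertising the needed database; otherwise fall back to
        # any node, since a job may run where none of its data is local.
        with_data = [p for p in proxies if job_db in p.databases]
        candidates = with_data or proxies
        return min(candidates, key=lambda p: p.load, default=None)

    proxies = [NodeProxy("node1", 0.9, {1, 2}), NodeProxy("node2", 0.1, {3})]
    print(allocate(2, proxies).name)                 # node1 is the only node holding db 2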
The tag can be scanned at a maximum of 20000 events per second; the limit at this speed is the Objectivity overhead.
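At about 250 bytes per tag this is roughly 5 MBytes/second of tag data, and a full scan of the tags for all 100 million events would take about 100,000,000 / 20,000 = 5000 seconds, i.e. something like an hour and a half.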
Each event record has 492 attributes. The 5 kBytes is the compressed record size; the actual size is much larger. Each analysis job uses only about 50 attributes.
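That is only about 50 / 492, roughly 10%, of the attributes per job, which is presumably part of the motivation for keeping the compact 250-byte tag separate from the full event record.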