Ideas on Data Analysis methods

Data analysis tasks are characterised by a large CPU time per event.
This implies a small data flow rate (most of the time is spent calculating rather than doing I/O).
We therefore pre-stage and filter the required data first.
- The staging and filtering is part of re-clustering the events, for the needs of either an individual user or an analysis group, and is done in a separate process so as to:
  - free up any central tape or disk drive as soon as possible
  - minimize the contention with other users that comes with central access to data
  - start moving "small" rather than "large" datasets around as soon as possible
- The data to be re-clustered and moved can be determined from a single job for the user or analysis group.
When processing, we want access to the data to be local in time as well as in space. This means reducing the probability of having to open a container and fill a cache, only to have the cache flushed before many of the objects in the container are de-referenced.
- Thus, when the analysis needs to reference raw data or lower-level hits (perhaps because of an unusual condition or analysis requirement), it sets a flag (or stores a tag) that defers access to the required container until later, when sufficient other analysis tasks require access to that container.
- We avoid handling the unusual condition in quasi-real time during the fast analysis on smaller, or more easily reachable, containers.
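
The deferral idea can be illustrated with a minimal sketch. Everything named here (the DeferredAccessQueue class, its threshold parameter, the container identifiers) is hypothetical and chosen only for illustration; the notes do not specify how the flags or tags would actually be stored.

```python
from collections import defaultdict


class DeferredAccessQueue:
    """Hypothetical helper: collects tags for expensive containers and
    releases them only once enough requests have accumulated."""

    def __init__(self, threshold):
        # Number of pending requests that justifies opening a container
        # (an assumed policy; the notes only say "sufficient").
        self.threshold = threshold
        # container id -> list of (event id, reason) tags
        self.pending = defaultdict(list)

    def flag(self, container_id, event_id, reason):
        """Store a tag instead of opening the container immediately."""
        self.pending[container_id].append((event_id, reason))

    def ready(self):
        """Containers with enough pending requests to be worth opening."""
        return [c for c, tags in self.pending.items()
                if len(tags) >= self.threshold]

    def drain(self, container_id):
        """Return all tags for one container so they are served in one pass."""
        return self.pending.pop(container_id, [])


queue = DeferredAccessQueue(threshold=3)
for event_id in (101, 102, 103):
    queue.flag("raw_hits_A", event_id, "unusual condition")
queue.flag("raw_hits_B", 104, "unusual condition")

ready_containers = queue.ready()   # only raw_hits_A has reached the threshold
batch = queue.drain("raw_hits_A")  # all three tags are handled together
```

Batching the tags this way means a slow container is opened once for many events, so its cache is filled and used before it is flushed.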
We move the events into 300 MBytes of local memory on the processor on which they will be analysed
- Moving 300 MBytes into local memory takes only about 3000 seconds at 1 Mbps (2400 seconds at 8 bits per byte, before any protocol overhead).
- For a "service" for 100 users, some 100 Mbps should be sufficient.
- Moving the data to the user in this way takes advantage of many desktops for processing.
- Similar considerations apply to data movement over a LAN at, say, 10 Mbps per user (5 minutes to get 30k events); such a service needs 1 Gbps for 100 users.
- This avoids contention with other users for the same data
- Users could get many "chunks" of data per day.
- We can fit 30000 events of size 10 kBytes into the memory
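
The transfer figures above can be checked with a short calculation. This is a back-of-envelope sketch: the function names are illustrative, and the figure of 10 bits on the wire per byte of payload is an assumption (not stated in the notes) that allows for protocol overhead and happens to reproduce the quoted 3000 seconds.

```python
# ASSUMPTION: 10 bits transmitted per payload byte (8 data bits plus overhead).
BITS_PER_BYTE_ON_WIRE = 10


def transfer_seconds(megabytes, link_mbps, bits_per_byte=BITS_PER_BYTE_ON_WIRE):
    """Seconds needed to move `megabytes` over a link of `link_mbps` megabits/s."""
    return megabytes * bits_per_byte / link_mbps


def events_in_memory(memory_mbytes, event_kbytes):
    """Whole events that fit in memory, taking 1 MByte as 1000 kBytes."""
    return memory_mbytes * 1000 // event_kbytes


def service_mbps(users, per_user_mbps):
    """Aggregate bandwidth if every user transfers at the same time."""
    return users * per_user_mbps


wan_seconds = transfer_seconds(300, 1)    # 1 Mbps per user
lan_seconds = transfer_seconds(300, 10)   # 10 Mbps per user on a LAN
n_events = events_in_memory(300, 10)      # 10 kByte events in 300 MBytes
wan_service = service_mbps(100, 1)        # 100 users at 1 Mbps
lan_service = service_mbps(100, 10)       # 100 users at 10 Mbps

print(wan_seconds, lan_seconds, n_events, wan_service, lan_service)
```

With 8 bits per byte instead, the same function gives 2400 seconds at 1 Mbps, so the quoted numbers are best read as round estimates.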
We then start the analysis task on the processor, assumed to be 2000 MIPS.
- We assume the time to analyse the event is proportional to the time taken to reconstruct it: 20000 MIPS-seconds per raw 1 MByte event.
- So, scaling by event size, we need 200 MIPS-seconds for each of our 10 kByte events
- The analysis runs at a rate of 10 events per second
- The analysis of the 30000 events is complete in 3000 seconds
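
The CPU estimate follows the same pattern and can be reproduced as below. This is a sketch under the assumptions stated in the notes (cost scaling linearly with event size, a 2000 MIPS processor); the function and variable names are illustrative only.

```python
def mips_seconds_per_event(event_kbytes, raw_cost=20000, raw_event_kbytes=1000):
    """CPU cost of one event, scaled linearly from the raw-event cost.

    raw_cost is the quoted 20000 MIPS-seconds for a raw event of
    1 MByte (taken here as 1000 kBytes).
    """
    return raw_cost * event_kbytes / raw_event_kbytes


processor_mips = 2000   # assumed processor power, from the notes
n_events = 30000        # events held in the 300 MBytes of local memory

cost = mips_seconds_per_event(10)   # MIPS-seconds per 10 kByte event
rate = processor_mips / cost        # events analysed per second
total_seconds = n_events / rate     # time to analyse the whole chunk

print(cost, rate, total_seconds)
```

The resulting analysis time matches the roughly 3000 seconds needed to deliver the same chunk at 1 Mbps, so under these assumptions a processor can be kept busy if the next chunk is transferred while the current one is analysed.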