ITR/ACS Distributed Symbolic
for Knowledge Discovery in Scientific Databases
Roy Williams (PI), Julian Bunn, Thomas Sterling
California Institute of Technology
We present a proposal for the design, prototyping and construction of a Lisp-based scientific inference engine using symbolic computation methods, and targeted at mining large datasets from astronomy and particle physics. The concept includes computing hardware and software that can be used as a scientist-friendly system for inferring previously unknown facts from multiple, independent analyses of the data. The work will be applied to scientific computation, automated data analysis, information abstraction, and direct manipulation of symbolic knowledge. The objectives of the proposed two year research project, involving collaboration with experts from industry, academia, and the National labs, are 1) to develop a distributed variant of Common Lisp for PC clusters deriving the parallelism directly from the high level semantics of this mature symbolic language, and 2) to employ this system for knowledge extraction in large databases from astronomy and particle physics.
In the current paradigm of scientific data analysis, computers work with real numbers and produce symbolic knowledge -- clusters, relationships, patterns, correlations, hierarchies, etc. -- which are then represented visually so that humans can extract knowledge. This latter process (scientific visualization) has proven exceptionally valuable by enabling scientists to recognize high level abstraction visually from raw low level data. Here, we propose a counterpart that enables computers themselves to organize, analyze, and manipulate the higher order abstractions. We are also motivated by the very practical observation that emerging low cost PC cluster technologies (e.g. Beowulf) can allow massive amounts of DRAM memory to be aggregated at modest price for symbolic computation.
Common Lisp incorporates semantics that are readily exploited for their intrinsic parallel properties including explicit functional constructs, object oriented structures, and mapping functions. These can be augmented by the Futures construct, a powerful synchronization primitive employed in parallel Lisps. A set of parallel operators built on top of existing Common Lisp using MPI as an underlying transport layer will be prototyped on a Beowulf to demonstrate feasibility and as a platform for experiments with scientific applications. A product of this research will be a distributed Cluster Common Lisp released as open source to the scientific computing community.
Two knowledge discovery areas will be studied: the general area of data mining and a specific example in particle physics. We propose to make a general approach to extracting knowledge from multiple independent clustering analyses of a scientific dataset. Each clustering algorithm sees the data from a different point of view and by combining the results of many, viewpoint independent knowledge may be extracted. We will leverage ideas from the business community, mining association rules from these clusters with algorithms from market basket analysis. In this way, we will derive the “deep clusters” to which most clustering algorithms are sensitive, the “useful attributes” in the database that represent independent information, and the “interesting outliers” that do not fit any classification scheme. The database we use will be from a large-area astronomical sky survey.
High energy particle physics involves a sequence of mappings from raw instrument data to a set of intermediate objects like tracks and energy clusters, and then a further mapping to hypotheses about particles and their properties, finally leading to identification and analysis of phenomena such as jets, transverse energy, missing energy, Higgs events, and missing events.
Symbolic attacks on these two areas will be formulated and implemented on the distributed Common Lisp prototype environment to derive experience and an understanding of the potential role of high-performance symbolic processing for scientific computation.
We present a proposal for the design, prototyping and construction of a Lisp-based scientific inference engine using symbolic computation methods, and targeted at mining large datasets from astronomy and particle physics. The concept includes computing hardware and software that can be used as a scientist-friendly system for inferring previously unknown facts from multiple, independent analyses of the data. The work will be applied to scientific computation, automated data analysis, information abstraction, and direct manipulation of symbolic knowledge. The key elements of the research include the implementation of a distributed Common Lisp engine on a powerful PC cluster; and the application of novel symbolic processing for knowledge discovery in astronomy and high-energy physics.
In the current paradigm of scientific data analysis, computers work with real numbers and produce symbolic knowledge -- clusters, relationships, patterns, correlations, hierarchies, etc.-- which are then represented visually so that humans can extract knowledge. This process (scientific visualization) has proven exceptionally valuable by enabling scientists to recognize high level abstraction visually from raw low level data. Here, we propose a counterpart that enables computers themselves to organize, analyze, and manipulate the higher order abstractions. We are also motivated by the very practical observation that emerging low cost PC cluster technologies (e.g. Beowulf) can allow massive amounts of DRAM memory to be aggregated at modest price for symbolic computation.
Scientific databases are growing rapidly in many ways, the most obvious of which is raw size, and it is this that creates a demand for high-performance processing and storage. But databases are also growing in heterogeneity: there are new types of data, data created as fusion of existing data, data derived from other data, simulated data. As scientists replace files with online databases, more types of metadata are available also. One of the challenges facing the IT community is to create ways of harvesting scientific knowledge and unifying theory.
We would like to think of data-mining in two stages: numerical and symbolic. First the raw data is analyzed with pattern-matching, clustering, transformations, model-fitting, and other kinds of numerically-intensive processing. The result is a set of qualitative, or symbolic, data objects and relationships: "This entity belongs to this class", "This entity looks like this other entity". At this point, the analysis mechanisms are usually informal, moving from observation to knowledge: "It seems like every one of these is also one of those", "Each time I run the pattern matcher, the matches come out different".
We believe that the time is ripe to bring formal systematics to this analysis of symbolic information. We propose to do data-mining not with the data itself, but with the symbolic results that have been created from the data. We propose to do this by leveraging powerful technology from elsewhere at each stage:
· We will use cheap household workstations combined into a high-performance parallel computer.
· We will leverage a powerful, yet robust, software architecture -- Lisp -- proved through many years of funding during the 1970's and 80's.
· We will leverage ideas from the commercial world by mining clusters for association rules.
· We will leverage sophisticated symbol-extraction algorithms that have been developed for complex events in particle-physics experiments.
Common Lisp incorporates semantics that are readily exploited for their intrinsic parallel properties including explicit functional constructs, object oriented structures, and mapping functions. These can be augmented by the Futures construct, a powerful synchronization primitive employed in parallel Lisps (e.g. Hewitt, Halstead, Zippel). A set of parallel operators built on top of existing Common Lisp using MPI as an underlying transport layer will be prototyped on a Beowulf cluster to demonstrate feasibility and as a platform for experiments with scientific applications. A product of this research will be a distributed Common Lisp released as open source to the scientific computing community.
Two knowledge discovery areas will be studied: astronomy and particle physics. Symbolic attacks on these two areas will be formulated and implemented on the distributed Common Lisp prototype environment to derive experience and an understanding of the potential role of high-performance symbolic processing for scientific computation.
We propose to derive a general approach to extracting knowledge from multiple independent clustering analyses of a scientific dataset. Each clustering algorithm sees the data from a different point of view and by combining the results of many, viewpoint independent knowledge may be extracted. We will leverage ideas from the business community, mining association rules from these clusters with algorithms from market basket analysis.
The primary application area of the association rule technology will be astronomy, specifically the analysis of catalogs from all-sky surveys. Such surveys are proliferating, not as images but as catalogs of celestial sources, each with attributes such as magnitudes in different wavebands, measures of the spread and ellipticity of the source, and associations of the source to the same source mapped in other surveys. The development of tools for mining this vast lode of astronomical data are far behind the recording technology.
High energy particle physics involves a sequence of mappings from raw instrument data to a set of intermediate objects like tracks and energy clusters, and then a further identification mapping to event particles (e.g. photons, electrons, hadrons) and properties (e.g. missing energy, jets), and then a final interpretation of the event type (e.g. a Higgs decay) and subsequent determination of the physics underlying the processes.
While the ITR program description demands a single primary focus for the proposal (we chose Advanced Computational Science), we believe our proposal to be relevant to almost the whole breadth of the program.
We believe that Lisp will work well from both the programmer and the machine perspective. Lisp is a mature, stable language, which in spite of its age has most of the features that coerce programmers into good software, for example objects and automatic garbage collection. Lisp also exposes parallelism sufficiently to enable good speedup even on large parallel machines.
· Human-Computer Interface
The user-interface to our system will provide computer support for decisions and ways of codifying and confirming insight and knowledge. We will provide a new means to understand the high level abstractions on top of the raw data, a fusion of numeric, visual, and symbolic representations. In another sense, we will provide the Lisp language as a scripting language – new to the current generation of scientists – and potentially a fruitful interaction paradigm between a GUI and a computing service.
· Information Management
We propose new ways of finding and using the information that comes from clustering or other analysis programs. This will use the symbolic data that represent scientific classifications, that has been traditionally only encoded in the brains of expert scientists.
· Advanced Computational Science
Our proposal involves many new features in computational science. The use of Lisp for high-performance parallel computing is unusual, but in the context of this work it is natural; it replaces message-passing with automatic, hierarchical parallelism. We are breaking new ground in scientific data analysis by bringing in new methodologies and algorithms. Our research is directly applicable to the sciences (astronomy, particle physics), where it will be applied. We expect that it will prove valuable for other areas of science and technology as well.
· Scalable Information Infrastructure
The machine that we propose will be implemented as a tree-structure of homogeneous processors. If this research is successful, there is no reason not to extend the architecture to a large distributed network of symbolic processors.
· Revolutionary Computing
The software that we propose is based to a large extent on parallel associative arrays; such structures fit naturally onto such post-CMOS technologies as holographic storage.
Accumulation rates for scientific data are increasing rapidly. The data themselves are increasingly heterogeneous. This is well known, but we advance a further tenet: that the data analysis methods are also growing and diversifying.
We propose to derive a general approach to extracting knowledge or inferring previously unknown facts from multiple, independent analyses of a scientific dataset. This is similar in spirit to ensemble learning, where models are combined by a voting strategy. However, we propose to leverage the large body of existing business algorithms, which will result in much more valuable results than would a simple voting strategy. These business algorithms and associated methods allow the derivation of association rules by inspection of the data objects (we are particularly struck by the success of analysis of “shopping basket” association methodologies for supermarket or Web/online stores).
Both a scientific dataset and a supermarket consist of a set of entities (e.g. rows from a relational database table) each with certain numerical or categorical attributes. In the case of the scientific data, we derive knowledge by observing classifications of the entities in subsets that make sense. Examples of important subsets are: outliers from the main groups, tight clusters that are well-separated, clusters that remain clusters in different projections of the data, clusters that are grouped around lines or planes of a hyperspace, clusters that a human has picked out, and many other types of subset. Statisticians, epidemiologists, and other scientific disciplines have produced many algorithms for detecting such significant subsets of a dataset, and such practitioners have been unified into a new field of data-mining.
Clustering algorithms include K-means, simulated annealing, decision trees and hierarchical clustering, neural nets, Bayesian clustering, model-based clustering, dynamic models, minimum covariance determinant, and many more. Often these algorithms require inputs that specify a model of the dataset: how many clusters are to be found, discrimination parameters, a temperature schedule for simulated annealing, scaling models, and so on.
Our contention is that there is an analogy between clusters of science objects and clusters of items that customers purchase at a supermarket -- the so-called market baskets. A simple inspection of the items contained in a shopping basket gives no clue as to the motivation of the customer in selecting those objects. Only by an exhaustive analysis of the shopping baskets of many customers, and the same customer over several shopping trips, can inferences be made about the motivation for certain purchases. For example, mining a large amount of shopping basket data may show that hot dogs and hot dog rolls frequently appear in the same basket, and that, if they do, mustard is very likely to also appear. Is this knowledge useful? (Yes, because it tells us what a frequent motivation for buying mustard is, and therefore how as mustard manufacturers we might want to pitch our advertising). As an example from particle physics, by looking at many events which contain a pair of high transverse momentum photons, we observe that those events also frequently contain a large amount of missing energy. We can then infer that a neutrino is often present in prompt di-photon events. This inference is of great value to theorists, as it may suggest the occurrence of a hitherto unforeseen physics process. We thus intend to use the same data-mining approach: to generate sets of items and the association rules that connect them -- if a cluster contains these entities, then it probably also contains these. As with the market baskets, we expect to pare down the set of such rules by demanding high confidence and support and that there is not a simpler rule with the same content.
Each clustering algorithm sees the data from a different point of view: by combining the results of many, we hope to arrive at knowledge that is independent of point of view.
Data analysis in current and future generations of particle physics experiments presents a major computational challenge, due to the enormous quantity of event data collected or planned for collection. Traditionally, and for the foreseeable future, physics analysis of the data is achieved using software that operates by iterating over all available events and selecting those that match an event topology of interest. This selection is achieved by the application of “cuts” which are intended to maximize the signal to background ratio in candidate events that remain. As an example, we show a set of cuts in Table 1, taken from analysis work on the SLAC “BaBar” experiment.
5.20 < Mll < 5.35 GeV/c2
18.104.22.168 Ntrack > 4
c2vertex < 4
|cosqthrust| < 0.7
R2 < 0.5
5.275 < M < 5.287 GeV/c2
-0.038 < DE < 0.050 GeV/c2
Total Signal Efficiency
Total Background Efficiency
Table 1: Showing a set of "cuts" used for selection of B events at the BaBar experiment
The cuts are expressed as a set of conditions on symbolic information available in each event. For example, the cut “Ntrack > 4” imposes the condition that there are more than four particle tracks in the event. The cut “5.20 < Mll < 5.35 GeV/c2” selects events where the invariant mass of the di-lepton pair is in the specified range. The symbolic information is obtained by “reconstruction” from the raw event data: the binary data that comes from the detector instrument itself. Tracks are thus sets of space points that fit the hypothesis of a particle trajectory through the detector. The space points are derived from digital signals in a tracking detector together with positional calibration data for that detector. The Invariant mass of a particle system is a mathematical quantity derived from the particle momenta.
After all the cuts have been applied, the remaining set of events are used as a collection for detailed study, usually involving quite complex mathematical treatment. In some cases (especially for rare or anomalous events) each event in the collection may be examined using an “event viewer” by a trained physicist. The results and conclusions from these detailed studies are usually presented in papers submitted to scientific journals.
We have explained the steps by which events from particle physics experiments are selected and analyzed using traditional computational methods. These methods tend to be very CPU and I/O intensive, with total event samples of 10,000,000,000 (10 billion) per annum expected at the next generation of experiments, and aggregate CPU requirements of >100 SpecInt95-seconds per event for the analysis.
Our contention in this proposal is that an alternative, and complementary, analysis approach is of great potential benefit. By loading our parallel Lisp machine with the symbolic information from a large number of events, we can replicate the analysis tasks (and results) obtained in the traditional system but with significant speedup due to the inherent parallelism of the system, and the embarrassingly parallel nature of the events themselves. Moreover, with the symbolic event information loaded in memory in the parallel Lisp machine, categorizations, correlations, fact inferences and knowledge extraction all become in principle straightforward.
Initially, we will work with simulated data: Monte Carlo event data that has been generated using a known and understood physics model. The simulated data are pre-processed to account for the detector acceptance and response, and massaged into a form that emulates in all detail the data obtained from the real detector. By using simulated data, we have a priori knowledge of the physics (no surprises) and a good handle on what the possible inferences could be from the data presented to the system. Once there is confidence with how this data sample is treated by the system, real data will be used.
We thus define the following deliverables:
· A procedure for loading higher level (“reconstructed”) information from a significant number of simulated events into the Parallel Lisp Machine.
· An analysis task that selects a subset of the simulated events according to a set of cuts, and computes a quantity of interest that can be compared to the traditional analysis result.
· Development of an inference task that attempts to infer correlations between the selected subset of events, correlations that could be used against the whole event sample as a faster means of selecting the subset.
Symbolic computing can be distinguished from numeric computing and data oriented computing in that the values manipulated are often those representing higher level abstractions or relationships, rather than basic numbers or string fields. The vast majority of scientific and commercial applications are of the first two classes and are easily represented in Fortran, C, C++, and data base languages (e.g. SQL, Oracle). However, a small but growing realm of computing requiring a higher order of "intelligence" is emerging that will rely on the availability of symbolic processing. Even in the fields of science and commercial enterprise, symbolic computing will open new opportunities for extracting knowledge from data, building object abstractions, and establishing relationships among abstractions which then may be directly manipulated symbolically. Crafting symbolic algorithms, while possible in such languages as C and Fortran, is most readily facilitated through the application of languages devised to directly represent symbolic objects and tasks such as Prolog or Lisp. The proposed research considers the potential application of a variant of Lisp for symbolic computing as a means of information extraction and knowledge representation for scientific data.
Common Lisp represents the culmination of more than two decades of experience in symbolic computing semantics and the merger of a number of different dialects such as Zeta-Lisp, Maclisp, and Scheme. Problems in artificial intelligence, expert systems, theorem proving, symbolic algebra, natural language processing, game theory, cognitive science, pattern recognition and image processing, fuzzy logic, and robotics have all been conducted using Lisp, although other languages have been used as well for each of these such as Prolog, OPS-5, Maple, and Mathematica. As a useful tool for capturing the requirements of these classes of problems, Common Lisp incorporates a wide range of high level semantics including data abstraction, first class processes, associative data and data properties, objects and method classes, and automatic memory management and garbage collection. It is particularly well suited for representing information classification and interrelationships. Its advanced memory management features enables rapid prototyping of complex functional systems.
Early work in such parallel symbolic computing systems as Multilisp on shared memory multiprocessors (e.g. MIT/Harris Concert, BBN TC-2000) exposed the importance of incorporating parallel language constructs suitable for symbolic computation. Multilisp employed a Scheme-like semantic structure and explored the power and utility of the Futures construct for managing system and application concurrency. The Futures construct (originally devised for the Actor model by Hewitt et al) is important because it distinguishes between operations that used the actual data values of a structure (e.g. leaves of a tree structure) and the meta-data that defines the organization of the data in the structure. If the actual value is not required, but only the pointer information linking it to other data, then a task sequence can continue even as a concurrent computation produces the end value. This form of parallelism is extremely powerful, especially in symbolic class computations involving substantial meta-data manipulation. Besides Actors and Multilisp, Futures have been supported in various forms such as in the J-Machine developed by Bill Dally (then of MIT) and the Tera MTA multithreaded computer architecture.
Lisp applications can be memory intensive. The handling of large data sets such as those related to large scale scientific applications can impose extreme demands on memory size. Identifying relationships and associations requires cross comparisons among all n-way subsets of data elements. The resulting meta-data representing the organization, cross-relationships, and abstractions of the base data can, in certain cases, exceed the size of the original data. Access and manipulation of this data can additionally demand high memory bandwidth. This is because, in many cases, the ratio of operations to accesses is relatively low compared to conventional numeric intensive applications. Systems combining the properties of high memory capacity, high memory bandwidth, and low cost are particularly well suited to supporting Lisp applications and symbolic computing.
PC cluster systems which include Beowulf-class systems have evolved as an alternative strategy to achieving scalable moderate and high end performance for some (even many) application and algorithm types. Beowulf-class systems comprise an ensemble of low cost mass market PC compute nodes (either single processor or small SMP) integrated by means of COTS LAN or SAN technologies (e.g. Ethernet, ATM, Myranet) and employing an open source Unix-like operating system (e.g. BSD, Linux) with a community standard message passing library (e.g. PVM, MPI). While loosely coupled relative to such commercial systems as the SGI Origin 2000, the Cray T3E, and the HP Exemplar, Beowulfs have been shown to deliver good performance at exceptional price-performance for many (but not all) algorithms. Beowulf systems won the Gordon Bell Prize for price-performance in 1997 and 1998. For many applications favoring such clusters, cost advantage can approach an order of magnitude. These systems are proliferating with thousand-plus processor systems being implemented and many smaller ones acquired for industry, commerce, academia, and government. Even some high schools are employing them as a stimulus for education. These low cost systems, although widely employed for a broad range of applications, have not found much usage in the domain of symbolic computing in recent years. This is in stark contrast to a plethora of research projects in the 1980s dedicated to pursuing parallel symbolic processing with distributed versions of such languages as Lisp and Prolog.
The principal motivation for implementing PC clusters has been to bring many processors to a single workload, whether as a throughput accelerator using job-stream parallelism or accelerating a single application using inter-process/processor message passing for synchronization and data sharing for compute intensive problems. A similar driver can be to realize large arrays of low cost memory, even if distributed for memory intensive problems. Comparing a workstation of a decade ago to a PC of today yields a memory cost ratio improvement of approximately a factor of a hundred. Today, a moderate size Beowulf costing less than $50K will have ten to a hundred times the memory capacity of the largest shared memory supercomputers of the last decade costing 10s of millions of dollars. Beowulf-class systems as multiple-memory systems may be as important for memory intensive computations as they are as multiple-processor systems for compute intensive applications. The opportunity for exploiting large memory distributed Beowulfs for a concurrent Lisp-like execution environment may make possible a scale of symbolic computing unavailable until now. Its potential impact for high-level data codification and information extraction for scientific computing is as yet largely unexplored. The challenge of applying low cost clusters to symbolic scientific computation for knowledge abstraction and manipulation is the focus of the proposed research.
One goal of the proposed research project is to advance the scientific process through symbolic computing by developing a new generation of distributed cluster symbolic processing tools and applying them to specific scientific applications, thereby deriving abstract knowledge from basic data. The result of this research will be a library of software tools for cluster symbolic computing and a set of techniques for their application to knowledge extraction for science problems.
The second principal objective of this applied research project is to devise and implement effective means of accomplishing scalable symbolic processing for scientific computation through a Common Lisp-like programming model augmented for distributed memory low cost clusters like Beowulf. The selection of Lisp is driven in part by its demonstrated value for representing symbolic algorithms providing the necessary abstractions and mechanisms conducive to the manipulation of symbolic structures and execution of symbolic functions. The result of this research will be a succession of increasingly sophisticated Cluster Common Lisps (Common Lisp augmented subsets) providing stages of mechanism and semantic control for managing distributed activities across nodes of a PC cluster. These tools will be made available (over the web of course) to the general symbolic computing and Beowulf communities at no cost. This tool set will be applied to example problems of scientific information abstraction, extraction, and manipulation presented above. The last objective is to provide quantitative evaluation of the effectiveness of this approach to accomplishing scalable memory systems for symbolic processing as represented by the science problems being investigated.
The strategy for accomplishing the research goals of the proposed project is driven by the opportunity for breaking new ground to advance the frontier of handling scientific data on the one hand, and the practical aspects of carrying out systems work with modest budget and timeframe on the other. To bridge the gap between the disciplines of symbolic computing and scientific data processing, existing symbolic processing tools will be applied to two specific problems in science by experts in the related areas with experience in scientific data processing: association rules in astronomical databases, and event classification in experimental particle physics.
We will develop the software and methodology for symbolic scientific data processing in tandem with the system software for it implementation.
These ideas of symbolic processing for scientific data are still gestating. The basis of data-mining through association rules is provided by using associative arrays as a way to store subsets of data catalogs, and relationships between elements of such catalogs.
As a first example of the use of the use of Cluster Common Lisp, we propose to implement a code for document vector analysis of natural language and genomic (DNA) sequences. Document vector techniques provide a similarity measure of character sequences that facilitates clustering and retrieval; a normalized histogram is made of all subsequences of some fixed length from the given sequence; the subsequence length ("word size") is generally between 3 and 10.
Scalar products of such vectors provide a good metric for documents and other sequences, enough to discern which language the document uses, what the document is about, and even the author of the document. The evaluation of these techniques is underway at Caltech for comparisons of genomic sequences; especially interesting is when the word size is large, as this probes the more distant correlations in the sequence.
Creating and using a document vector of large word size from a long document or DNA sequence is a computationally intensive process. It requires an excellent implementation of an associative array -- the mapping from a given subsequence to the corresponding histogram weight. We will prove our Lisp system by creating an efficient, parallel associative array, and use it for genomics research.
Associative arrays are also the central data structure for computation involving subsets of large sets -- the keys of the array are ID numbers of the members of the set. Indeed, relations between members are represented as associative arrays where the key is a pair of such IDs. Thus we can use the document vector research as a stepping stone to more general research on symbolic computing.
To provide a symbolic computing framework coordinated across cooperating processor nodes of a PC cluster, an incremental layered approach built on top of existing tools will be pursued to minimize redevelopment and limit work to extensions for parallelism. Also, existing and stable Beowulf-class systems will be used for major experiments (a small development Beowulf of 4 nodes will be used for this project as well) reducing costs and avoiding undue manpower overhead.
At the lowest level, PC cluster hardware is distributed memory, presenting a fragmented memory address space. Using explicit message passing, information can be copied between nodes (processors, memory, communications, and support hardware plus Unix-like operating system) under the control of interacting sequential processes (send/receive semantics). At the highest level, Lisp (like many other languages) assumes a flat unified memory name space. Some languages for distributed computing like HPF and BSP bridge this gap by a discipline that within their respective constraints allows the programmer to treat the application variables as a common name space while the explicit messages are generated by underlying compile time functions.
The strategy for developing a new generation of a Cluster Common Lisp is to exploit existing Common Lisp constructs exhibiting intrinsic parallelism augmented with the Futures construct. This provides a semantic basis that minimizes the corruption of the implicit computing model and simplifies parallel programming. This is not to say that the programmer is unaware of the implications of synchronization and global side-effects, but rather that these issues are already part of the consideration of using the existing language. A cluster name space handler and distributed scheduler will be developed as part of the runtime system to augment the existing Common Lisp environments and interfaced to the MPI and MPI-2 message passing libraries developed under the MPICH project at Argonne National Labs.
The Cluster Common Lisp approach is to employ MPI as a transport layer without directly exposing the message passing semantics to the applications programmers. The strategy for accomplishing this is to leverage specific semantic properties of Common Lisp that lend themselves to general parallelization. These include functional semantics, operations with out-of-order argument evaluation, mapping operators, object-oriented message driven abstractions, and (from Multilisp) the Futures synchronization construct. An existing single-node Common Lisp system such as the commercially available Allegro Common Lisp will be used as the logical platform on each cluster node upon which to build the distributed computing tool set and library.
Challenges of name space management and granularity control will be two major issues addressed in the proposed research. A distributed data structure directory will be developed. The global name-space handler will be supported across cluster nodes and a coherence mechanism will be devised to keep distributed copies consistent. This will allow references to non-local data structures to be automatically resolved without explicit programmer intervention through code to manage distributed partitions. However, locality pragmas will be included to indicate preferential placement of structures relative to one another.
A compile time analyzer will estimate granularity of task workload and identify those tasks that cause global mutable side-effects. The granularity analyzer will crudely estimate the degree of work to determine whether a task can be efficiently dispatched to a remote node, i.e. is the overhead less than the work to be performed. The simple side-effect analyzer will identify those tasks that are purely functional and therefore can be processed independently. An inter-node data structure copy handler will support the copy semantics of many of the Lisp constructs across the distributed system. An implementation of the Futures construct that was developed at MIT as part of the Multilisp project will be created for the cluster environment. All elements will be organized as a portable library. Additional advances will be identified that would require direct modification to the low-level compiler of runtime system. Finally, the original scientific information extraction codes that will first be implemented on a single node will be modified to use the Cluster Common Lisp on a Beowulf system and the performance effectiveness evaluated.
Lisp incorporates a large number of operators that use copy semantics to accomplish their designated actions. These functional or value oriented instructions preclude side-effects and can be performed on any node by copying the data structures and values. Not all such operations can be performed efficiently and determining when it is warranted to do so is one of the issues to be addressed by the proposed research. Value oriented computation was employed in such parallel languages as VAL, Id Nouveux, SISAL, and Haskel. Besides the standard numeric functional operators, a large number of functional data structure operators are included. These operators and their data can be copied to separate nodes to carry out the work and the resulting value returned. This assumes that the overhead of distributing the work is commensurate with the quantity of work to be remotely accomplished.
Overhead and granularity of task and data distribution will determine the efficiency of the global parallel system. A compile time analysis tool will be developed to categorize the granularities of individual operations and structures. This will require a side-looking runtime component as some of the logical elements will be runtime determined. A scheduler will be developed that will distribute tasks based in part on the granularity of the work to be performed. If the overhead of work distribution is greater than the useful work to be performed, then the task will be kept local. Only when the useful work is sufficiently great to warrant incurring the necessary overhead will the scheduler decide to farm on the actions to another node. There are a complex set of strategies related to this problem and the capabilities of the scheduler and analyzer will be implemented incrementally.
Lisp incorporates a number of operators for which their arguments may be evaluated out of order. Such semantics intrinsic to Lisp provide a natural way to delineate parallelism within Lisp applications. Some of these like LET provide parallel and sequential versions of the same construct. The set of mapping functions are another example. Some operators like AND and OR impose a sequential ordering of their arguments. But cluster common lisp will include non-ordered versions of these. Others, like COND expect a selection of the first satisfied condition clause. But an eager evaluation version can permit parallel execution of multiple condition clauses and consequent actions up to but not including side-effects.
The Common Lisp Object System (CLOS) is a sophisticated object oriented set of language constructs. These permit encapsulation of data and methods and the use of message driven semantics. CLOS provides another natural structure for parallelism and the distribution of concurrency of computation. Futures, derived from the Actors model and employed as a basis of Multilisp provides a powerful parallel construct that permits manipulation of meta-data structures at the same time that data values are being computed. While not a formal element of the Common Lisp language, it has been extensively studied in the context of Multilisp and is easily added.
A powerful aspect of Lisp (as well as Java) is its automatic memory garbage collection. Stop and Copy and Ephemeral garbage collection algorithms have been successfully applied to early Lisp systems and machines. The project will initially employ the single-node garbage collection services of the base software system but will explore the problem of a distributed garbage collection scheme. It is recognized that true implementation of real distributed garbage collector may require access to low level runtime code and have to be conducted outside the scope of this project.
One area outside the scope of this project is the need for a distributed debugger. A commercial grade system would require this as well as a distributed performance analyzer. These will be left to future research work since the approach taken is highly sensitive to the specific characteristics of the distributed computing methodology employed. Nonetheless, the resulting tool set will be incomplete without these user tools. Another are of future work is the management of mass storage. Lisp treats data storage as a single flat memory. While there are separate instructions for explicitly managing files in secondary storage, it provides no effective means of controlling virtual memory and the demand paging mechanisms that support it. This can result in poor performance for applications with very large data sets such as those contemplated for scientific processing. This issue will be considered but is not part of the work to be performed during the course of this investigation.
We will purchase a small, 4 node development testbed to allow complete control. The front end with 24 Gbyte disk, the others with 6Gbyte each. Each node would have large memory (1 Gbyte each). Such a system would be approximately $2200 per node. The node interconnect (Myranet) would be another $1.2K per node. Also screen and other peripherals, sales tax. The total would be about $15000.
We would like to be able to travel for workshops, conferences, and meetings with possible future collaborators. We estimate expenses for domestic travel at about $2500, so this budget provides for 6 such trips per year.
· Acquire and install development PC cluster. Install software components including commercial Common Lisp system (probably Allegro Common Lisp), MPI-2 and parallel file system.
· Begin Lisp coding of parallel associative array for DNA document vector.
· Implement and proof-of-concept for market basket algorithms in Lisp (not on cluster).
· Interface MPI-2 to Common Lisp framework and implement the low-level library routines for performing concurrent process execution through message passing.
· Coding of algorithms for clustering in astronomical and particle-physics databases.
· Select semantic constructs for parallel forms including functional, out-of-order evaluation, mapping, and object forms. Extend the set to include variations of sequential constructs for parallel operation.
· Develop low-level mechanisms of structure copy, remote task dispatch, and global distributed name-space table.
· Implement a core set of parallel Lisp constructs to test and evaluate speedup characteristics for simple applications.
· Implement simple kernel application codes to verify system functionality and perform initial performance characteristics and overhead costs.
· Begin trials of parallel system with associative array testbed -- genomic data and/or large language texts
· First large trial of system's parallel file system with scientific data.
· Apply clustering analysis to stored data set and evaluate results.
· Analyze and document performance of the development PC cluster for this application and the systems software so far developed.
· Load large astronomical dataset into parallel file system (approx 100 million sources of 1 Kbyte each).
· Continue development of clustering suite for creating symbolic data from astronomical and particle-physics data.
· Continue development of Lisp-based association-rule software.
· Begin specification and implementation of Problem-Solving Environment for the scientific inference engine.
· Develop analysis tool for identification functional and side-effect tasks.
· Develop compile time and runtime analysis tool for categorizing task and data object granularities.
· Develop advanced scheduler that employs side-effect and granularity data in determining distributed dispatch policy and actions.
· Develop Futures construct implementation.
· Replicate the development system on a large Beowulf of N nodes. Repeat all measurements with the scientific analysis code, and perform detailed performance characterization and efficiency study for evaluation.
· Release Cluster Common Lisp tools to community with documentation via Internet/Web.
· Load sample data set of simulated HEP data (approximately 1 million events of 1 MByte each).
· Implement software in Cluster Common Lisp that can analyze subsets of the data for correlations and features, according to the applied event "cuts", which are taken from the traditional analysis software.
· Ensure that known features and correlations in the event data can be identified by the system.
· Explore new findings identified by the system, and validate them against the known properties of the simulated HEP data.
· Complete documentation and evaluation of the end-to-end system.
· Release entire system (Cluster Common Lisp, data clustering suite, association rules, problem-solving environment) as separate open-source software components.
Database Projects and Sky Surveys
Digital Sky Project, http://www.digital-sky.org/
The Two Micron All Sky Survey, http://www.ipac.caltech.edu/2mass/
The Sloan Digital Sky Survey, http://www.sdss.org/
The Digitized Palomar Observatory Sky Survey, http://dposs.caltech.edu/
Atlas, A Toroidal LHC Apparatus, http://www.cern.ch/Atlas/
CMS: Compact Muon Solenoid, http://cmsinfo.cern.ch/Welcome.html
3. Aggarwal CC, Yu PS, Data Mining Techniques for Associations, Clustering and Classification, in Proc. Third Pacific-Asia Conference Knowl. Disc. Data Mining, PAKDD-99, Beijing, ed. N Zhong and L Zhou. Springer Lect. Notes in Comp. Sci., 1574.
4. Agrawal R, Shafer JC, Parallel mining of association rules, IEEE Trans. Knowl. Data Engng. 8 (1996) 962-969.
5. Franz, Inc., Allegro Common LISP User Guide Version 4.3, March 1996.
6. Gabriel RP, Performance and Evaluation of Lisp Systems, MIT Press, 1985
7. Graham P, On Lisp, Prentice Hall, 1994.
8. Gropp W, Lusk E, Skjellum A; Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994.
9. Halstead RH, New Ideas in Parallel lisp - Language Design, Implementation, and Programming Tools, Lect. Notes Comp. Sci. 441 (1990) 2-57.
10. Halstead RH Jr., Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. Prog. Lang. Sys., 7 (1985) 501-538.
11. Han JW, Fu WJ , Mining multiple-level association rules in large databases, IEEE Trans. Knowl. Data Engng., 11 (1999) 798-805.
12. Shen L, Shen H, Cheng L, New algorithms for efficient mining of association rules, Info. Sci. 118 (1999) 251-268.
13. Snir M, Otto SW, Huss-Lederman S, Walker DW, Dongarra J, MPI: The Complete Reference. MIT Press, 1996.
14. Sterling TL, Salmon J, Becker DJ, Savarese DF, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, MIT Press, 1999.
15. Steele GL Jr., Common LISP: The Language (2nd Edition). Digital Press, 1990.
16. Winston PH, Horn BKP; LISP (3rd Edition). Addison-Wesley, 1989, 1984, 1981.