Proposal to use the Caltech HP/Convex SPP 2000 in the Monte Carlo simulation of Hadronic Backgrounds to the Higgs Boson decay H → γγ


J. J. Bunn, H. Newman, S. Shevchenko, R. Wilkinson

(California Institute of Technology)

Abstract

We propose to use the Caltech SPP 2000 to run a full Monte Carlo simulation of over one million high-energy proton-proton collisions. This study, the first of its kind, will determine the feasibility of discovering the Higgs boson in its di-photon decay mode H → γγ when using the Compact Muon Solenoid (CMS) detector at CERN's Large Hadron Collider (LHC). The detailed detector simulation, which is part of this study, will allow us to optimize the CMS detector design so that the high resolution and background suppression capabilities of the detector are maintained over the full solid angle.

Summary of Proposed Research

The discovery of the Higgs boson is currently one of the main goals of experimental High-Energy Physics (HEP) research, and one of the main justifications for constructing the Large Hadron Collider (LHC) at CERN. The Higgs boson is the key missing element in the otherwise highly successful Electroweak Theory, which unifies the fundamental electromagnetic and weak interactions. It is the Higgs particles that are thought to be responsible for the masses of the heavy Z and W particles that carry the weak interaction, and ultimately for all particle masses. The detection of one or more types of Higgs particles, or the discovery of an alternative Higgs sector of the theory as predicted by Supersymmetry, would have a pivotal influence on our understanding of the fundamental forces of nature in the decades to come.

The LHC is scheduled to begin running in 2005, and involves four experiments: ALICE, ATLAS, CMS and LHC-B. The two major experiments, ATLAS and CMS, each involve collaborations of around 1600 physicists. The authors are all members of the CMS collaboration.

Because of the importance of the Higgs discovery, the CMS design incorporates a precision crystal calorimeter with unique capabilities for the detection and measurement of these particles, particularly through their decays into final states containing photons, electrons and muons. It is the aim of this study to verify the detector design and its discovery potential in time for the finalization of the design and the start of construction. Because Higgs production and decay is extremely rare compared to the standard production of two or more "jets" of particles in the detector, resulting from hard scatters between the quarks and gluons making up the protons, filtering out the backgrounds is a key issue. Until now, the computational power required for a detailed study of the backgrounds, with sufficient statistics to isolate, study and correct areas of weakness in the detector design, has not been available. We therefore propose to use a large set of simulated events produced on the HP Exemplar at Caltech to accurately gauge the level of backgrounds to be expected in CMS at the LHC. If needed, we will also use these simulations to correct or optimize the design of CMS in some regions, in a way that was not possible in the past.

The high precision data from the present generation of experiments indicate indirectly that if the Electroweak Theory is correct, the mass of the Higgs (or of the lightest Higgs if there are several types) is likely to be in the range of 100 GeV/c². If the mass of the Higgs is ≈ 80-140 GeV/c², the most promising channel for its discovery is the rare decay to two photons: H → γγ. A large and poorly understood background to this signal comes from hadron jets, copiously produced in the proton-proton collisions, that are misidentified as photons. The background is particularly likely to fake a pair of high energy photons if the jets include two isolated neutral mesons, such as neutral pions (π⁰), which each decay to two photons. If the meson has a large enough energy, the photons resulting from its decay are difficult to distinguish from a true high energy, isolated photon. In these cases, neural-net and other sophisticated pattern recognition algorithms are used to help reject the background, based on details of the energy sharing profile among the crystals in the calorimeter. These rely on a detailed knowledge of the detector geometry and of the response to electromagnetic showers in the crystals, including the small gaps between crystals, and the difficult regions between detector modules where supports and services tend to degrade the performance.
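To illustrate the kind of energy-sharing variable on which such algorithms operate, the sketch below (a simplified example of our own, in the same Fortran-77 style as the production codes, and not the actual CMS algorithm) computes the ratio of the energy in the hottest crystal to the energy summed over the surrounding 3x3 array. A single photon concentrates most of its energy in one crystal, whereas the two overlapping showers from a π⁰ decay spread it more widely, lowering the ratio.

      REAL FUNCTION E1OE9(E, NX, NY, IX, IY)
C     Illustrative only, not the CMS production code: ratio of the
C     energy in the hottest crystal (IX,IY) to the energy summed over
C     the surrounding 3x3 array of crystals.  Single isolated photons
C     give relatively high values; overlapping photons from pi0 decay
C     give lower ones.  E(NX,NY) holds the crystal energies.
      INTEGER NX, NY, IX, IY, I, J
      REAL E(NX,NY), E9
      E9 = 0.
      DO 20 J = MAX(IY-1,1), MIN(IY+1,NY)
         DO 10 I = MAX(IX-1,1), MIN(IX+1,NX)
            E9 = E9 + E(I,J)
   10    CONTINUE
   20 CONTINUE
      E1OE9 = 0.
      IF (E9 .GT. 0.) E1OE9 = E(IX,IY) / E9
      END

A cut on a single such ratio is of course far cruder than the neural-net approach, which combines many such profile variables; it serves only to indicate what the inputs to those algorithms look like.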

Previous studies of this background using Monte Carlo simulation of dijet events have obtained estimates for the rate at which each jet would be misidentified as a photon. To estimate the diphoton background, the rates were simply multiplied together, because of the limited computational power available. One problem with this approach is that correlations within an event invalidate the estimate. A second is that non-Gaussian tails in the resolution have not been adequately simulated. A third, related problem is that the full detector, including all of the less than ideal geometrical regions, has not been adequately sampled. All of these shortcomings will be addressed in this study. As a result, we expect to verify CMS's ability to detect the H → γγ signal, and to optimize the design to make the Higgs discovery possible during the early running phase of the LHC.

We thus propose to simulate a large enough sample of dijet events to directly measure the rate of diphoton misidentification. The simulated data we generate would be made available to the entire CMS collaboration.

These studies are part of the GIOD project, a joint effort by Caltech's Center for Advanced Computing Research (CACR), Caltech's CMS group, Hewlett-Packard, and CERN to study the feasibility of using Regional Computing Centers to solve LHC-scale data access and processing challenges. From the technical point of view, this study will also assess the suitability of a shared-memory multiprocessor such as the Exemplar as the central server for a distributed (Object) database system, while simultaneously performing as a server for a diverse set of computationally intensive jobs.

Methodology

Physics Background: The detection of the signal H → γγ is hampered by background physics processes which also contain two prompt photons, or one real and one "fake" photon from meson decay (as described above), or two "fake" photons. These background events include diphoton production from quark annihilation and gluon fusion, diphoton production from quark Bremsstrahlung, jet events where neutral particles in a jet decay to photons, and events containing both a jet and a prompt photon. The goal is to reduce the rates at which the background events are selected, so that we may distinguish the Higgs signal's sharp diphoton invariant mass peak from the smoothly falling background mass spectrum.

Predictions of the rates at which the background events occur can be compared with the expected rate at which the Higgs decays into two photons. This is accomplished using sophisticated simulation models of fundamental particle interactions (such as PYTHIA) that encode the latest experimental and theoretical data, together with a detailed simulation of the CMS experiment and its response to particles passing through it. For real photons that are well-isolated from other particles, it is relatively straightforward to calculate the total number of signal and background events expected during a run of the CMS detector, by multiplying the rates by the integrated luminosity of the collider. The simulation of fake photons from meson decays, from poorly measured particle trajectories, or from unusual patterns of energy deposition in the electromagnetic calorimeter is more involved and requires the full detector to be simulated. It is the latter procedure that takes the vast majority of the computer time.
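For the well-isolated case, the expected count follows the standard relation N = σ × ε × L_int, where σ is the production cross section of the process, ε the efficiency of the selections, and L_int the integrated luminosity delivered during the run. The difficulty lies almost entirely in determining ε for the background processes, which is where the full detector simulation enters.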

For this study, we will concentrate on the background events from dijets, since the production rate of this process is very high compared to the Higgs signal, and the suppression of the background by several orders of magnitude is poorly understood. The rate prediction for the dijet background corresponds to 200,000,000 dijet events satisfying a preselection. Of these, approximately 45,000 will contain two real photons that satisfy the Higgs selections. These selections impose the required event topology and kinematics consistent with H → γγ decay. To make a reasonable study, we choose to simulate a factor of 1000 fewer events than will occur in the run: 200,000 events, yielding ≈ 45 selected as Higgs candidates. In addition, we will measure the background rate coming from events with one or more fake photons directly from the study. (The previous, very approximate, studies indicate this rate should be less than 45.) This must be done in each of several mass ranges in which the Higgs may lie, so that we can understand the behaviour of the background contamination from dijets as a function of mass. We thus propose to generate this number of events in each of five possible Higgs mass ranges.
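The scaling is straightforward: 200,000,000 / 1000 = 200,000 simulated events, of which 45,000 / 1000 = 45 are expected to pass the Higgs selections; generating this sample in each of the five mass ranges gives 5 × 200,000 = 1,000,000 events in total, the "over one million" collisions quoted in the abstract.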

Software Infrastructure: We will use the standard CMS simulation program CMSIM, a Fortran-77 code that we have parallelized at the granularity of single events using the MPI library; a minimal sketch of this scheme is shown below. The code is built on the de-facto standard detector and material simulation program, GEANT, which is part of the CERN Program Library.
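In the sketch that follows, GENEVT and SIMEVT are hypothetical wrapper routines standing in for the actual CMGEN and CMSIM entry points; the driver is illustrative, not the production code.

      PROGRAM SIMDRV
C     Sketch of event-level parallelism with MPI.  GENEVT and SIMEVT
C     are hypothetical wrappers for the generation and simulation
C     steps; they are not the real CMGEN/CMSIM entry points.
      INCLUDE 'mpif.h'
      INTEGER IERR, IRANK, NPROC, IEV, NEVTOT
      PARAMETER (NEVTOT = 200000)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, IRANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
C     Round-robin: each process handles every NPROC-th event.
      DO 10 IEV = IRANK+1, NEVTOT, NPROC
         CALL GENEVT(IEV)
         CALL SIMEVT(IEV)
   10 CONTINUE
      CALL MPI_FINALIZE(IERR)
      END

Because the events are statistically independent, no communication is needed inside the loop, and the throughput scales essentially linearly with the number of CPUs.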

GEANT is a large, complex and mature Fortran-77 code that has been extensively optimized on a large number of computer platforms. The first version was written in the early 1980s. Originally designed for HEP experiments, it has also found applications outside this domain in the areas of medical and biological sciences, radio-protection and astronautics. Its principal application in HEP is the tracking of particles through an experimental setup for the simulation of detector response. Some of the GEANT algorithms, such as those that search for boundaries between geometrical regions and those that determine which volumes are adjacent to the current one, are time consuming and dominated by logic and integer operations rather than arithmetic. On the other hand, the parts dealing with shower simulation are floating point intensive. GEANT thus contains a mixture of integer and floating point arithmetic. It is known that, to a very good approximation, GEANT performance scales with the SPECint rating of a particular hardware platform.

The input to the GEANT-based CMSIM is a set of parameters describing the structure, geometry and materials of the CMS detector, together with a set of events generated by a separate program called CMGEN, based on PYTHIA. The CMGEN program simulates the physics of the high-energy proton-proton collisions. For each simulated collision, the CMGEN output is a set of parameters that describe the starting positions and energies of the particles produced in the collision (sketched below). Using these particle parameters, and the information about the CMS detector, the CMSIM program calculates, for each event, the trajectories of the particles, their interactions in the materials of the detector, and thence the expected read-out digitizations from the online data acquisition system.
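For concreteness, the per-event interface between the two programs can be pictured as a simple particle list; the layout below is a minimal sketch of our own devising, not the actual CMGEN record format:

C     Sketch only (not the actual CMGEN format): the kind of
C     per-particle record the generator hands to the simulation.
      INTEGER MAXPRT
      PARAMETER (MAXPRT = 4000)
      INTEGER NPART, IDPART(MAXPRT)
      REAL PVTX(3,MAXPRT), PMOM(4,MAXPRT)
      COMMON /EVTREC/ NPART, IDPART, PVTX, PMOM
C     NPART        number of particles in the event
C     IDPART(i)    particle type code for particle i
C     PVTX(*,i)    production vertex (x, y, z)
C     PMOM(*,i)    momentum and energy (px, py, pz, E)

CMSIM loops over such a list, handing each particle to GEANT for tracking through the detector description.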

With that information, we are able to reconstruct the composition and topology of the original event, and to calculate the photon/jet misidentification rate, as required.

Preliminary Results and Progress

In preparing the software for the Exemplar environment we have investigated various levels of compiler optimization, both for the CMSIM library and for CERNLIB, the CERN Program Library. CERNLIB is required by CMSIM for mathematical, random number, GEANT and other subroutines. We re-compiled the majority of CERNLIB with the fort77 "-O3" switch. This, together with rewriting a crucial array copy routine, improved the performance of the executable by a factor of 1.8 compared with our initial version, which had been compiled and linked on an HP-UX workstation. During the course of porting CMSIM to the Exemplar, we encountered several minor software bugs and platform-specific problems, all of which were fixed rapidly. The porting effort took roughly two programmer-days. From timing measurements we have made on other platforms, we are satisfied that the performance of CMSIM on the Exemplar is excellent.

Following the port, we tested the software with three types of physics events. In the first test, we generated and passed through CMSIM single photon events in batches of several thousand events per run. In these tests we accumulated approximately 2 million events over approximately 10 days of running on the SPP 2000. Our timing results show that 0.5 minutes on a single SPP 2000 CPU are needed to generate each event in CMGEN and process it using CMSIM. (The time spent in CMGEN is in fact negligible, regardless of the complexity of the physics, when compared to the time spent in CMSIM.)

In the second test, we simulated the more complex events H → μμμμ. Each of these events required approximately 2 minutes of simulation time. For one of the test runs, which executed in a 64-CPU node of the machine, we produced a graph of the elapsed time distribution for 100 events, which is reproduced in Figure 1.

Figure 1: Elapsed time to simulate 100 events in each of 64 jobs on the Caltech SPP 2000.

Finally, we made initial timing measurements for the target dijet background events, which indicate that 4.5 minutes per event are required on a single CPU. These events take considerably longer due to the simulation of the electromagnetic showers in the calorimeters.
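These timings set the scale of the request below: at 4.5 CPU-minutes per event, a 150,000-event sample costs 150,000 × 4.5 / 60 ≈ 11,250 CPU-hours, or roughly 44 hours on a dedicated 256-CPU partition, consistent with the production run lengths requested in Table 1.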

The tests were made with the assistance of staff at Caltech's CACR. As they are preliminary in nature, the results have not been published. Since we have already developed and optimized for the SPP 2000 the CMSIM and CMGEN programs described above, the development time to prepare for the simulation task will be minimal.

Justification of the Request

Monte Carlo simulation is just one of a spectrum of compute-intensive tasks employed in High Energy Physics research. It is characterized by low I/O needs, and intense CPU activity. Proving the suitability of the SPP 2000 for running large production simulations is a milestone in the GIOD project, as it helps to demonstrate how large systems such as this will operate in the running phase of the experiments, given a workload of different computational tasks. Monte Carlo simulation is also a suitable task to have running "in the background" on a large machine such as the Exemplar: it poses little or no demand on the I/O subsystem. In fact, it is expected that a majority of the simulation will be run in this way during the running periods of the LHC experiments. Thus we also wish to run background simulation on the Exemplar as part of our investigation.

In consequence, we request both dedicated system time in production runs, and system time at low priority for background runs. The relative proportions are detailed in the tables below:


Table 1: Production Runs per Quarter on the HP/Convex SPP 2000

Higgs Mass                   | 90 GeV/c²       | 110 GeV/c²      | 130 GeV/c²      | 140 GeV/c²      | 160 GeV/c²
Events from Production Runs  | 75,000 + 75,000 | 75,000 + 75,000 | 50,000 + 50,000 | 40,000 + 40,000 | 40,000 + 40,000
Production run length        | 20 + 20 hours   | 22 + 22 hours   | 16 + 16 hours   | 14 + 14 hours   | 16 + 16 hours

SPP 2000 requirement: 10 runs of 14-22 hours each on 256 dedicated CPUs, ≈ 180 hours in total
Temporary disk space: ≈ 100 Gbyte per run

Table 2: Background Runs per Quarter on the HP/Convex SPP 2000

Higgs Mass                   | 90 GeV/c² | 110 GeV/c² | 130 GeV/c² | 140 GeV/c² | 160 GeV/c²
Events from Background Runs  | 150,000   | 150,000    | 100,000    | 80,000     | 80,000

SPP 2000 requirement: ≈ 180 hours, i.e. 1/12 of the machine continuously over the Quarter
Temporary disk space: ≈ 100 Gbyte continually available spool area

The intention is to move the result data across the network to a server in Caltech’s HEP department, where they will be analysed and stored on tape.

We anticipate discovering areas of the problem domain that will require further investigation. For this purpose, we are requesting a further, but smaller, allocation of system time on the Exemplar in the Quarter following this study.


Local Computing Environment

The Caltech CMS group owns a variety of workstations running Windows NT, HP-UX, and AIX, which will be used to analyze the results.

In particular, we have recently installed an HP C-200 workstation, which will be attached via ATM fibre directly to the SPP 2000. This workstation will be used to verify the correctness of the simulation jobs as they complete on the SPP 2000, and to carry out other post-processing tasks. We have also acquired 14 disks of 9 Gbytes each (126 Gbytes in total), which will be used to store the simulation data. It is intended that these disks be attached to the Exemplar, with the agreement and assistance of CACR staff, and used as a spool area for the result data.

Other Supercomputer Support

In the past, we have used a five-processor DEC Alpha EV5 server for some small Monte Carlo simulation runs. The limited speed and number of its processors, coupled with the demands of other users and the inconvenience of the machine's location (at Fermilab), make this server of very marginal value for the proposed research.

During development of the CMSIM program on the Caltech SPP 2000, we were given access to a smaller Exemplar at HP in Richardson, Texas. We made use of this machine over one weekend when the Caltech machine was unavailable due to system maintenance work.

Qualifications

Dr. Julian J. Bunn, co-leader of the GIOD project, is a Visiting Faculty Associate at Caltech and a senior staff member in the Information Technology Division at the European Laboratory for Particle Physics, CERN. He has been closely involved in the design and implementation of large software systems for HEP since 1981.

Professor Harvey Newman is Professor of Physics at Caltech, and is both co-leader of the GIOD Project, and chairman of the CMS experiment's Software and Computing Board, which sets computing policy for the entire collaboration. He is also a member of the Steering Committee of the Network Task Force commissioned by ICFA. Along with S. Shevchenko and R. Zhu, he has been active in design and performance simulations of the CMS electromagnetic calorimeter. He also led the design and development of the precision crystal calorimeter for the GEM experiment at the former SSC.

Dr. Sergey Shevchenko, a Senior Research Fellow at Caltech, is investigating the effects of the CMS design on its potential to discover the Higgs boson. He has developed a neural network algorithm to separate photons from their largest background, neutral pions. He also contributed to the design study, testing and development of the crystal calorimeter for the GEM experiment at the former SSC.

Dr. Richard Wilkinson, an Assistant Scientist at Caltech, has optimized and tailored the CMSIM Monte Carlo simulation for the Caltech SPP 2000. Using this code, he has managed the simulation on the SPP 2000 of over one million single-particle interactions.


Other Members of the GIOD Project

Other participants in the GIOD project are P. Messina (CACR), J. Patton (CACR), R. Williams (CACR), E. Arderiu-Ribera (CERN), K. Holtman (CERN), V. Innocente (CERN), and A. Kirkby (Caltech/HEP).