Proposal to use the Caltech HP/Convex SPP 2000 in the Monte Carlo simulation of Hadronic Backgrounds to the Higgs Boson decay H → γγ
J. J. Bunn, H. Newman, S. Shevchenko, R. Wilkinson
(California Institute of Technology)
Abstract
We propose to use the Caltech SPP 2000
to run a full Monte Carlo simulation of over one million high-energy
proton-proton collisions. This study, the first of its kind, will determine the
feasibility of discovering the Higgs boson in its di-photon decay mode H → γγ when
using the Compact Muon Solenoid (CMS) detector at CERN's Large Hadron Collider
(LHC). The detailed detector simulation, which is part of this study, will
allow us to optimize the CMS detector design so that the high resolution and
background suppression capabilities of the detector are maintained over the
full solid angle.
Summary
of Proposed Research
The discovery of the Higgs boson is
currently one of the main goals of experimental High-Energy Physics (HEP)
research, and one of the main justifications for constructing the Large Hadron
Collider (LHC) at CERN. The Higgs boson is the key missing element in the
otherwise highly successful Electroweak Theory, which unifies the fundamental
electromagnetic and weak interactions. It is the Higgs particles that are
thought to be responsible for the masses of the heavy Z and W particles that
carry the weak interaction, and ultimately for all particle masses. The
detection of one or more types of Higgs particles, or the discovery of an
alternative Higgs sector of the theory as predicted by Supersymmetry, would
have a pivotal influence on our understanding of the fundamental forces of
nature throughout the next decades.
The LHC is scheduled to begin running in
2005, and involves four experiments: ALICE, ATLAS, CMS and LHC-B. The two major
experiments, ATLAS and CMS, each involve collaborations of around 1600
physicists. The authors are all members of the CMS collaboration.
Because of the importance of the Higgs
discovery, the CMS design incorporates a precision crystal calorimeter with
unique capabilities for the detection and measurement of these particles,
particularly through their decays into final states containing photons,
electrons and muons. It is the aim of this study to verify the detector design
and its discovery potential in time for the finalization of the design and the
start of construction. Because Higgs production and decay is extremely rare
compared to the standard production of two or more "jets" of
particles in the detector, resulting from hard scatters between the quarks and
gluons making up the protons, filtering out the backgrounds is a key issue.
Until now, the computational power required for a detailed study of the backgrounds, with sufficient statistics to isolate, study and correct areas of weakness in the detector design, has not been available. We therefore propose to use a large
set of simulated events produced on the HP Exemplar at Caltech to accurately
gauge the level of backgrounds to be expected in CMS at the LHC. If needed, we
will also use these simulations to correct or optimize the design of CMS in
some regions, in a way that was not possible in the past.
The high-precision data from the present generation of experiments indicate indirectly that, if the Electroweak Theory is correct, the mass of the Higgs (or of the lightest Higgs if there are several types) is likely to be of order 100 GeV/c². If the mass of the Higgs is ≈ 80-140 GeV/c², the most promising channel for its discovery is the rare decay to two photons: H → γγ. A
large and poorly understood background to this signal comes from hadron jets, copiously produced in the proton-proton collisions, that are misidentified as photons. The background is particularly likely to fake a pair of high-energy photons if the jets include two isolated neutral mesons, such as π⁰s, which each decay to two photons. If such a meson has a large enough energy, the photons resulting from its decay are difficult to distinguish from a true high-energy, isolated photon. In these cases, neural networks and other sophisticated
pattern recognition algorithms are used to help reject the background, based on
details of the energy sharing profile among the crystals in the calorimeter.
These rely on a detailed knowledge of the detector geometry and response to
electromagnetic showers in the crystals, including the small gaps between
crystals, and the difficult regions of the detector between detector modules
where supports and services tend to degrade the performance.
Previous studies of this background
using Monte Carlo simulation of dijet events have obtained estimates for the
rate at which each jet would be misidentified as a photon. Because of the limited computational power available, the diphoton background was then estimated by simply multiplying these per-jet rates together. One problem with this approach is that correlations within an event invalidate the estimate. A second is that
non-Gaussian tails in the resolution have not been adequately simulated. A
third, related problem, is that the full detector, including all of the less
than ideal geometrical regions, has not been adequately sampled. All of these
shortcomings will be addressed in this study. As a result, we expect to verify
CMS’ ability to detect the H → γγ signal, and to optimize the design to make the
Higgs discovery possible during the early running phase of the LHC.
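To illustrate the first shortcoming, the toy Monte Carlo below (written in Python purely for exposition; all rates in it are invented and are not CMS numbers) compares the naive estimate obtained by multiplying per-jet fake rates with a direct per-event count when the two fakes in an event are correlated.

    # Toy illustration (not CMS code): why per-jet fake rates multiplied together
    # can misestimate the double-fake background when fakes are correlated in an event.
    # All rates below are invented for illustration only.
    import random

    random.seed(1)
    N_EVENTS = 1_000_000
    P_FAKE = 0.001           # assumed per-jet probability that a jet fakes a photon
    CORRELATION_BOOST = 5.0  # assumed enhancement of the second fake given the first

    double_fakes = 0
    for _ in range(N_EVENTS):
        jet1_fakes = random.random() < P_FAKE
        # Events whose first jet fakes a photon (e.g. one containing a hard, isolated
        # neutral pion) are assumed more likely to contain a second fake; this is the
        # kind of correlation the naive rate-squared estimate ignores.
        p2 = P_FAKE * CORRELATION_BOOST if jet1_fakes else P_FAKE
        jet2_fakes = random.random() < p2
        if jet1_fakes and jet2_fakes:
            double_fakes += 1

    naive_estimate = N_EVENTS * P_FAKE * P_FAKE  # rates "simply multiplied together"
    print(f"naive estimate: {naive_estimate:.1f} double-fake events")
    print(f"direct count  : {double_fakes} double-fake events")

With these illustrative numbers the direct count comes out several times larger than the naive estimate, which is exactly why a full per-event simulation is needed.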
We thus propose to simulate a
large enough sample of dijet events to directly measure the rate of diphoton
misidentification. The
simulated data we generate would be made available to the entire CMS
collaboration.
These studies are part of the GIOD
project, a joint effort by Caltech's Center for Advanced Computing Research
(CACR), Caltech's CMS group, Hewlett-Packard, and CERN, to study the
feasibility of using Regional Computing Centers to solve LHC-scale data access
and processing challenges. From the technical point of view, this work will also assess the suitability of a shared-memory multiprocessor such as the Exemplar as the central server for a distributed (Object) database system, while
simultaneously performing as a server for a diverse set of computationally
intensive jobs.
Methodology
Physics Background: The detection of the signal H → γγ is
hampered by background physics processes which also contain two prompt photons,
or one real and one "fake" photon from meson decay (as described
above), or two "fake" photons. These background events include diphoton production from quark annihilation and gluon fusion, diphoton production from quark bremsstrahlung, dijet events in which neutral particles in a jet decay to photons, and events containing both a jet and a prompt photon. The goal is to
reduce the rates at which the background events are selected so that we may
distinguish the Higgs signal’s sharp diphoton invariant mass peak from the
smoothly falling background mass spectrum.
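The separation rests on the di-photon invariant mass, computed for each candidate photon pair from the measured energies and opening angle. The sketch below shows this standard kinematic relation; it is an illustrative helper only, not CMS reconstruction code.

    # Di-photon invariant mass: the Higgs signal would appear as a narrow peak in
    # this quantity, the background as a smoothly falling spectrum.
    # Illustrative helper only, not part of the CMS reconstruction software.
    import math

    def diphoton_mass(e1, e2, opening_angle):
        """Invariant mass (GeV/c^2) of two massless photons with energies
        e1, e2 (GeV) separated by opening_angle (radians)."""
        return math.sqrt(2.0 * e1 * e2 * (1.0 - math.cos(opening_angle)))

    # Example: two 60 GeV photons separated by 110 degrees give a mass near 100 GeV/c^2.
    print(diphoton_mass(60.0, 60.0, math.radians(110.0)))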
Predictions of the rates at which the
background events occur can be compared with the expected rate at which the
Higgs decays into two photons. This is accomplished using sophisticated
simulation models of fundamental particle interactions (such as PYTHIA) that
encode the latest experimental and theoretical data, together with a detailed
simulation of the CMS experiment and its response to particles passing through
it. For real photons that are well-isolated from other particles, it is
relatively straightforward to calculate the total number of signal and
background events expected during a run of the CMS detector, by multiplying the
rates by the integrated luminosity of the collider. The simulation of fake photons arising from meson decays, from poorly measured particle trajectories, and from unusual patterns of energy deposition in the electromagnetic calorimeter is more involved and requires the full detector to be simulated. It is the latter
procedure that takes the vast majority of the computer time.
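The "straightforward" part of that calculation is simply the product of a cross section, the integrated luminosity and a selection efficiency. The sketch below uses invented numbers purely to illustrate the arithmetic; they are not CMS predictions.

    # Expected event yield from a cross section and an integrated luminosity.
    # All numbers here are invented for illustration; they are not CMS predictions.
    sigma_fb = 50.0          # assumed cross section x branching ratio, in femtobarns
    int_lumi_inv_fb = 30.0   # assumed integrated luminosity, in inverse femtobarns
    efficiency = 0.7         # assumed selection efficiency

    expected_events = sigma_fb * int_lumi_inv_fb * efficiency
    print(f"expected selected events: {expected_events:.0f}")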
For this study, we will concentrate on
the background events from dijets, since the production rate of this process is
very high compared to the Higgs signal, and the suppression of the background
by several orders of magnitude is poorly understood. The rate prediction for the dijet background corresponds to 200,000,000 jet events satisfying a preselection. Of these, approximately 45,000 will contain two real photons that satisfy the Higgs selections. These selections impose the event topology and kinematics required for consistency with H → γγ decay. To make a reasonable study, we choose to simulate a factor of 1000 fewer events than will occur in the run: 200,000 events, yielding ≈ 45 events selected as Higgs candidates. In addition, we will
measure the background rate coming from events with one or more fake photons
directly from the study. (The previous, very approximate, studies indicate this
rate should be less than 45.) This must be done in each of several mass ranges
in which the Higgs may lie, so that we can understand the behaviour of the
background contamination from dijets as a function of mass. We thus propose to
generate this number of events in each of five possible Higgs mass ranges.
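The size of the proposed sample follows directly from these numbers; the short calculation below (illustrative arithmetic only) restates the scaling and shows that covering five mass ranges gives the one million simulated collisions quoted in the Abstract.

    # Size of the proposed dijet sample, restating the numbers above (illustrative arithmetic).
    preselected_jets_per_run = 200_000_000  # jet events passing preselection in a real LHC run
    real_diphotons_selected = 45_000        # of these, events with two real photons passing the Higgs cuts
    scale_down = 1_000                      # factor by which the simulated sample is reduced
    mass_ranges = 5                         # Higgs mass ranges to be covered

    events_per_mass_range = preselected_jets_per_run // scale_down   # 200,000
    expected_real_diphotons = real_diphotons_selected // scale_down  # about 45
    total_events = events_per_mass_range * mass_ranges               # 1,000,000 in total
    print(events_per_mass_range, expected_real_diphotons, total_events)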
Software Infrastructure: We will use the standard CMS simulation program
CMSIM, a Fortran-77 code that we have made parallel at the granularity of
single events using the MPI library. The code is built on GEANT, the de facto standard detector and material simulation program, which is part of the CERN Program Library.
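Because the simulated events are statistically independent, the parallelization is at the event level: each processor simulates its own subset of events, with essentially no inter-process communication. A minimal sketch of this scheme is given below, written in Python with mpi4py purely for illustration (the production code is Fortran-77 using MPI directly); simulate_event is a hypothetical stand-in for the per-event CMSIM work.

    # Minimal sketch of event-level parallelism of the kind described above.
    # Written with mpi4py for illustration only; the production code is Fortran-77 + MPI.
    from mpi4py import MPI

    def simulate_event(event_id):
        # Hypothetical stand-in for the full detector simulation of one event.
        return {"event": event_id, "status": "done"}

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    N_EVENTS = 1000
    # Each MPI rank takes every size-th event; events are independent, so no
    # communication between ranks is needed during the simulation itself.
    results = [simulate_event(i) for i in range(rank, N_EVENTS, size)]
    print(f"rank {rank} simulated {len(results)} events")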
GEANT is a large, complex and mature
Fortran-77 code that has been extensively optimized on a large number of
computer platforms. The first version was written in the early 1980s. Originally
designed for HEP experiments, it has also found applications outside this
domain in the areas of medical and biological sciences, radio-protection and
astronautics. Its principal application in HEP is the tracking of particles
through an experimental setup for the simulation of detector response. Certain GEANT algorithms, such as those that search for boundaries between geometrical regions and those that determine which volumes are adjacent to the current one, are time consuming and dominated by logic rather than arithmetic. The parts dealing with shower simulation, on the other hand, are floating-point
intensive. GEANT thus contains a mixture of integer and floating point
arithmetic. It is known that, to a very good approximation, the GEANT
performance scales with the SPECInt rating of a particular hardware platform.
The input to the GEANT-based CMSIM is a
set of parameters which describe the structure, geometry and materials in the
CMS detector, together with a set of events generated by a separate program called
CMGEN, based on PYTHIA. The CMGEN program simulates the physics of the
high-energy proton-proton collisions. For each simulated collision, the CMGEN
output is a set of parameters that describe the starting positions and energies
of particles which are produced in the collision. Using these particle
parameters, and the information about the CMS detector, the CMSIM program
calculates, for each event, the trajectories of the particles, their
interactions in the materials of the detector, and thence the expected read-out
digitizations from the online data acquisition system.
With that information, we are able to
reconstruct the composition and topology of the original event, and to
calculate the photon/jet misidentification rate, as required.
Preliminary
Results and Progress
In preparing the software for the
Exemplar environment we have investigated various levels of compiler
optimizations, both for the CMSIM library and for CERNLIB, the CERN Program
Library. CERNLIB is required by CMSIM for mathematical, random number, GEANT
and other subroutines. We re-compiled the majority of CERNLIB with the fort77 "-O3" switch. This, together with the rewriting of a crucial array copy routine, improved the performance of the executable by a factor of 1.8 compared with our initial version, which had been compiled and linked on an HP-UX workstation.
During the course of the work porting CMSIM to the Exemplar, we encountered
several minor software bugs and platform specific problems, which were all
fixed rapidly. The porting effort took roughly two programmer-days. From timing measurements we have made on other platforms, we are satisfied that the
performance of CMSIM on the Exemplar is excellent.
Following the port, we tested the
software with three types of physics events. In the first test, we generated
and passed through CMSIM single photon events in batches of several thousand
events per run. In these tests we accumulated approximately 2 million events,
over approximately 10 days of running on the SPP 2000. Our timing results show that 0.5 minutes on a single SPP 2000 CPU are needed to generate each event in CMGEN and process it using CMSIM. (The time spent in CMGEN is in fact negligible, regardless of the complexity of the physics, when compared to the time spent in CMSIM.)
In the second test, we simulated the more complex events H → μμμμ. Each of these events consumed approximately 2 minutes of simulation time. For one of the test runs, which executed in a
64-CPU node of the machine, we produced a graph of the elapsed time
distribution for 100 events, which is reproduced in Figure 1.
Figure 1: Elapsed time to simulate 100 events in each of 64 jobs on the Caltech SPP 2000.
Finally, we made initial timing measurements for the target dijet background events, which indicate that 4.5 minutes per event per CPU are required. These events take considerably longer due to the
simulation of the electromagnetic showers in the calorimeters.
The tests were made with the assistance
of staff at Caltech's CACR. As they are preliminary in nature, the results have
not been published. Since we have already developed and optimized for the SPP
2000 the CMSIM and CMGEN programs described above, the development time to
prepare for the simulation task will be minimal.
Justification
of the Request
Monte Carlo simulation is just one of a
spectrum of compute-intensive tasks employed in High Energy Physics research.
It is characterized by low I/O needs, and intense CPU activity. Proving the
suitability of the SPP 2000 for running large production simulations is a
milestone in the GIOD project, as it helps to demonstrate how large systems
such as this will operate in the running phase of the experiments, given a
workload of different computational tasks. Monte Carlo simulation is also a
suitable task to have running "in the background" on a large machine
such as the Exemplar: it poses little or no demand on the I/O subsystem. In
fact, it is expected that a majority of the simulation will be run in this way
during the running periods of the LHC experiments. Thus we also wish to run
background simulation on the Exemplar as part of our investigation.
In consequence, we request both
dedicated system time in production runs, and system time at low priority for
background runs. The relative proportions are detailed in the tables below:
Table 1: Production Runs per Quarter on the HP/Convex SPP 2000

| Higgs Mass                  | 90 GeV/c²       | 110 GeV/c²      | 130 GeV/c²      | 140 GeV/c²      | 160 GeV/c²      |
| Events from Production Runs | 75,000 + 75,000 | 75,000 + 75,000 | 50,000 + 50,000 | 40,000 + 40,000 | 40,000 + 40,000 |
| Production run length       | 20 + 20 hours   | 22 + 22 hours   | 16 + 16 hours   | 14 + 14 hours   | 16 + 16 hours   |

SPP 2000 requirement: 10 runs of 14-22 hours each on 256 dedicated CPUs ≈ 180 hours; temporary disk space ≈ 100 Gbyte per run.
Table 2: Background Runs per Quarter on the HP/Convex SPP 2000

| Higgs Mass                   | 90 GeV/c² | 110 GeV/c² | 130 GeV/c² | 140 GeV/c² | 160 GeV/c² |
| Events from Background Runs  | 150,000   | 150,000    | 100,000    | 80,000     | 80,000     |

SPP 2000 requirement: ≈ 180 hours ⇒ 1/12 of the machine continuously over the Quarter; temporary disk space ≈ 100 Gbyte of continually available spool area.
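As a rough consistency check of the production run lengths in Table 1 (illustrative arithmetic only), the ≈ 4.5 minutes per dijet event measured in our preliminary tests implies that a run of 75,000 events on 256 dedicated CPUs takes roughly 22 hours of wall-clock time:

    # Rough consistency check of the Table 1 run lengths, using the ~4.5 min/event
    # dijet timing measured in the preliminary tests. Illustrative arithmetic only.
    minutes_per_event = 4.5
    events_per_run = 75_000
    cpus = 256

    cpu_hours = events_per_run * minutes_per_event / 60.0  # about 5,600 CPU-hours
    wall_clock_hours = cpu_hours / cpus                     # about 22 hours per run
    print(f"{cpu_hours:.0f} CPU-hours, ~{wall_clock_hours:.0f} hours wall clock")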
The intention is to move the result data
across the network to a server in Caltech’s HEP department, where they will be
analysed and stored on tape.
We anticipate discovering areas of the
problem domain that will require further investigation. For this purpose, we
are requesting a further, but smaller, allocation of system time on the
Exemplar in the Quarter following this study.
Local
Computing Environment
The Caltech CMS group owns a variety of
workstations running Windows NT, HP-UX, and AIX which will be used to analyze
the results.
In particular, we have recently
installed an HP C-200 workstation, which will be attached via ATM fibre
directly to the SPP-2000. This workstation will be used to verify the
correctness of the simulation jobs as they complete on the SPP-2000, and to
carry out other post-processing tasks. We have also acquired 14 disks of 9 Gbytes each, which will be used to store the simulation data. It is intended that
these disks be attached to the Exemplar, with the agreement and assistance of
CACR staff, and used as a spool area for the result data.
Other
Supercomputer Support
In the past, we have used a
five-processor DEC Alpha EV5 server for some small Monte Carlo simulation runs.
The speed and number of the processors, coupled with the demands of other users and the inconvenient location of the machine (at FermiLab), make this server of very marginal value for the proposed research.
During development of the CMSIM program
on the Caltech SPP 2000, we were given access to a smaller Exemplar at HP in
Richardson, Texas. We made use of this machine over one weekend when the
Caltech machine was unavailable due to system maintenance work.
Qualifications
Dr. Julian J. Bunn, co-leader of the
GIOD project, is a Visiting Faculty Associate at Caltech and a senior staff
member in the Information Technology Division at the European Laboratory for
Particle Physics, CERN. He has been closely involved in the design and
implementation of large software systems for HEP since 1981.
Professor Harvey Newman is Professor of
Physics at Caltech, and is both co-leader of the GIOD Project, and chairman of
the CMS experiment's Software and Computing Board, which sets computing policy
for the entire collaboration. He is also a member of the Steering Committee of
the Network Task Force commissioned by ICFA. Along with S. Shevchenko and R.
Zhu, he has been active in design and performance simulations of the CMS electromagnetic
calorimeter. He also led the design and development of the precision crystal
calorimeter for the GEM experiment at the former SSC.
Dr. Sergey Shevchenko, a Senior Research
Fellow at Caltech, is investigating the effects of the CMS design on its
potential to discover the Higgs boson. He has developed a neural network
algorithm to separate photons from their largest background, neutral pions. He
also contributed to the design study, testing and development of the crystal
calorimeter for the GEM experiment at the former SSC.
Dr. Richard Wilkinson, an Assistant
Scientist at Caltech, has optimized and tailored the CMSIM Monte Carlo
simulation for the Caltech SPP 2000. Using this code, he has managed the
simulation on the SPP 2000 of over one million single-particle interactions.
Other
Members of the GIOD Project
Other participants in the GIOD project
are P. Messina (CACR), J. Patton (CACR), R. Williams (CACR), E. Arderiu-Ribera (CERN), K. Holtman (CERN), V. Innocente (CERN), and A. Kirkby (Caltech/HEP).