Last Update: Friday, 16 February, 2007

 

California US CMS Tier2 Center

Latest Information

For the latest information about the Caltech Tier2, please go to the official Web pages.

Some recent photos of the Caltech Tier2 (early 2007):


 

What follows below is a historical document that details the initial Tier2 Prototype development. The Caltech Tier2 enjoys the distinction of being the first Tier2 ever constructed.

Introduction

 

Caltech and UCSD are implementing a prototype “Tier2” center as part of the ongoing preparations for deployment of the global software and computing system for the CMS experiment at the LHC. The implementation was agreed to following discussions at the US CMS Collaboration Meeting in May 2000, at the DOE Review of Software and Computing, and during the Hoffmann Review of LHC Computing at the end of 2000.

 

Role of the Tier2 in the Unified System Concept

 

The Tier2 prototype fits into the Unified System Concept being developed. This is a Grid hierarchy consisting of:

  • Tier0: Site of the Experiment (CERN)
  • Tier1: National Center
  • Tier2: Regional Center
  • Tier3: Workgroup server at an institute
  • Tier4: A desktop 

 

This is shown schematically in the Figure below.

The hierarchy naturally partitions and orders the CMS user community. Tier N sites are served by Tier N-1, and help to offload Tier N-1 when spare capacity is available (other spare capacity at Tier N is devoted to “background” simulation tasks). The site architecture of the Tier2 centers is complementary to that of the major Tier1 national laboratory-based centers, and features:

  • Medium-scale Linux CPU farm, powerful data server(s), RAID disk arrays
  • Less need for 24 x 7 operation
  • Some lower component costs
  • Less production-oriented, to respond to local and regional analysis priorities and needs
  • Supportable by a small local team and physicists’ help

It is intended that there be one Tier2 center in each region of the US, to catalyze local and regional focus on particular sets of physics goals, to encourage coordinated analysis developments that emphasize particular subdetectors, and to place emphasis on training and on the involvement of students in front-line data analysis and physics results. One special feature of the Tier2 centers is that they should include a high quality environment for desktop-based remote collaboration.

 

Tier2 centers are an important part of the distributed data analysis model (comprising approximately one-half of US CMS’ proposed computing capability) that has been adopted by all four LHC Collaborations. The use and management of the regional centers require “Data Grid” tools, some of which are already under development in CMS[1]. The prototype system, and the coordination of its operation with CERN and Fermilab for production and analysis, is providing a testing ground for these tools.

 

Planning History for the Prototype Tier2

 

A cost-effective candidate configuration for the prototype was developed in mid-2000. It specified Linux rack-mounted computational servers, medium-scale data servers and network interfaces capable of providing high I/O throughput (in the 100 MByte/sec range), together with few-Terabyte RAID arrays as nearline data storage for simulated, reconstructed and analysed events.

 

It was decided that the location of the prototype center would be split between the Caltech Center for Advanced Computing Research (CACR) and the San Diego Supercomputer Center (SDSC), which are linked by high-throughput WAN links. This allowed the leveraging of existing large-scale HPSS tape storage systems; of system software installations and developments carried out as part of the Particle Physics Data Grid (PPDG), ALDAP and GriPhyN projects; and of the existing expertise of the staff located at each of these facilities.

 

 

The CALREN OC-12 (622 Mbps) California regional network and the NTON dark fiber link interconnecting CACR and SDSC would be used to test distributed center operations at speeds characteristic of future networks (0.6 – 2.5 Gbps). (A future possibility would be to consider the use of CALREN or its successor (and/or NTON) to include more California sites, such as UC Davis, UC Riverside and UCLA.)

 

Work Plan

 

The work plan has three major branches:

 

  • R&D on the distributed computing model. Strategies for production processing and for data analysis will be investigated with the help of the MONARC simulation tools.
  • Help with upcoming production milestones in association with the Physics Reconstruction and Selection (PRS) studies.
  • Startup of a prototypical US-based data analysis among the California universities in CMS.

 

System Configuration and Features:

 

The prototype system consists of two nearly identical, symmetric systems. One is located at Caltech’s Center for Advanced Computing Research (CACR) and the other at the San Diego Supercomputer Center (SDSC). They are connected over a high-speed network link, currently CALREN (OC12, 622 Mbits/sec) and later NTON (OC48, 2.4 Gbits/sec).

 

The Figure below shows the schematic arrangement of Tier2 components.

The following Figure shows the Caltech half at an early stage of construction.

 

Each contains approximately 40 rack-mounted, dual-CPU Pentium III Linux computational nodes, a multiple-terabyte RAID disk array connected to one or two servers, a network switch, and connections to the existing High Performance Storage Systems (HPSS) at CACR and SDSC. The computational nodes are networked using 100 base-T Ethernet connections; the servers are connected using Gbit Ethernet. Past experience indicated that a 100 base-T Ethernet connection would be sufficient for a dual-CPU machine running CMS reconstruction or simulation software when data have to be streamed across the network. An important aspect of the research agenda is to determine the optimal network configuration (in terms of topology and bandwidth) for each system: the initial configuration does not experience network bottlenecks when running the expected batch and interactive loads, but can become bandwidth-limited when running certain specialized tasks.
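As a rough illustration of why a 100 base-T connection per node was judged sufficient, the short Python sketch below compares a node's streaming demand against the link capacity. The workload numbers (jobs per node, data read per event, processing time per event) are illustrative assumptions, not measured CMS figures.

# Back-of-envelope check of per-node network demand for a dual-CPU worker
# streaming event data over 100 base-T. The workload numbers are
# illustrative assumptions, not measured CMS figures.

def required_bandwidth_mb_s(jobs_per_node, event_size_mb, seconds_per_event):
    """Aggregate streaming rate (MB/s) needed by all jobs on one node."""
    return jobs_per_node * event_size_mb / seconds_per_event

if __name__ == "__main__":
    link_mb_s = 100 / 8.0                     # 100 Mbit/s is ~12.5 MB/s at best
    demand = required_bandwidth_mb_s(
        jobs_per_node=2,                      # one job per CPU
        event_size_mb=2.0,                    # assumed data read per event
        seconds_per_event=10.0)               # assumed CPU time per event
    print(f"Per-node demand: {demand:.2f} MB/s, link ceiling: {link_mb_s:.1f} MB/s")
    print("100 base-T sufficient" if demand < link_mb_s else "100 base-T is a bottleneck")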

 

The Figure below shows the semi-complete Tier2 at Caltech, with callouts that describe the various hardware components.

 

  • HP ProCurve switch: 24 x 10/100 Ethernet ports, 2 Gbit Ethernet ports
  • 3 kVA UPS hosting the server and RAID
  • 21 slave nodes: dual 850 MHz PIII, 512 MB memory, 2 x 30 GB disks, 10/100 Ethernet
  • Winchester FlashDisk: 1 TByte (two trays of 8 x 75 GB Ultra SCSI III), RAID5
  • 1U fold-away console
  • Dual 1 GHz PIII servers with dual-port SysKonnect GigE (under test)
  • Console switch
  • nStor 18F: 1 TByte, FibreChannel (tray one 18 x 18 GB, tray two 18 x 36 GB), RAID0
  • ASA disk server: dual 933 MHz PIII, 1 GB memory, 2 SysKonnect GigE, 1 TByte (16 x 75 GB ATA on 3ware Escalade 6800), RAID0
  • Dell PowerEdge 4400: dual 1 GHz PIII, 2 GB memory, 2 SysKonnect GigE, 5 internal disks

 

 

Computational Nodes

 

The computational (“slave”) nodes are based on the SuperMicro DLE motherboard in a 2U rack-mounted case. Each system has two 800 MHz Intel Coppermine PIII CPUs and 512 MB of 133 MHz SDRAM. The Coppermine CPUs have full-speed cache, instead of the half-speed cache on previous Pentium III versions. Benchmark tests on evaluation systems using CMS reconstruction code showed an approximately 5% performance improvement for Coppermine CPUs with 133 MHz memory over previous Pentium III versions using 100 MHz memory. The choice of motherboard was based on the availability of a system that could support dual Coppermine CPUs and 133 MHz SDRAM while not including a SCSI controller (which is unnecessary and costly for the computational nodes). The CPU speed was chosen to be at the best price/performance point. The amount of memory was chosen to be adequate for two simultaneous ORCA production jobs, with a small amount in reserve for future increases in the ORCA image size. The computational node motherboards can address up to 4 GB of RAM, but 512 MB is seen as sufficient for now. A number of vendors were asked to provide cost estimates for assembling the nodes; in the end the two lowest-cost vendors offered the same price, and Datel Systems of San Diego was chosen because they offered a better service arrangement.
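The memory-sizing argument can be written as a simple budget, sketched below. The per-job ORCA image size and the operating-system overhead are hypothetical placeholder values (the document does not quote exact figures); the point is only the arithmetic behind choosing 512 MB with some reserve.

# Rough memory budget for a dual-CPU node running two simultaneous ORCA
# production jobs. image_mb and os_overhead_mb are hypothetical values,
# used only to illustrate the sizing argument.

def memory_headroom_mb(total_mb, jobs, image_mb, os_overhead_mb):
    """Memory left over after the OS and all job images are accounted for."""
    return total_mb - os_overhead_mb - jobs * image_mb

if __name__ == "__main__":
    headroom = memory_headroom_mb(total_mb=512, jobs=2,
                                  image_mb=200,       # assumed ORCA image size
                                  os_overhead_mb=64)  # assumed Linux + daemons
    print(f"Headroom with 512 MB: {headroom} MB")
    # The motherboard can take up to 4 GB, so the reserve can be grown later
    # if the ORCA image size increases.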

 

Server Nodes

 

Caltech and UCSD use different servers to attach the RAID arrays. Caltech uses two servers. The primary server is a Dell PowerEdge 4400, similar to models we have used extensively in other projects and found to be reliable, with excellent I/O and CPU performance. The secondary server is a dual-processor Pentium III 933 MHz system with 1 GByte of memory and integrated dual 3ware Escalade 6800 ATA RAID controllers. This server also contains dual SysKonnect SK9843 single-port Gbit Ethernet cards, as does the Dell PowerEdge. The secondary server contains about 1.2 TByte of ATA disk in the form of 16 IBM DeskStar units of 75 GB each. These disks are configured in RAID0 (see next section). UCSD uses a 4U custom-configured rack-mounted system based on the SuperMicro DE6 motherboard, which has a much lower cost than the Dell. Both are dual-CPU systems with 2 GB of SDRAM.

 

The following Figure shows the UCSD Tier2 equipment.

 

 

Disk Arrays and Tests

 

Substantial quantities of disk space are required at each of the Tier2 prototype centers; an initial capacity of 2 TBytes at each site was foreseen. We chose SCSI RAID as the best technology on the basis of price/performance as well as availability and reliability. We evaluated several different configurations before choosing a vendor:

 

  • JBOD Seagate and IBM disks with software RAID
  • FibreChannel Seagate disk arrays from nStor Corp. with hardware RAID
  • Dell PERC hardware RAID with IBM and Seagate disks
  • Adaptec I2O hardware RAID with IBM and Seagate disks
  • Software RAID with IBM and Seagate disks
  • Hardware RAID from Winchester Systems
  • Software RAID (1) with Hardware RAID (5) from Winchester Systems
  • Hardware RAID (0,1,10,5) from 3ware using ATA disks

 

The evaluation tests involved measuring the capability of the RAID array to sustain high I/O rates with a variety of block sizes, and the real throughput from an Objectivity database application when writing to and reading from the RAID array. The best performance was achieved using the Winchester Systems FlashDisk.
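The raw-throughput part of these tests can be illustrated with a short script of the kind below, which writes and then re-reads a large file at a chosen block size and reports the rates in MB/sec. It is only a sketch of the measurement style, not the MemSpeed or Objectivity (TOPS) tools that produced the numbers quoted here; the mount point and sizes are placeholders, and the file should be much larger than RAM (or the caches dropped) so that the read is not served from memory.

# Minimal disk-throughput sketch: sequential write then sequential read of a
# large file at a given block size, reporting MB/sec. Path and sizes are
# placeholders; use a file much larger than RAM so the read is not cached.

import os
import time

def measure(path, file_mb=2048, block_kb=256):
    block = b"\0" * (block_kb * 1024)
    nblocks = file_mb * 1024 // block_kb

    t0 = time.time()
    with open(path, "wb") as f:
        for _ in range(nblocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())              # make sure data reaches the array
    write_mb_s = file_mb / (time.time() - t0)

    t0 = time.time()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    read_mb_s = file_mb / (time.time() - t0)

    os.remove(path)
    return write_mb_s, read_mb_s

if __name__ == "__main__":
    w, r = measure("/raid/testfile.bin")  # placeholder mount point on the array
    print(f"write {w:.1f} MB/sec, read {r:.1f} MB/sec")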

 

Winchester “FlashDisk”

 

Showing the two Winchester “FlashDisk” arrays, each of 500 GB.

 

 

The Winchester “FlashDisk” RAID array we tested is a ruggedized rack-mountable disk enclosure of height 4U, containing 8 SCSI 75 GByte Ultra160 Seagate disks and a RAID controller. The controller was initially an Ultra2 (80 MBytes/sec), but was replaced with an Ultra160 (160 MBytes/sec) version, which had only recently become available in the industry. We configured the 8 disks as two RAID5 arrays of 4 disks each. Each of the arrays was connected to a host Dell PowerEdge 4400 server on one of two dedicated Adaptec 39160 Ultra160 channels. The two arrays were then striped together in software (RAID0), to arrive at a virtual disk of approximately 0.5 TByte in capacity. This virtual device was capable of very stable read and write speeds of a consistent 116 MBytes/sec using a memory-disk-memory performance measurement tool[2], and approximately 35 MBytes/sec read and write speeds using our Objectivity/DB-based test application[3] running a single thread. In a third test, using Java I/O in three parallel streams, we achieved 70 MBytes/sec by simultaneously reading three database files of 1 GByte each. The results with the Winchester FlashDisk were much better than those of any of the other configurations listed above. An additional advantage of the FlashDisk is that we are free to insert commodity SCSI disks of larger capacity ourselves as and when they become available. Furthermore, no software drivers are required for the FlashDisk: the RAID controller emulates a simple SCSI interface, which means that the array appears as a standard SCSI device to the host OS.
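For reference, the approximately 0.5 TByte figure follows directly from the RAID arithmetic: each 4-disk RAID5 group gives up one disk's worth of capacity to parity, and the software stripe (RAID0) adds the two groups together. A minimal sketch of that calculation, assuming the stripe interpretation described above:

# Usable capacity of the FlashDisk layout: two 4-disk RAID5 groups of 75 GB
# disks, combined with a software stripe (RAID0).

def raid5_capacity_gb(disks, disk_gb):
    return (disks - 1) * disk_gb          # one disk's worth goes to parity

def raid0_capacity_gb(group_capacities_gb):
    return sum(group_capacities_gb)       # striping simply adds capacities

if __name__ == "__main__":
    group = raid5_capacity_gb(disks=4, disk_gb=75)   # 225 GB per RAID5 group
    total = raid0_capacity_gb([group, group])        # 450 GB, i.e. ~0.5 TByte
    print(f"Per RAID5 group: {group} GB, striped total: {total} GB")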

 

 

Winchester FlashDisk Test details

 

Windows 2000 Server:

  • Memspeed_v2 (Jim Gray)
      Single Ultra2 array: read @ 67 MB/sec, write @ 66 MB/sec
      Single Ultra3 array: read @ 97 MB/sec, write @ 89 MB/sec
      Software NTFS stripe of two Ultra2 arrays: read @ 116 MB/sec, write @ 115 MB/sec
  • Java IOSpeed (JJB)
      Read one 5 GB file @ 53 MB/sec [10 MB buffers, Ultra3]
      Read two 5 GB files @ 74 MB/sec

Linux 2.4 with new SCSI drivers:

  • Dsktst6 (Jan Lindheim)
      Ultra3: write @ 70 MB/sec
  • Java IOSpeed (JJB)
      Read one 2 GB file @ 33 MB/sec [reiserfs, 1 MB buffers]

 

 

 

 

ASA Server – 3ware “Escalade” 1.2 TByte Array

  • ASA Computers built: 22-disk bay, dual 933 MHz PIII, 1 GB memory, one SCSI system disk, twin SysKonnect SK9843 Gbit cards
  • Two 3ware Escalade 6800 PCI controllers, each with 8 IBM 75 GB ATA100 disks: 1.2 TB usable capacity
  • Disks on each 6800 can be configured in RAID0, 1, 10 or 5; all RAID is done in hardware
  • Custom built: very heavy, well cooled
  • Pre-installed with RedHat Linux 6.2
  • Total price paid for the above: $14k

 

ASA Server Test Details

  • Hardware RAID0
      4-disk set: write @ 52 MB/sec
      8-disk set: write @ 65 MB/sec
  • Software stripe of two hardware RAID0 sets
      Two sets of 4 disks: write @ 68 MB/sec
      Two sets of 8 disks: write @ 73 MB/sec
  • Hardware RAID5
      4-disk set: write @ 2.8 MB/sec
      8-disk set: write @ 6.0 MB/sec
      (Consistent with 3ware’s own RAID5 results)

 

 

nStor FibreChannel 1 TB array

  • One tray of eighteen 18 GB disks
  • One tray of eighteen 36 GB disks
  • One controller
  • Total of 1 TB
  • RAID5 and RAID0 being used
  • Hosted on a Sun Enterprise 250 using a LightSpeed LP8000 PCI-FibreChannel adapter
  • Good controller, good disk cooling
  • Poor disk mounting system
  • Best read rate seen: 30 MB/sec
  • Best write rate seen: 7 MB/sec
  • Poor performance blamed on the LP8000 Solaris driver, but similar results obtained under Windows 2000 Server using the Windows driver
  • Smoking!

 

Other Disk Arrays

  • We tested an Adaptec “I2O” PCI RAID controller with four Seagate Cheetah 15,000 rpm Ultra3 disks. Using RAID0, the best we saw was read @ 23 MB/sec and write @ 14 MB/sec, i.e. worse than using a single disk!
  • In another test, using Windows 2000 software striping and an array of Seagate disks, we saw read @ 55 MB/sec and write @ 35 MB/sec with default settings and without further experimentation.

 

Network Configuration

 

Both Caltech and UCSD are preparing to use low-cost, medium-performance network switches from HP, which will be connected via the servers to the existing high-performance switches at CACR and SDSC. The HPSS and external networks at both facilities are connected to Gbit Ethernet switches. To network the Tier2 prototype components, the computational nodes are connected to an HP ProCurve 2524 Ethernet switch (auto-sensing 10/100 Mbit on each of 24 ports) equipped with two Gbit ports. One or both of the Gbit ports will be connected to the server, which in turn (and on a separate circuit) is connected to the existing high-performance Gbit Ethernet switches at SDSC and CACR.
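One way to quantify the possible bandwidth limitation mentioned earlier is the oversubscription ratio of the switch uplinks. The sketch below assumes a fully populated ProCurve 2524 (24 nodes at 100 Mbit each) behind one or two Gbit uplinks; treating all 24 ports as worker nodes is an assumption made only for illustration.

# Worst-case oversubscription of the ProCurve 2524 uplinks: 24 edge ports at
# 100 Mbit/s each versus one or two Gbit uplinks toward the server.

def oversubscription(nodes, node_mbit, uplinks, uplink_mbit):
    """Ratio of aggregate edge demand to uplink capacity."""
    return (nodes * node_mbit) / (uplinks * uplink_mbit)

if __name__ == "__main__":
    for uplinks in (1, 2):
        ratio = oversubscription(nodes=24, node_mbit=100,
                                 uplinks=uplinks, uplink_mbit=1000)
        print(f"{uplinks} Gbit uplink(s): {ratio:.1f}:1 oversubscription")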

 

Leveraged Resources:

 

The Tier2 prototype project is leveraging resources (both infrastructure and people) from CACR and SDSC. In terms of facilities, the project benefits from high-speed networking, both locally at each site and between the two sites; an HPSS installation at each site; and environmental services such as cooling and power conditioning. The project also benefits from the considerable expertise of the CACR and SDSC staff in assembling, configuring, and managing Linux clusters.

 

Hardware Activities

 

January 2001-March 2001

 

  • Node cloning completed
  • ATA Disk array installation, configuration and testing
  • Winchester Disk array installation, configuration and testing
  • Gbit connections to CACR HPSS and High Performance Network from Tier2 (main server) and Tier2b (secondary server, ATA disk array server) completed.

 

Software Activities

 

January 2001-March 2001

 

  • CMS Software Configuration, Installation, and Maintenance – ORCA versions 4.3.0.pre4, 4.3.2, 4.4.1 installed, IGUANA 2.1.0, 2.1.2, 2.3.0 installed
  • System Configuration for CMS Production
  • Substantial work by Vladimir Litvin on automating CMSIM and ORCA production
  • Installation of GDMP 1.2.1.
  • GDMP Stability Testing – replication of 170GB of Minimum Bias events from CERN to Caltech Tier2.
  • Production of 500,000 CMSIM events.
  • Production of 150,000 fully reconstructed ORCA events.
  • Production of ooDigis (50,000 events)
  • (UCSD) Installed and Tested NPACI developed Rocks Cluster Management Software
  • GDMP Stability Testing - replication of 300GB of Minimum Bias events from CERN to Caltech and UCSD Tier2s.
  • (UCSD) Installation and testing of prototype Fermilab-developed production tools.
  • Installation of the Hickey scheduler, which serves processors on the Caltech and SDSC Tier2 systems.
  • Installation of PBS.

 

Tier2 Personnel

 

Apart from the individuals in the following list, who are concerned with various management aspects of the Tier2, we have a growing community of end users.

 

Name              Affiliation    Tasks                                  %
Julian Bunn       Caltech/CACR   System oversight                       20%
Suresh Singh      Caltech        System management                      75%
Harvey Newman     Caltech        Overall direction                      10%
Koen Holtman      Caltech        PPDG and scheduling                    50%
Vladimir Litvin   Caltech        ORCA and CMS environment support       25%
Takako Hickey     Caltech        ALDAP and scheduling methods           50%
Mehnaz Hafeez     Caltech        Globus infrastructure                  40%
Asad Samar        Caltech        Globus infrastructure                  30%
Philippe Galvez   Caltech        Network measurements                   30%
James Patton      Caltech/CACR   Network setup and NTON measurements     5%
Jan Lindheim      Caltech/CACR   Data server configuration               5%
Ian Fisk          UCSD           ORCA, physics tools                    25%
Jim Branson       UCSD           Physics tools                          10%
Reagan Moore      SDSC           Grid data
Mike Vildibill    SDSC           Grid data handling system
Phil Andrews      SDSC           Grid data handling system

 

 

 

Role of Other Universities in California and Elsewhere

 

All of the CMS member universities in California will be connected at high speed by the Calren-2 project. We expect that UC Davis, UCLA, and UC Riverside will be the first users of this prototype center: several physicists from those institutes are already active on the Tier2. In addition there are, of course, CMS users from UCSD and Caltech. We expect that all of these institutes will have significant software and physics efforts next year (2002), when the center is operating. We intend to provide support for users from other US regions, if that turns out to be necessary despite the presence of other resources at FNAL and elsewhere.

 

It may be possible, later, to enlarge the center by including resources at the other California universities.  UC Riverside is interested in this possibility. 

 

 

Tier2 FTE Effort by key personnel

 

This table breaks down the approximate effort by individuals at Caltech who were involved in the installation and commissioning of the Caltech Tier2.

 

 

 

 

Technical Coordinator (total: 40 hours)
  meetings: 2
  benchmarking: 10
  general installation tasks: 28

Network Expert (hi. perf. network) (total: 5 hours)
  meetings: 4
  installation: 1

Network Expert (comm. network) (total: 10 hours)
  meetings: 7
  general tasks: 2
  network installation: 1

Electrical Engineer (total: 40 hours)
  nodes, racks, cabling: 40

Systems Group Manager (total: 11 hours)
  meetings: 4
  correspondence: 2
  related activities: 5

Facilities Manager (total: 68 hours)
  meetings: 3
  research: 20
  purchasing: 4
  coordination: 40
  cleanup: 1

Application Software Engineer (total: 20 hours)
  CERN software installation etc.: 20

OS Software Engineer (total: 60 hours)
  Linux install, cloning, boot disk: 60

Administrative Assistant (total: 10 hours)
  purchasing, order tracking etc.: 10

Grand total: 264 hours



[1] Work now underway, done in collaboration with Ian Foster and Carl Kesselman as part of the PPDG, GriPhyN, and EU DataGrid Projects.

[2] For this test we used the “MemSpeed” application from Jim Gray/Microsoft, which provides a detailed set of disk performance statistics at various block sizes.

[3] The Objectivity database used had a page size of 32 kBytes. The test application (TOPS) was developed by Koen Holtman/Caltech.