C        Project Description

C.1     Introduction

Vision: We propose to develop and deploy the Physics Lambda-based Network System (PLaNetS) to drive a new round of discoveries at the frontiers of data-intensive science. PLaNetS will allow Terabyte and multi-Terabyte “data transactions” between sites to complete in minutes to hours, rather than hours to days, and will significantly improve the overall working efficiency of the network resources. PLaNetS will build on recent progress in networking (described below) to develop a core suite of high-performance end-to-end data transfer tools and applications, enhanced by real-time network and end-system monitoring and management services, components of which have been developed and proven in sustained field trials over the last four years. These will be integrated to form a new paradigm of network operations and management including (1) queues for tasks (transfers) of different lengths and levels of priority, coupled to dynamic (real or virtual) path-construction services for the most demanding, high-priority tasks, leveraging the work of the DOE-funded OSCARS [u19], TeraPaths [a14] and LambdaStation [a10] projects; (2) a task "director" aided by end-system agents to partition the work among foreground, real-time background and queued transfers; (3) end-to-end monitoring, network path and topology discovery, and path performance estimation and tracking services, based on the MonALISA [u10] and Clarens [a3] frameworks, as well as IEPM monitoring services [a9]; and (4) policy-based network path-request and utilization services incorporating the OSG infrastructures for authentication, authorization and accounting.

Relationship to OSG and UltraLight: PLaNetS will amplify and broaden the capabilities of OSG, a uniquely capable national computational facility supporting simulation and experimental research. We will augment the OSG software stack with UltraLight’s real-time services that estimate, monitor and track performance, and that build managed data channels or complete network paths for high-priority data transfer tasks as needed, to optimize the overall throughput among the grid sites.

In partnership with UltraLight, PLaNetS will exploit recent breakthrough progress in several network-related areas of information technology to harness the full capabilities of long-range networks. UltraLight is a state-of-the-art facility based on a hybrid optical packet- and circuit-based dynamic network infrastructure with more than twenty 10 Gbps links, interconnecting the major high energy physics labs (Fermilab, BNL, SLAC) in the US, CERN in Europe and KEK in Japan, and key university sites in the US, Europe, Korea and Latin America (Caltech, Florida, Michigan, FIU, Manchester in the UK, UERJ in Rio, UNESP in Sao Paulo, and KNU in Korea). Several novel methods for real-time network and end-system monitoring and management services have been developed and proven in sustained field trials over the last four years, including (1) new fair-sharing, stable, high-performance TCP-based network protocols (FAST, MaxNet); (2) tuned Linux kernels and network interface settings capable of sustained data transport approaching 10 Gigabits/second (Gbps) for individual streams; and (3) global-scale agent-based systems, exemplified by Caltech’s MonALISA system, that autonomously monitor and help manage major research networks, hundreds of grid clusters and other distributed systems around the clock.

Target Applications: In partnership with the Open Science Grid (OSG) [u20], UltraLight [a16], [a15] and DISUN [u5], we will deliver these capabilities to the high energy physics (HEP), gravitational-wave physics, astrophysics and radio-astronomy communities by closely coupling PLaNetS to the Grid-based physics production and analysis systems under development in ATLAS [u1] and CMS [u29]. Physicists will use the testbed to exploit the powerful new network monitoring, profiling and real-time operations support systems, helping them to meet near-term Grid analysis milestones and greatly improve the performance observed in data challenges. Working in partnership with NSF’s DISUN project, the Large Hadron Collider (LHC) [u16] groups operating Tier1 and Tier2 centers will be able to greatly enhance their ability to transport data among their sites, in a strategically managed, fair-shared and policy-driven manner, thereby greatly increasing their ability to harness their computing and storage resources for data analysis and scientific discovery. Astrophysicists will enhance their ability to locate, extract and, if needed, distribute and further process massive datasets. Radio astronomers will enhance the sensitivity and discovery reach of their explorations, acquiring, processing and correlating data at burst rates several orders of magnitude higher than previously attainable.

As PLaNetS (2006-2011) covers the full ramp-up period of the LHC to design luminosity, during which time network technologies and applications will continue to advance, it is essential that PLaNetS also incorporate the next step in networking. The UltraLight testbed, which currently includes more than 20 national, transoceanic and metropolitan-area links at 10 Gbps - and is continuing to evolve - is planned to be enhanced, within the limits of available funding, by a transition of Caltech’s external network connections to support up to 32 10 Gbps wavelengths using a software-controlled reconfigurable optical add-drop multiplexer (ROADM) on the dark fiber connecting the campus to the CENIC [u2] and UltraLight points of presence in Los Angeles. This will enable multi-10 Gbps wavelength use by our target projects, and allow the first at-scale network tests guiding the design and modes of use of networks and associated systems supporting the next round of frontier science projects starting in 2011-2013, including the Super-LHC, Advanced LIGO, the Large Synoptic Survey Telescope [u15], and next-generation eVLBI experiments. These goals are coincident with those of the GENI framework now being prepared at NSF.

The combined PLaNetS and OSG system will initially be adapted to meet the specific needs of physicists in the US LHC, NVO, LIGO and eVLBI programs, but the PLaNetS tools and services will also be packaged for more general use. By integrating these tools into the Virtual Data Toolkit (VDT) [u26], we will also benefit astrophysicists in the Sloan Digital Sky Survey [u22] and Dark Energy Survey [u4], as well as bioinformatics and genetics application communities such as GADU/GNARE [a18].

Education, Outreach and Broad Impact:  We have designed a unique set of activities to broaden the impact of PLaNetS, targeting both the current and the next generation of scientists. Tutorial workshops and summer research projects will be offered to students who will evolve into the next generation of scientists, providing them with access to the leading edge of the scientific frontier. The data-intensive research paradigm shift resulting from PLaNetS requires additional mini-workshops for practicing researchers, to allow them to take full advantage of the network infrastructure. Professional workshops, based on the student workshops, will be offered at a variety of collaborative venues.

Our education and outreach program utilizes methods established and refined during the iVDGL, CHEPREO, and UltraLight projects. It provides direct and significant support for E&O activities including: interactive workshops, application development, experiment participation, infrastructure deployment, and internships at participating institutions. We will reach a variety of students at our collaborating institutes including a significant number of students from traditionally underrepresented groups and minorities as well as students from our collaborating international institutions. We will also invite a limited number of students from our LHC collaborating institutions that are not part of the PLaNetS collaboration.

Funding for five students is provided in the PLaNetS budget; additional students will be encouraged to attend with funding provided by their institutions. Two student research lines are also provided, with additional funded students provided by REU programs at the participating institutions. Travel funds for the leaders of both the student and professional workshops are provided by their institutions.

PLaNetS, through its groundbreaking level of network capability, the scope of its international testbed and partnerships, and the unique nature of its real-time systems aimed at data-intensive science (building on the systems that are beginning to be deployed now in UltraLight), will provide vital input for NSF’s GENI initiative [u9] by the time it begins in 2009. PLaNetS’ end-to-end managed network paradigm, its ability to field real-time autonomous network systems on a global scale (begun in the UltraLight project) while isolating the hard technical and policy issues, and its long-term development program, driven by the ongoing mission to serve a growing international community in support of their science, research and education, will help shape the worldview of networks over the next few years. PLaNetS will thus influence GENI’s concept and design, especially as it relates to large-scale information usage by the scientific community, and in daily life.

C.2     Motivating Applications

C.2.1    LHC: Physics and Computing Challenges.

High Energy Physics experiments are breaking new ground in the understanding of the unification of forces, the origin and stability of matter, and the structures and symmetries that govern the nature of matter in our universe. To improve our understanding of the fundamental constituents of matter, and the nature of space-time itself, researchers work to isolate and observe rare events predicted by a variety of new physics theories that go beyond our current understanding.  Even when utilizing highly processed analysis object data, the size of an LHC dataset suitable for discovering such rare events is expected to be at the terabyte level, with similar amounts of Monte Carlo simulated data required for hypothesis testing.  Further, assuming a typical LHC data-taking period (“Run”) corresponds to approximately 1-3 hours of stable online operations, the size of such a canonical RAW dataset is also expected to be about a terabyte.  Hence, physicists performing searches for possible new physics discoveries, as well as physicists conducting detector calibration and systematic studies vital for establishing the early presence of possible new physics signals, will frequently request terabyte-sized dataset “chunks” for their work.

Over time, total HEP data volumes to be processed, analyzed and shared are expected to rise from the multi-Petabyte (10^15 bytes) to the Exabyte (10^18 bytes) range within the next 10-15 years, and the corresponding network speed requirements on each of the major links used in this field are expected to rise from the current 10 Gigabit/sec (Gbps) to the Terabit/sec (Tbps) range during this period, as summarized in the following roadmap of major HENP network links [a19].

Table 1: Bandwidth Roadmap (in Gbps) for Major HENP Network Links

Year | Production | Experimental | Remarks
2001 | 0.155 | 0.622 - 2.5 | SONET/SDH
2002 | 0.622 | 2.5 | SONET/SDH; DWDM; GigE Integr.
2003 | 2.5 | 10 | DWDM; 1 & 10 GigE Integration
2005 | 10 | 2-4 × 10 | λ Switch, λ Provisioning
2007 | 2-4 × 10 | ~10 × 10 (and 40) | 1st Gen. λ Grids
2009 | ~10 × 10 (or 1-2 × 40) | ~5 × 40 (or 20-50 × 10) | 40 Gbps λ Switching
2011 | ~5 × 40 (or ~20 × 10) | ~5 × 40 (or 100 × 10) | 2nd Gen. λ Grids, Terabit networks
2013 | ~Terabit | ~Multi-Terabit | ~Fill one fiber

The HEP community leads science in its pioneering efforts to develop globally connected, grid-enabled, data-intensive systems. These efforts have led the LHC experiments to adopt the Data Grid Hierarchy of 5 “Tiers” of globally distributed computing and storage resources [a20]. Data from the experiments are stored at rates of 200-1500 Mbytes/sec throughout the year, resulting in Petabytes per year of stored and processed binary data that are accessed and processed repeatedly by worldwide collaborators.

Processing and analyzing the data requires the coordinated use of the entire ensemble of Tier-N facilities. The relatively few large Tier-0 and Tier-1 facilities are best suited for the high priority large-scale tasks of systematic data processing, archiving and distribution. Moving down the hierarchy to the smaller and more numerous Tier-2 and Tier-3 facilities, individuals and small groups have greater control over how these resources are allocated to small and medium-sized tasks of special interest to them. Data flow among the Tiers will therefore be more dynamic and opportunistic, as thousands of physicists vie for shares of more local and more remote facilities of different sizes, for a wide variety of tasks of differing global and local priority, with different requirements in turnaround times (from seconds to hours), computational requirements (from processor-seconds to many processor-decades) and data volumes.

Data rates and network bandwidth estimates [a1] between the Tier-N sites were initially based on a conservative baseline formulated using an evolutionary view of network technologies. More recent estimates, however, indicate that HEP network demands will reach multiple 10 Gbps links within the next two to three years, as the LHC begins operation, followed by a need for scheduled and dynamic use of 10 × 10 (or 1-2 × 40) Gbps wavelengths as the LHC transitions to full-luminosity running.

C.2.2    Initial and Advanced LIGO: Physics and Computing Challenges. 

The Laser Interferometer Gravitational wave Observatory (LIGO) community is beginning to integrate its data analysis efforts into OSG in an effort to efficiently utilize “friendly computational cycles.” One of the greatest challenges associated with conducting LIGO data analysis on OSG is to efficiently move datasets that are typically on the order of one TB in size from LIGO’s data grid, which houses the full LIGO dataset, onto the storage resources associated with OSG compute resources. PLaNetS, in partnership with the OSG, will allow LIGO to address current network limitations and exploit the full potential of the computational resources available on the OSG, thereby promoting the most efficient analysis of gravitational wave data.

LIGO data movement among its Observatories, Tier-1 and Tier-2 centers, international gravitational-wave partners and OSG storage resources will benefit from PLaNetS' new transparent transport-layer tools and from its services for monitoring and tracking network performance.

In the future, Advanced LIGO, with its increased sensitivity, will likely experience a tenfold increase in data volume. Binary inspiral waveforms will increase in length by nearly two orders of magnitude.  Typical data analysis of Advanced LIGO data will likely require data transfers of several terabytes to efficiently utilize local computational resources on the grid. LIGO’s partnership with PLaNetS will provide opportunity for testing and design guidance prior to the production-level science start-up of Advanced LIGO. 

C.2.3    eVLBI: Astronomy and Computing Challenges

Very Long Baseline Interferometry (VLBI) is one of the most powerful techniques for studying objects in the universe at ultra-high resolutions, combining simultaneously acquired data from a global array of up to ~20 radio telescopes to create a single coherent instrument. Traditionally, VLBI data are collected at rates of up to ~1 Gbps per telescope of incompressible data on magnetic disks that are shipped to a central site for correlation processing. Since the sensitivity of the observations increases as the square root of the data rate, there are large advantages to moving to multi-Gbps data rates.  Within the next three years, data rates are projected to increase to ~16 Gbps/telescope for some experiments, with ultimate goals of ~100 Gbps/telescope.  Recording and shipping physical media at these higher rates quickly becomes uneconomical, making transfer to the correlator over high-speed networks (dubbed ‘e-VLBI’) very attractive.  To achieve this, data would be transferred in real time to the correlator, requiring only relatively small electronic buffers at the telescopes and the correlator, though temporary buffering on physical media at the telescope and/or the correlator is sometimes practical. In addition, e-VLBI will support future ‘distributed correlation’, in which Grid computing resources are dynamically gathered and utilized to spread the correlator processing over hundreds or thousands of geographically distributed resources, enabling new and better science at lower cost.

C.2.4    NVO: Astrophysics and Computing Challenges.

The next decade will witness the completion of several new and massive surveys of the Universe.  These surveys span the whole electromagnetic spectrum from X-rays (ROSAT, Chandra, and XMM satellites) through optical and ultraviolet (SDSS, GALEX, LSST surveys) to measurements of the cosmic microwave background and radio (WMAP and PLANCK satellites).  It is only when these datasets are combined – collating data from several different surveys or matching simulations to observations – that the full scientific potential is realized; the scientific returns from the total will far exceed those from any one individual component.  The Palomar-Quest sky survey produces 50 GBytes of data each clear night, and newer surveys coming online in the next few years (Pan-STARRS [u21], LSST [u15]) are expected to raise this to tens of terabytes per night.

With the advent of event-based astronomy, the demands on the computing system grow, as the astronomers want to be notified of changes in the sky within minutes of a Gamma-Ray Burst or supernova, meaning that the pipeline must be able to meet these real-time requirements.  Other pipelines build derivative products that can be used for mining. The Hyperatlas project [a21] is an infrastructure to deliver image data in uniform projections and wide mosaics, so that images from different times or wavelengths can be jointly mined.

C.3     Need for Incorporating the Network as an Active Element

Grid systems so far have treated the network as a passive and largely featureless substrate for data transport, in spite of the fact that wide area network bandwidths have grown approximately two orders of magnitude faster than processor speeds over the past two decades.  As the designers of HEP online data acquisition systems have learned, successful development and operation of a distributed system requires treating the network as an active element, and an important resource, similar to computing and storage facilities, whose use is to be monitored, tracked and optimized in real time. This is the central theme of the PLaNetS proposal.

The HEP community has become a principal driver, architect and co-developer of advanced networking infrastructure and new tools and techniques for end-to-end data transmission. For example, within the past year teams from Caltech, CERN, Michigan, Florida, FNAL, SLAC, and others demonstrated sustained transfers of LHC Monte Carlo physics data across the UltraLight testbed with throughputs of over 100 Gigabits/sec (Gbps), peaking at 150 Gbps, resulting in a total of 0.5 Petabytes transported during a 24 hour period [u23].  Such milestones clearly reveal that we are pushing the capabilities of networks that are based on statically routed and switched paths. It is now generally understood that in the longer term “intelligent photonics” (the ability to use wavelengths dynamically and to construct and tear down wavelength paths rapidly and on demand through cost-effective wavelength routing) is a natural match to the peer-to-peer interactions required to meet the needs of leading-edge, data-intensive science. The integration of intelligent photonic switching with advanced protocols is an effective basis for efficient use of network infrastructures, wavelength by wavelength, and holds the promise of bringing future Terabit networks within reach, technically and financially, of scientists in all world regions.

Integrating these new capabilities into already complex computing models requires a sustained effort and close coordination with the LHC collaborations, many of whose members are either participating or partnering in this proposal.  Applications will need to be adapted and instrumented, while Data Grid scheduling [a23] and management middleware must be written to take full advantage of the new infrastructure.  Feedback between applications and infrastructure will be critical to implementing an efficient, effective system. In the following we describe some of the features that are required to develop such a system.

C.3.1    Terabyte Size Transactions

The driving forces behind much of the LHC, LIGO, NVO and eVLBI network bandwidth requirements are (1) that “small” requests for data samples will often exceed a Terabyte (even in the early years of LHC operation), and could easily reach 10-100 Terabytes (in the years following), and (2) that the number of requests from, for example, the global HEP, LIGO, and eVLBI communities is expected to reach hundreds per day. This leads to the need to support Terabyte-scale transactions in which the data is transferred in minutes rather than many hours, so that many transactions per day can be completed; the likelihood of a transaction failing to complete is then much smaller than in the case of many long transactions sharing the available network capacity for many hours. If the typical time to complete a transaction is taken to be 10-15 minutes, then a one Terabyte transaction will use a 10 Gbps link fully, and a 100 Terabyte transaction (e.g. in 2010 or 2015) would fully occupy a link of ~1 Tbps.
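
To make the bandwidth arithmetic above concrete, the short Python calculation below (purely illustrative; it simply re-derives the figures quoted in this section, taking 1 TB = 8×10^12 bits) shows the sustained rate needed to complete a transaction of a given size within a target time.

def required_rate_gbps(terabytes, minutes):
    """Sustained network rate (in Gbps) needed to move `terabytes` of data
    within `minutes`, taking 1 TB = 8e12 bits (decimal units)."""
    bits = terabytes * 8e12
    seconds = minutes * 60
    return bits / seconds / 1e9

# A 1 TB transaction completed in 10-15 minutes needs essentially a full 10 Gbps link:
print(required_rate_gbps(1, 15))    # ~8.9 Gbps
print(required_rate_gbps(1, 10))    # ~13.3 Gbps

# A 100 TB transaction on the same timescale would fully occupy a ~1 Tbps link:
print(required_rate_gbps(100, 15))  # ~889 Gbps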

PLaNetS will enable these modes of scientific analysis and discovery by providing the necessary system of services to facilitate frequent terabyte-scale transactions requested by multiple users, completing each transaction in minutes rather than hours (or days) for efficient search optimization and systematic understanding.

 

C.3.2    Sustained Production Flows

One of the highest-priority bandwidth uses for the LHC is sustained production flows of data collected by the experiments’ online systems, which are stored at the Tier-0 and then distributed to the Tier-1s at rates of 200-1500 Mbytes/sec throughout the year.  In addition, most of the production of Monte Carlo simulated LHC data will be performed at the Tier-2 facilities and will be distributed back to the Tier-1s and the Tier-0, as well as to other Tier-2s, at aggregate rates of perhaps up to 50-100 Mbytes/sec.

PLaNetS will enhance these modes of high-priority, sustained scientific data flows by providing the necessary system of services for differential, policy-based network resource management, able to discriminate between high-priority sustained flows and lower-priority bursty flows.

C.3.3    Burst Streaming

An interesting and important network requirement of both eVLBI and NVO involves burst data rates (used in real-time gathering and analysis of the data) which, in the case of eVLBI, are projected to increase to ~16 Gbps/telescope within a few years, with ultimate goals of ~100 Gbps/telescope.   To achieve this, attendant high-level services are required to provide managed ‘on-demand’ dedicated paths from the telescopes to the correlator to allow full real-time correlation.

PLaNetS will enable these modes of burst scientific data flows for real-time data analysis by providing the necessary system of services to facilitate rapid, policy-based network resource re-prioritization.

C.4     The PLaNetS Managed, Integrated System

The PLaNetS collaboration will meet these goals by creating a cohesive system composed of a configurable, agile network, intelligent middleware and integrated applications, while also working with the developing Grid infrastructure from projects such as OSG, EGEE[u7], LCG [u17] and similar efforts.

The PLaNetS system, with its integration of network monitoring and dynamic provisioning through the use of intelligent end-to-end middleware (or “global services”), will transparently support efficient Terabyte-scale data transport, and both small and large real-time flows, for the LHC, eVLBI and Astrophysics collaborations. By maintaining a real-time global view of the state of the system, from the network infrastructure to the Grid middleware to the end-to-end managed services and the physicists' applications, and by applying self-learning algorithms that optimize the workflow while aiming to match the collaborations' policies for coordinated (network, data and compute) resource usage, it will enable stable and effective use of a petabyte-scale globally distributed system for the first time.

By tracking the system state and the data flows associated with various classes of work, with varying priorities and network requirements (throughput; latency and jitter for real-time streams), the monitoring system and global services will be able to learn (supervised, and later autonomously) how to resolve or mitigate network bottlenecks or other resource scheduling conflicts in the face of massive demands for large but limited resources.

Figure 1 shows a high-level view of how PLaNetS services will be embedded within the e-science infrastructure. PLaNetS will build on the successful work of the UltraLight project, which deployed an ultra-scale hybrid network, basic network services and end-to-end network monitoring. PLaNetS will provide interfaces and functionalities for physics applications to interact effectively with the physics application-level services domain.  (Physics) application frameworks will be augmented to interact with a new class of high-level global services that in turn interact with the data storage and data access layers. Low-level UltraLight and OSG services provide hints to the high-level PLaNetS services which allow optimization of data access and throughput, enabling the effective use of caching and pre-fetching, and offering opportunities for local and global system optimizations. PLaNetS adds a completely new dimension to these interactions by interfacing the applications to the novel managed networking services. This allows PLaNetS to extend the advanced planning and optimization behavior into the networking and data access layers, enabling a whole new class of advanced system behaviors and functionalities.

Figure 2: PLaNetS Architecture. Global Services provide policy-based management for multiple individual data transfers.

Figure 2 shows a multi-usage view of how PLaNetS services will be utilized to support data transfers. At any given time in the distributed system there will be multiple transfers in progress, all competing for the same (limited) storage and network resources. Within the PLaNetS computing model these transfers take place through interaction with policy-based, secure PLaNetS services that provide a quality of service and, based on the priority of the transfer, a higher (guaranteed) throughput. The PLaNetS services collaborate not to provide an optimized throughput for a single transfer, but to provide a fair sharing of the available (network) resources at any given time for all transfers. End-to-end monitoring provides feedback on the state of the network and the individual transfers to the global PLaNetS services, which utilize this feedback to collaboratively optimize the transfers. This feedback loop of continuous monitoring and adaptation by the global PLaNetS services is important for optimizing the (network) resource usage needed by multiple transfers; a minimal sketch of such a loop is given below. The PLaNetS computing model will not force applications to utilize its advanced services in order to use the network resources, but applications that do not will receive weaker guarantees on the quality of service for their transfers, which would ultimately weaken the competitiveness of US-based physics groups through the delay of potential scientific discoveries.
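
The following minimal Python sketch illustrates the monitor-and-adapt loop described above. The weighting scheme and the transfer names are illustrative assumptions, not the actual PLaNetS allocation algorithm, which will be policy-driven and refined over time.

def fair_share(link_gbps, transfers):
    """Divide the capacity of a shared link among active transfers in proportion
    to their priority weights; `transfers` maps transfer_id -> weight."""
    total_weight = sum(transfers.values())
    return {tid: link_gbps * weight / total_weight for tid, weight in transfers.items()}

# Three concurrent transfers on a 10 Gbps link, one of them high priority (weight 2):
active = {"cms-production": 2.0, "atlas-analysis": 1.0, "ligo-bulk": 1.0}
print(fair_share(10.0, active))  # {'cms-production': 5.0, 'atlas-analysis': 2.5, 'ligo-bulk': 2.5}

# On each monitoring cycle, transfers reported as completed or stalled are removed
# from `active`, and the remaining shares are recomputed and pushed to the end hosts.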

A crucial requirement for developing the PLaNetS intelligent infrastructure is the ability to allow high-level applications or service layers to understand effectively the current state of the underlying (network) infrastructure. Having a global view of the entire system will lead to optimized decision making and throughput.  Another feature of the PLaNetS system will be graceful degradation (or enhancement) of performance in the presence of hardware failures, congestion, or newly available resources. The goal of the proposed global services will be to provide users and administrators with automated decision making to support a variety of connections with best-effort or guaranteed service, including pricing mechanisms that will allow the administrator to charge differential pricing (effectively prioritizing different applications in cases of contention for the underlying routes and resources).  The system will also provide feedback on current measures of quality and performance, so that applications developed to be network-aware can perform suitable fine tuning. With the above goals in mind, our proposed PLaNetS global services have the following features:

·      Queues for tasks (transfers) of different lengths and levels of priority;

·      A best-effort (lowest common denominator) service for transfers that do not interface to the network management services, with a specified (typically small) share of the bandwidth;

·      Higher levels of service via a “path” construction service, where the “paths” can have one of a few different priority levels as well as varying sizes (bandwidths);

·      A subsystem that prioritizes tasks based on their bandwidth requirements, e.g., short tasks may execute in real time while large tasks run in the background. Note that end-to-end paths could be constructed in a real sense (layer 1 “lightpaths”) or virtually (MPLS tunnels augmented by QoS attributes), or in combinations thereof. A minimal sketch of these queue and service classes follows this list.
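
As a concrete, deliberately simplified illustration of these features, the Python sketch below models the queue classes and the task "director" introduced in Section C.1. The class names, thresholds and scheduling rules are assumptions made for this example, not a specification of the PLaNetS services.

from dataclasses import dataclass, field
from enum import Enum
import heapq

class ServiceClass(Enum):
    BEST_EFFORT = 0   # transfers that do not talk to the management services
    QUEUED = 1        # scheduled bulk transfers of varying priority
    REALTIME = 2      # high-priority foreground / real-time flows

@dataclass(order=True)
class TransferTask:
    priority: int                                        # smaller value = served first within a class
    size_tb: float = field(compare=False)
    service_class: ServiceClass = field(compare=False)
    needs_path: bool = field(compare=False, default=False)

class TaskDirector:
    """Toy 'director' that partitions work among the service classes and decides
    which tasks justify constructing a dedicated (real or virtual) path.
    The thresholds and shares are illustrative placeholders."""
    def __init__(self, best_effort_share=0.1, path_threshold_tb=1.0):
        self.best_effort_share = best_effort_share       # small fixed share of the bandwidth
        self.path_threshold_tb = path_threshold_tb
        self.queues = {c: [] for c in ServiceClass}

    def submit(self, task):
        # The most demanding, high-priority tasks get a constructed path.
        if task.service_class is ServiceClass.REALTIME or task.size_tb >= self.path_threshold_tb:
            task.needs_path = True
        heapq.heappush(self.queues[task.service_class], task)

    def next_task(self):
        # Serve real-time work first, then queued bulk work, then best effort.
        for c in (ServiceClass.REALTIME, ServiceClass.QUEUED, ServiceClass.BEST_EFFORT):
            if self.queues[c]:
                return heapq.heappop(self.queues[c])
        return None

director = TaskDirector()
director.submit(TransferTask(priority=1, size_tb=1.2, service_class=ServiceClass.QUEUED))
director.submit(TransferTask(priority=0, size_tb=0.01, service_class=ServiceClass.REALTIME))
print(director.next_task())   # the real-time task is served first and is marked needs_path=True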

Figure 3: Interaction between PLaNetS services and (storage) components.

Figure 3 shows an overview of several of the proposed global PLaNetS services in an example scenario; the individual services are described in more detail in the remainder of this section. Shown is a generic PLaNetS-aware File Transfer Service (FTS) which intends to move a 1.2 TB file from Site A to Site B. The FTS primarily interacts with an End-Host Agent (EHA), which is responsible for transparently interacting with the complex array of services necessary to optimize this transfer in the context of all ongoing and scheduled network usage, tracking the progress of the transfer, and dealing with faults or preemption as required. A sample of the types of queries and service interactions is also shown (thin green lines).
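
The following Python sketch walks through the Figure 3 scenario at a very high level. All of the class names, method signatures and decision rules are hypothetical stand-ins invented for illustration; the real services will expose their own interfaces and will sit on top of SRM/GridFTP and the path construction systems described below.

from dataclasses import dataclass

@dataclass
class Grant:
    dedicated: bool         # was a dedicated (real or virtual) path authorized?
    bandwidth_gbps: float   # bandwidth share granted by policy

class PathDiscovery:
    def discover(self, src, dst):
        # In reality this would query topology and monitoring services.
        return ["routed-best-effort", "mpls-tunnel", "lightpath"]

class NetworkRequest:
    def request(self, options, size_tb, priority):
        # Toy policy: large or high-priority transfers get a dedicated path.
        dedicated = "lightpath" in options and (size_tb >= 1.0 or priority >= 2)
        return Grant(dedicated=dedicated, bandwidth_gbps=10.0 if dedicated else 1.0)

class PathConstruction:
    def build(self, grant):
        return {"id": "path-42", "gbps": grant.bandwidth_gbps}
    def teardown(self, path):
        print("released", path["id"])

class EndHostAgent:
    """The single point of contact for applications such as a file transfer service."""
    def __init__(self):
        self.pds, self.nrs, self.pcs = PathDiscovery(), NetworkRequest(), PathConstruction()

    def transfer(self, src, dst, size_tb, priority):
        options = self.pds.discover(src, dst)                  # what exists along the way?
        grant = self.nrs.request(options, size_tb, priority)   # policy check and queuing
        path = self.pcs.build(grant) if grant.dedicated else None
        try:
            print(f"moving {size_tb} TB {src} -> {dst} at {grant.bandwidth_gbps} Gbps")
            # ... the actual data movement would run over SRM/GridFTP here ...
        finally:
            if path:
                self.pcs.teardown(path)   # release the path on completion, fault or preemption

EndHostAgent().transfer("siteA:/store/run1.root", "siteB:/store/", size_tb=1.2, priority=2)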

C.4.1    End-Host Agent

One of the shortcomings of many network monitoring systems is that they neglect to monitor what is arguably one of the most problem-laden sections of the network: the end-systems. We propose to address this within PLaNetS by integrating and evolving the LISA agent (part of the current MonALISA release) into an End-Host Agent (EHA). LISA is already used with VRVS to provide a short list of candidate best servers and then select the best connection based on performance, server load (for load balancing), etc. One of the End-Host Agent’s tasks will be to monitor the end-system, profiling the client's state (CPU, memory, interrupts, disk usage) correlated with the achieved network performance.  This will be critical for quickly diagnosing the correct location of any performance problem within the PLaNetS fabric. As the End-Host Agent develops, it will become the central point of contact for each host’s applications that need to utilize the network.  The EHA will transparently negotiate with the other PLaNetS services to set up and optimize effective use of the network.  In addition, it will actively track network connections and respond to failures, errors and preemption to ensure task completion.
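
As an illustration of the kind of end-host profiling the EHA will perform, the sketch below samples host state together with the achieved network rate using the third-party psutil package. This is not the LISA/MonALISA implementation (which is Java-based); it merely shows the correlation of host metrics with network throughput that the EHA needs in order to distinguish end-system bottlenecks from network problems.

import time
import psutil  # third-party package for host metrics

def sample_host(interval=5.0):
    """One sample of end-host state plus the network rate observed over `interval` seconds."""
    net0 = psutil.net_io_counters()
    cpu = psutil.cpu_percent(interval=interval)   # blocks for `interval` while measuring
    net1 = psutil.net_io_counters()
    disk = psutil.disk_io_counters()
    return {
        "timestamp": time.time(),
        "cpu_percent": cpu,
        "mem_percent": psutil.virtual_memory().percent,
        "disk_io_bytes_total": disk.read_bytes + disk.write_bytes,   # cumulative since boot
        "net_out_gbps": (net1.bytes_sent - net0.bytes_sent) * 8 / interval / 1e9,
        "net_in_gbps": (net1.bytes_recv - net0.bytes_recv) * 8 / interval / 1e9,
    }

# A loaded CPU or saturated disk at low network rate points to an end-system
# bottleneck rather than a wide-area network problem.
print(sample_host(interval=2.0))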

C.4.2    Path Discovery Services

When initiating a network connection between two endpoints we need to understand what possibilities exist.  PLaNetS will rely on a set of Path Discovery Services (PDS) which will provide a comprehensive view of the possibilities that exist in the network. Specifically, the PDS will determine whether options such as dynamic virtual pipes or optical circuits exist, partially or end-to-end, along the path.  Examples of targeted capabilities for resource discovery are: (1) Determine which options exist between two locations in the network. (2) List components in the path that are “manageable”. (3) Given two replicas of a data source, “discover” (in conjunction with monitoring) the estimated bandwidth and reliability of each to a given destination. (4) Locate network resources and services which have agreements with a given VO.
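
A hedged sketch of capability (3), replica selection driven by discovered path options, is given below. The PathOption fields, site names and numbers are invented for illustration and do not reflect an agreed PDS interface.

from dataclasses import dataclass

@dataclass
class PathOption:
    kind: str           # "routed", "mpls-tunnel", "lightpath", ...
    manageable: bool    # can this segment be configured by PLaNetS services?
    est_gbps: float     # bandwidth estimate from historical/active monitoring
    reliability: float  # 0..1, derived from monitoring history

def best_replica(replicas, destination, discover):
    """Pick the replica whose best path to `destination` maximizes estimated
    bandwidth weighted by reliability."""
    def score(site):
        return max(opt.est_gbps * opt.reliability for opt in discover(site, destination))
    return max(replicas, key=score)

def discover(src, dst):
    # Toy discovery table standing in for a real PDS query.
    table = {
        ("siteA", "tier2"): [PathOption("routed", False, 2.0, 0.99),
                             PathOption("lightpath", True, 9.5, 0.95)],
        ("siteB", "tier2"): [PathOption("routed", False, 4.0, 0.97)],
    }
    return table[(src, dst)]

print(best_replica(["siteA", "siteB"], "tier2", discover))   # -> siteA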

C.4.3    Network Request Services

A managed network requires the means to define and enforce policy as well as to allocate and schedule limited resources within the network.  A set of Network Request Services (NRS) needs to be developed, implemented and deployed to reach the PLaNetS network vision. In many cases the PLaNetS collaboration will rely upon external projects (OSG, EGEE, etc.) to deliver appropriate core services which can be adopted or adapted for PLaNetS use. Examples of needed capabilities for the NRS are: (1) Negotiation, classification and queuing of requests (assigning service class/priority). (2) Policy Description Language: provide networks (local, regional, national and international) with the means to define their capabilities (prioritized flows, minimized latency, virtual dynamic point-to-point connections, lightpath construction) and specify the conditions under which users or applications can utilize those capabilities. (3) Policy Implementation and Enforcement: given an appropriate policy description language, we need to implement a means of enforcement integrating AAA and monitoring information with the resultant end-to-end set of policies along the path(s). (4) Co-scheduling of requests: network request services, modulated by policy and scheduled use, must be able to coordinate with other scheduled resources such as storage and access to compute cycles.  To achieve these capabilities, we will coordinate and work within any future technical group devoted to OSG resource management and towards the development of eventual OSG Resource Request Services.
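
To indicate what a machine-readable policy description and the associated request classification (capabilities 1 and 2) might look like, the sketch below uses a simple JSON-style rule set. The attribute names, the example network and the matching logic are assumptions for illustration only, not a proposed standard.

import json

policy = {
    "network": "ExampleRegionalNet",
    "capabilities": ["prioritized-flows", "mpls-tunnel", "lightpath"],
    "rules": [
        # Who may use which capability, and under what limits.
        {"vo": "cms", "role": "production", "capability": "lightpath",
         "max_gbps": 10, "max_hours": 12},
        {"vo": "cms", "role": "analysis", "capability": "prioritized-flows",
         "max_gbps": 2, "max_hours": 4},
        {"vo": "*", "role": "*", "capability": "best-effort",
         "max_gbps": 1, "max_hours": None},
    ],
}

def classify(request, rules):
    """Assign the request a service class/priority (capability 1) by returning the
    first rule that authorizes it, falling back to the last (best-effort) rule."""
    for r in rules:
        vo_ok = r["vo"] in (request["vo"], "*")
        role_ok = r["role"] in (request["role"], "*")
        if vo_ok and role_ok and request["gbps"] <= r["max_gbps"]:
            return r
    return rules[-1]

request = {"vo": "cms", "role": "analysis", "gbps": 2}
print(json.dumps(classify(request, policy["rules"]), indent=2))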

C.4.4    Network Path Services

Part of our effort will be devoted to Network Path Services, broadly categorized into two areas: construction and management. Network Path (Virtual or Real) “Construction” Services (PCS) are required once a “path” resource is discovered and allocated.  Routers and switches may need to have their configurations dynamically altered to enable the requested path, and optical control planes must be implemented to allow the highest degree of interoperability between various optical networks. For many of these services we hope to adopt the work of projects such as TeraPaths [a14], Lambda Station [a10], OSCARS [u19], Dragon [u6], Cheetah [u3] and HOPI [u13] to meet PLaNetS needs. The second area of focus will be Network Path Management Services (NPMS), which will: (1) provide real-time status, including time-to-completion estimation; (2) provide fault handling, including soft or partial faults; (3) provide redirection and preemption notification. Additional path information may also be available, including information summarized from historical monitoring as well as policy and management information; examples are round-trip times, recent bandwidth usage, packet loss statistics, reliability measures, and current and future scheduling information.
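
As a small example of NPMS capability (1), the sketch below estimates time-to-completion for an in-progress transfer from its recently monitored throughput; a rate of zero would instead be flagged as a soft fault. The function and its inputs are illustrative assumptions, not a defined NPMS interface.

def eta_seconds(bytes_total, bytes_done, recent_gbps):
    """Estimate the remaining transfer time from the recently observed rate."""
    if recent_gbps <= 0:
        return float("inf")   # stalled: the NPMS would flag a soft fault instead
    remaining_bits = (bytes_total - bytes_done) * 8
    return remaining_bits / (recent_gbps * 1e9)

# A 1.2 TB transfer with 0.4 TB done, currently sustaining 8 Gbps: ~13 minutes remain.
print(eta_seconds(1.2e12, 0.4e12, 8.0) / 60)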

PLaNetS will rely upon UltraLight for monitoring services that will be utilized by some of the global services previously described.

We believe that PLaNetS’ deployment and success, with the key features described above, will be crucial for the scientific success of the global scientific collaborations we work in. This is particularly true during peak periods of scientific opportunity: when a new accelerator such as the LHC first makes a new energy range available for exploration; when the accelerator luminosity begins to rise rapidly after the early days; when an upgrade to the accelerator or detector gives the physicists unprecedented capability to measure and identify new physics signals above the backgrounds. In response to the renewed opportunities for discoveries, and the competitive pressure among experiments, physicists' demands on the resources tend to rise rapidly, and the potential for substantial oversubscription of resources and bottlenecks throughout the system increases.

In the extreme, the system may undergo a “meltdown” as a result of an inability to cope with the resource-demand conflicts, and to exert reasonable policies for network usage and data placement across the entire system. By construction, the potential for such a meltdown will be maximal during times when rapid turnaround in analyzing and understanding the physics results is most crucial.

C.4.5    Relationship to Open Science Grid

The Open Science Grid program of work identifies five specific areas in which work is needed to extend the capabilities of the OSG in order to meet the challenges its science drivers face within the next 3-5 years. To meet these challenges, the OSG expects to work closely with external projects like PLaNetS on specifying requirements and interfaces, testing and integrating middleware components, and coordinating deployment at participating OSG sites. Advanced network services are one of these five areas of work, and one that aligns particularly well with the goals of PLaNetS; we expect to contribute to this area of the OSG program of work. The OSG is committed to SRM and GridFTP at the transfer management layer of its software stack, and expects to extend their capabilities as part of the focus area “Data Storage Access and Management”. PLaNetS is committed to building upon these lower-level protocols. PLaNetS will engage with OSG, and will be sufficiently well aligned that the total effort requested in the separate OSG proposal to NSF and DOE is reduced by one FTE, owing to synergies between PLaNetS and OSG and to work provided by PLaNetS to OSG.

Integration with OSG AAA Services, Tools and I/O devices: An integral component of enhancing the network throughput for the PLaNetS project is an authorization component that works in conjunction with the resource discovery component. Currently the OSG software stack provides authorization decisions to Globus using VOMS [u27], PRIMA[a13], and GUMS [a7] for authentication, authorization, and identity mapping. The MonALISA project provides the resource monitoring capabilities that can be used for resource discovery.

Currently, however, the PRIMA authorization module provides an interface only to the Globus Toolkit, rendering an authorization decision after a job has been submitted. We intend to extend the work of the VO Privilege project to provide a query interface that exposes PRIMA's authorization decisions, and to expand the scope of its supported policy attributes.  Some areas of development: (1) Authentication/Authorization: extend VOMS, PRIMA and GUMS to incorporate any network-specific additions as needed. (2) Accounting for the utilization of network resources: bandwidth, prioritized flow bandwidth, and virtual or real paths (size × duration), organized by User, Institution, Role, Working Group or VO.
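
As an illustration of item (2), the sketch below shows a possible network-usage accounting record and a roll-up of "size × duration" (here Gbps-hours) by VO or institution. The field names and example values are assumptions for this illustration, not an adopted OSG/PLaNetS accounting schema.

from collections import defaultdict

records = [
    {"user": "alice", "inst": "Caltech", "role": "production", "vo": "cms",
     "resource": "lightpath", "gbps": 10, "hours": 3.0},
    {"user": "bob", "inst": "Michigan", "role": "analysis", "vo": "atlas",
     "resource": "prioritized-flow", "gbps": 2, "hours": 6.0},
]

def usage_by(records, key):
    """Sum 'size x duration' (Gbps-hours) grouped by the chosen attribute."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["gbps"] * r["hours"]
    return dict(totals)

print(usage_by(records, "vo"))    # {'cms': 30.0, 'atlas': 12.0}
print(usage_by(records, "inst"))  # {'Caltech': 30.0, 'Michigan': 12.0}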

C.5     Project Planning

C.5.1    Project Management

Leadership: The PI, Co-PIs and other senior personnel have extensive experience with the management and successful execution of national-level scientific, networking and Grid projects, including H. Newman (chair of the Standing Committee on Inter-Regional Connectivity of the International Committee on Future Accelerators [u12], chair of the US-CMS collaboration, PI of the Particle Physics Data Grid), P. Avery (Director of the multi-disciplinary GriPhyN and iVDGL Grid projects), S. McKee (Co-lead of the OSG Networking Technical Group, lead of US ATLAS Networking), R. Cavanaugh (Project Manager of UltraLight) and J. Bunn (UltraLight management, TeraGrid/Caltech Science Portals Co-PI). Their leadership in these projects also provides unique opportunities to exploit those projects' considerable personnel and IT resources for PLaNetS’ benefit.

Management: The management team will consist of the PI as Director, the Co-PIs, and the Project and Technical Coordinators. The Project and Technical Coordinators will be appointed by the PI from the project personnel. The management team will manage the project, as a Steering Group, with input from the following bodies: (1) International Advisory Committee, international experts from the research, academic and vendor communities, which will review project progress and advise the management team; (2) Applications Group, representing end users and applications served by PLaNetS, which will report regularly to the management team.

C.5.2    Team Structure

PLaNetS research, development, testing and deployment will be carried out by multi-institutional Project Teams, composed of physicists, computer scientists and network specialists from universities and national laboratories, including those of our international partners.  Each team (working group) will have two co-Leaders who coordinate activities and report to PLaNetS management.  In addition, to maintain tight integration with the experiments driving this research, the teams will coordinate their activities and carry out joint projects and exercises with the Grid projects, LHC collaborations and VLBI collaboration. 

C.5.2.1    Working Group: Global Services

The set of services to be developed and/or deployed in PLaNetS, described in detail above, will be the focus of this group. Our milestones below give the details of the timelines for each of these services.  In many cases there is already existing work that the PLaNetS collaboration would like to integrate and deploy, but would not need to develop.  OSG and Global Grid Forum (GGF) technical groups [u11] and related activities will provide a basis for creating a PLaNetS managed network (e.g., the “Monitoring and Information Services” and “Security” technical groups and the “Accounting”, “Blueprint”, “Deployment”, “Integration” and “Operation” activities), but incorporating a managed network may engender new requirements. Our intent is to extend existing or planned OSG or GGF tools, wherever feasible, to meet our needs.

The development of a complete management system for the targeted networks will require tens of person-years of effort.  Initially we plan to deploy a straw-man implementation of our network-management paradigm focusing on a small number of critical links and networks.  While simplified, this first implementation is a major step beyond the "anarchic, unstructured" view of networks that exists now. It is driven by the need to manage limited network resources such as LHCNet, which currently has two 10 Gbps links; ESnet will be in a similar situation, in that the amount of bandwidth allocable to a single task at "high priority" will be limited.  We will need to scale the largest allowable flow allocations in the infrastructure proportionally to the maximum available bandwidth to ensure a responsive and fair system.

C.5.2.2    Working Group: Integrated Physics Applications

The proposed PLaNetS environment offers unparalleled accessibility to huge, distributed data stores. We plan to expose this capability to the LHC community by developing distributed services that directly interface with existing (and future) ATLAS and CMS Data Management Systems, including: Dataset Bookkeeping Services, Dataset Location Services, Data Placement and Transfer Services (e.g. PhEDEx [a24]), Local File Catalogues, Data Access (e.g. POOL [a25]), and the LCG/OSG Storage Element (e.g. SRM/dCache [a4]).  LHC data management will happen across many levels, from global dataset discovery to local dataset access and storage.  The PLaNetS-LHC integration activities will primarily focus on the triad relationship among the PLaNetS Global Services, the OSG Storage Element, and the ATLAS and CMS Data Placement and Transfer Services.

PLaNetS plans to implement a new set of global end-to-end managed services, building on the ongoing and rapidly advancing work in MonALISA and OSG. This approach allows applications and higher-level service layers to be made aware of advanced behaviors as well as options available within the system and to provide the required interfaces to grid middleware services. This enables a new class of pro-active and reactive applications that can dynamically handle unexpected system behaviors, like congestion or hardware failures, and allow for dynamic responses to changes in the system setup, such as when new network paths or modes become available. These new functionalities will enhance the global system resilience to malfunctions and will allow optimization of resource usage, enhancing the overall throughput, and enabling the effective implementation of policies.

C.5.2.3    Working Group: Network Operations and Service Integration

PLaNetS will rely upon the UltraLight project’s efforts in network engineering and operations to deliver the needed “testbed” for PLaNetS development and testing. This working group will focus on deploying and integrating the PLaNetS services into UltraLight operations.  As UltraLight phases out, this group will be responsible for network operations for the PLaNetS infrastructure inherited from UltraLight.

UltraLight has deployed a “lambda-based” transcontinental backbone interconnecting major HEP laboratories and institutes in the US. Main nodes of the network have been established at strategic locations such as MANLAN[1] (New York) and StarLight[2] (Chicago) to interconnect with national and international partners. As shown in Figure 4, the network can take advantage of HOPI [u13], NLR[3] and USNet [u25] resources and connects some of the major international HEP centers in Asia, South America and Europe.

The “lambda-based” network is designed to provide on-demand dedicated bi-directional data paths between UltraLight sites. Today, the control plane is still manually configured by the engineering team for each request, but a more sophisticated, automated and distributed provisioning system based on the MonALISA software will progressively be deployed.

Over the past three years, the networking team has acquired recognized experience and know-how in operating and managing high-bandwidth, high-delay networks (refer to the ACM/IEEE and Internet2 Awards). With engineers based in Europe and in the US, the small networking team offers 24 hours a day, 7 days a week support and guarantees high availability of the network.

Figure 4: UltraLight Network (January 2006)

C.5.2.4    Working Group: Open Science Grid Activities

This group will ensure that the PLaNetS software is properly interfaced to the OSG software stack, subject to a specification and plan agreed upon between OSG and PLaNetS.  In detail, one FTE in this proposal will be dedicated to working with OSG on the needs and specifications, and then the integration and support, of software and tools in this area. This person will be made available to OSG as a point of contact for OSG/PLaNetS collaboration and will serve as the leader of this working group.

C.5.2.5    Working Group: Education and Outreach

One-week tutorial workshops will be offered annually in early June to physics and computer science students. These tutorials will stress cooperative group interactive exercises to build teamwork that emulates professional science collaborations, while providing training in the PLaNetS toolkit. Combined teams of physics and computer science students from several institutions will be formed to pursue the tutorial exercises. Leaders of the workshops will be the co-PIs and senior personnel of the participating institutions.

Following the workshops, the teams will continue working on PLaNetS research projects for the remainder of the summer. This will provide an unprecedented opportunity for students to work internationally and be involved in scientific, cultural, and educational exchanges on a global basis, enabled by the proposed experimental infrastructure. They will be integrated into the core research and application integration activities at the participating universities and will be among the first users of the PLaNetS testbed, thus providing a valuable resource for the project while gaining a memorable experience.

The professional mini-workshops will be offered in conjunction with international gatherings, including scientific meetings and collaboration meetings. Participants will be exposed to the overall scope of the PLaNetS design as well as implementation details. The materials will be drawn from the student workshops, pared down to match a practicing scientist's needs. Materials will also be made available online for around-the-clock access.

C.5.3    Milestones and Schedule of Work

Our proposed schedule of work is divided into 3 phases. The next sub-sections provide an outline of the activities in these phases, ordered by workgroup, based on the PLaNetS architecture described in the previous section.

C.5.3.1    Phase-1 Milestones (12 months): R&D, initial deployment and testing

·         PLaNetS Global Services: We will develop and deploy the initial end-host agent (EHA) by extending the existing LISA agent used in VRVS/MonALISA to include monitoring and tuning of the host.  Initial path discovery services focused upon network managed resources will be prototyped and deployed along the UltraLight-PLaNetS testbed.  Network request services will be developed to handle negotiation, classification and queuing of requests.  Existing monitoring and measurement services in UltraLight, MonALISA and IEPM will be integrated with higher level network request and path construction services.  A prototype end-to-end path construction service will be developed and deployed.

·         Integrated Physics Applications: We will work within the ATLAS and CMS Collaborations to agree upon and implement the necessary APIs for interfacing the PLaNetS Global Services with their respective Data Management Systems. Initial, but stable, PLaNetS services will be tested within the ATLAS and CMS Data Management Systems in a case-by-case and need-based fashion; more speculative, but more powerful, services will be tested in an opportunistic fashion. 

·         Network Engineering and Operation: Existing path construction services from Terapaths, Lambda Station and OSCARs will be deployed and integrated into the UltraLight-PLaNetS testbed and coordinated with the prototype end-to-end path construction service. 

·         OSG Activities:  We will work within OSG to agree upon and implement the necessary APIs for interfacing the PLaNetS Global Services within the OSG software stack. In addition, according to a mutually agreed schedule, PLaNetS will deliver to OSG components which open up the PRIMA authorization component to external queries. PLaNetS services deemed suitable for production use will be integrated into the OSG software distribution.

·         Education and Outreach: We will organize and host the first PLaNetS summer tutorial workshop for students. Organization will begin in January, with the workshop running in early June. The workshop will reflect the current status of the PLaNetS implementation. Recruitment will include students at PLaNetS institutions and targeted HEP and LIGO partners with networking responsibilities. Presentations will be given at a minimum of two collaboration meetings and one international scientific meeting, detailing the scope and status of the project.

C.5.3.2    Phase-2 Milestones (24 months): Scale to advanced LIGO and LHC turn-on

·         PLaNetS Global Services: Services will be hardened and more broadly deployed in coordination with partners such as ESnet, Internet2 and others.  The EHA will be augmented to interact with path discovery, path construction and network request services as well as with other end-host agents.  Path discovery services will be augmented to provide “recommendations” and extended to include VO and service-level agreement (SLA) capabilities.  Path construction services will be updated from partner projects and extended to provide a richer policy-based authorization capability, including integration with accounting services, authorization enforcement components and network-related attribute and capability definitions.  These services will also begin to implement fault handling and preemption methodologies.

·         Integrated Physics Applications:  We will work to enable the production-level ATLAS and CMS Data Management Systems with the most critical of the PLaNetS Global Services which are required to support the first LHC physics run, including end-host-agents as well as network path, request, and discovery services.   Prototype policy-based network services will be interfaced and tested with the ATLAS and CMS environments according to a mutually agreed upon schedule, so as not to interfere with production operations.

·         Network Engineering and Operation: Our network testbed will be regularly updated to include the newest PLaNetS services, including capabilities from partner projects in bandwidth-management, monitoring and optical control planes.  One focus will be on fault handling and dynamic path rebuilding.

·         OSG Activities:  Integrate the PRIMA authorization component with network services, and expand the scope of authorization decisions and attributes.  Continue to provide and package production ready services and APIs into the OSG software distribution in close coordination with OSG.

·         Education and Outreach:  We will organize and host the second and third PLaNetS summer workshops for students. Organization will occur in January, with the workshops running in early June. The second workshop will target graduate student practitioners, as the implementation will be scaled up. The third workshop will target undergraduate students and provide them with exposure to state-of-the-art science. Presentations and mini-workshops will be given at collaboration meetings and international scientific meetings to enable scientists to implement PLaNetS services. Scope articles will be prepared and circulated through collaboration and science newsletters. Tutorials and manuals will be available on the PLaNetS website.

C.5.3.3    Phase-3 Milestones (24 months): Scale to full-luminosity LHC and LSST

·         PLaNetS Global Services: Production-level, scalable services will be the focus.  The end-host agent will be improved to incorporate additional automated host and network-related tuning capability and to hide the underlying network and service complexity from the user or application. Path discovery services will be enabled to discover all managed components along arbitrary end-to-end paths, modulated by VO and role-based authorizations in force.  Network request services will be able to coordinate with existing grid AAA infrastructures to classify, schedule or preempt network flows and will include any needed policy implementation and enforcement engines.  Network path services will provide fault handling, preemption and reclassification capabilities.  All services will provide accounting interfaces.

·         Integrated Physics Applications:  As the LHC moves to full-scale operations, we will work with the LHC Collaborations to refine the PLaNetS policy-based network services to support priority-based management decisions in the face of emerging network oversubscription by large-scale, diverse LHC data analysis groups.

·         Network Engineering and Operation: Management of a dynamic data transport infrastructure will be the focus.

·         OSG Activities: Implement changes as necessary to improve robustness and efficiency.  Coordinate with OSG to implement new production ready services.

·         Education and Outreach:  We will organize and host the fourth and fifth PLaNetS summer workshops for students. Organization will occur in January, with the workshops running in early June. The last workshop will target graduate student practitioners, as the complete implementation will be operational. Updated presentations and mini-workshops will be given at collaboration meetings and international scientific meetings for practitioners. Scope articles will be prepared and circulated through collaboration and science newsletters. Tutorials and manuals will be updated and remain available on the PLaNetS website.

C.6     Broad Impact

PLaNetS, through its groundbreaking level of network capability, the scope of its international testbed and partnerships, the unique nature of the real-time systems aimed at data intensive science (building on the systems that are beginning to be deployed now in UltraLight), will provide vital input for NSF’s GENI initiative by the time it begins in 2009. PLaNetS’ end-to-end managed network paradigm, its ability to field real-time autonomous network systems on a global scale while isolating the hard technical and policy issues, and its long-term development program driven by the ongoing mission to serve a growing international community in support of their science, as well as research and education, will help shape the worldview of networks over the next few years. PLaNetS will thus influence GENI’s concept and design, especially as it relates to large-scale information usage by the scientific community.

PLaNetS will also serve to inform and train both the current and the next generation of scientists through its extensive education and outreach efforts. Annual intensive summer tutorial workshops, coupled with follow-up research activities, will provide students (including students from traditionally under-represented groups) with a unique opportunity to engage and participate in cutting-edge science. Presentations and mini-workshops will engage current practitioners in the PLaNetS collaborations and the broader scientific community and provide them with knowledge of and access to PLaNetS services. Articles in journals and newsletters will inform scientists and draw them to the project’s services. Our website will provide thorough documentation for those implementing the PLaNetS network services.



[1] Manhattan Landing (MAN LAN) is a high performance exchange point in New York City to facilitate peering among U.S. and international research and education networks.

[2] StarLight is a 1GigE and 10GigE switch/router facility for high-performance access to participating networks, and a true optical switching facility for wavelengths.

[3] National LambdaRail (NLR) is a major initiative of U.S. research universities and private sector technology companies to provide a national-scale infrastructure for research and experimentation in networking technologies and applications.