PASTA – The LHC Technology Tracking Team for Processors, Memory, Architectures, Storage and Tapes – Run II

Working Group (d):

Storage Management Systems


Status report – 05 July 1999 – Version 1.4


WG (d) members:

I.Augustin, J.P.Baud, R.Többicke, P.Vande Vyvre




        The PASTA WG (d) has investigated the area of (distributed) file systems, network storage and mass storage.

        The investigations have been limited to products or projects that are relevant to computing for the LHC era in terms of capacity and performance. These products or projects have been selected on the basis of HEP requirements. We have excluded object-oriented database management systems from our study because this technology is already being investigated by the RD45 and MONARC projects, and because the file system remains the basic storage paradigm in the computing industry. Local file systems are not part of this study because they are integrated into operating systems and because the emergence of 64-bit file systems will cover the needs of the LHC era.

        The first section lists some HEP requirements for distributed file systems. The second section describes the traditional distributed file systems based on the client/server paradigm. The third section describes the more recent developments in the area of file systems with the emergence of network attached storage and storage area networks.

        The fourth section includes a summary of the HEP requirements for mass storage systems as expressed by a document of the Eurostore project. The fifth section is dedicated to the mass storage and hierarchical storage management systems. It includes the status of the relevant standards and a list of products.

        The reference section includes web pointers to most of the companies, consortia or products mentioned in this report. These references are indicated in the text by "[ ]". The two appendices list the characteristics of the commercial products investigated in the sections on Distributed File Systems and Mass Storage Systems.

Distributed file systems

HEP requirements for distributed file systems

        The concept of the distributed file system has modified the way people share and distribute information. It has now become a required component of the overall experiment’s data model. This is still evolving with the access allowed through the web. However, the distributed file system has inherent limitations which make it impractical, at least today, for large data transfers. The limits come from several factors: transfer cost, transfer speed, total storage capacity, etc.

        In summary, the distributed file system seems an ideal tool for the very wide sharing of limited amounts of information such as home directories, documents or physics data at the final stage of the analysis, whereas raw data or DSTs require a file system with better performance, even at the cost of less convenient data access. The production stage of the experimental data processing will probably be executed in a closed environment with limited access.

        The requirements of the HEP community for file systems storing moderate amounts of data are similar to the requirements of other large communities such as academic, industrial or commercial organisations. These file systems must be able to support home directories and limited file exchange in a large and geographically distributed community. They must be open, location transparent and well protected by an access control system. Some of these features are well, and sometimes better, supported by the web, at least for read-only information.

Distributed file systems technologies and products

        Three technologies of distributed file systems exist today:

    • Traditional distributed file system: client/server based file system communicating with a general purpose network and network protocol (LAN and TCP/IP)

    • Network Attached Storage (NAS): integrated storage system attached to a messaging network that uses common communications protocols. In general, a NAS includes a processor, an OS or kernel and processes file I/O protocols such as NFS or IPI-3

    • Storage Area Network (SAN): networking technology that supports the attachment of storage devices on a shared access network. SAN products process block I/O protocols such as SCSI.

        This report is limited to the file system aspects of these categories. The connectivity aspects are treated by other PASTA WGs. Most of the products with a large installed base fall into the first category. SAN is an old concept (similar to the VAX/VMS cluster interconnect) that has recently been revisited. The NAS and SAN technologies are attracting a lot of interest, and several projects and developments are under way.

        The most widely used commercial products are AFS [AFS, TRANSARC], DFS [DFS], NFS [NFS], Microsoft Windows 2000 [W2000] and Novell Netware [Novell]. A detailed list of characteristics can be found in Appendix A.


        The future of this type of product in the PC world depends largely on the file system of W2000, which is now in Beta testing. Some of the key features of this product are known: support for distributed and large storage subsystems, use of industry standards such as TCP/IP or DFS, and support for sites on local and wide area networks [W2000]. However, it is not yet clear whether the PC products (from Microsoft or Novell) can scale to match the needs of our community.

        Furthermore, the HEP environment is still dominated by the Unix operating systems for all the activities that are specific to physics (data acquisition, processing and analysis). This has recently been reinforced by the quick adoption of Linux by our community. Unless a radical change happens, it seems unlikely that Microsoft or Novell products will be the core of the physics data information system for LHC. AFS has been and is still used extensively in the academic community. The emerging DFS system has slowed down AFS development but has not been able to impose itself. There is today no obvious successor to AFS, but the web now constitutes a good alternative for some of the needs currently covered by AFS.

        This issue will have to be investigated actively in the near future, taking into account the potential of storage area networks. They will probably influence our future architecture of distributed file systems in the local area.

Storage Area Network based File Systems

        The simplicity of the distributed file system interface has facilitated collaboration between dispersed groups and has modified the way people work. The limited performance of server-based distributed file systems is acceptable for wide-area networks. It becomes a problem for local-area networks and more demanding applications.

        The dramatic increase in performance of local-area networks and of switching technologies has made faster and more scalable networks possible. The same performance shift is desirable for storage. Some device attachments available for a few years now, such as HiPPI, Fibre Channel or IBM’s SSA, allow for better performance, scalability and sharing. Two different classes of devices can be connected to these shared networks:

    • Storage devices connected to a general purpose local area network: the storage is able to understand and execute IP requests transmitted through a standard local-area network;

    • Storage devices connected to a dedicated storage-area network: the storage device receives standard SCSI commands from another media such as Fibre Channel.

        However, although the hardware has been available for several years, server-less SANs are not yet available: storage device sharing is not yet available at the application level. The difficulties of developing and marketing this technology are twofold. First, it requires splitting the functionality of the storage device driver between the software driver and the hardware device. This implies a modification of operating-system kernels. Second, the storage market is completely open for the two most used storage attachment standards: IDE and SCSI. Any modification of an existing standard, or creation of a new one, will be a long and heavy process. The issue is further complicated by the possibility of sharing storage devices between machines running different operating systems.

        Several projects are investigating these issues and some products are being developed to realise server-less shared file systems. Here is a list of some of them:

    • CDNA [CDNA] of DataDirect and distributed by Storage Tek
    • GFS (Global File System) [GFS]
    • NAStore [NAStore]
    • PFS (Parallel File System)
    • The SUN Store X project based on the Java technology [Store X]
    • The distributed file systems for Linux [Coda].

        Two consortia are also driving the efforts in this emerging field. First, the Storage Networking Industry Association (SNIA) [SNIA] was founded by companies from the computing industry (IBM, Compaq, Intel, etc.), the storage industry (Strategic Research Corporation, Crossroad Systems, Legato Systems, Seagate Software, Storage Technology Corporation) and the microelectronics industry (Symbios Logic), and now counts 98 members. SNIA's goal is to promote storage networking technology and solutions and to ensure that storage networks become efficient, complete, and trusted solutions across the IT community.

        Second, the Fibre Alliance [Fibre Alliance] has been formed by 12 companies (Ancor, EMC, Emulex, HP) to develop and implement standard methods for managing heterogeneous Fibre Channel-based SANs (networks of systems, connectivity equipment and computer servers). The Fibre Alliance has submitted its Management Information Base (MIB) to the Internet Engineering Task Force (IETF) and has requested that the IETF consider the MIB as the basis of SAN management standards.


        The underlying technology is understood and affordable. A high-performance, reliable and portable data-sharing system would bring many benefits. Its adoption will require agreeing on new standards and modifying operating systems. Despite these difficulties, it will probably become available before the LHC start-up. In HEP, its applicability is much wider than distributed file systems: it would have a big impact on all operations involving large data transfers, such as central data recording or production data processing. This technology should therefore be taken into account in future LHC computing plans.

Mass Storage Systems

HEP requirements for Mass Storage System

        The user requirements for Mass Storage Systems (MSS) have been divided into "phases" corresponding to the different tasks of data recording and processing happening in a typical HEP environment. These phases are data recording, data processing, analysis development and analysis production. For each phase, the main computing operations will be described, the resulting requirements will be listed and the applicability of a database will be explained.

Raw Data Recording

        Data recording is more and more executed by a central facility in the computing centre. Central Data Recording (CDR) is becoming the de facto standard. Given the rapid progress of networking technology, it is already guaranteed that this will be possible during the LHC era. This is the option that we have considered here.

        In this simplistic model, the CDR can be described as a set of different data streams that are fed into the storage system continuously, 24 hours a day, for several months a year. Except for operational failures, these streams will not stop for any discernible period. The data will be stored in a disk buffer before being copied to permanent storage as soon as possible.
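As a toy illustration of this buffering model, the following sketch (class, file names and sizes are invented, not any experiment's actual scheme) shows streams filling a bounded disk buffer that is drained to permanent storage as soon as possible:

```python
from collections import deque

class CDRBuffer:
    """Toy model of Central Data Recording: incoming files land in a
    bounded disk buffer and are migrated to permanent (WORM-like)
    storage as soon as possible."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.used_gb = 0.0
        self.pending = deque()   # files awaiting migration to tape
        self.permanent = []      # files safely on permanent storage

    def record(self, name, size_gb):
        if self.used_gb + size_gb > self.capacity_gb:
            self.migrate()       # free buffer space before accepting more data
        self.used_gb += size_gb
        self.pending.append((name, size_gb))

    def migrate(self):
        # Copy buffered files to permanent storage, oldest first.
        while self.pending:
            name, size_gb = self.pending.popleft()
            self.permanent.append(name)
            self.used_gb -= size_gb

buf = CDRBuffer(capacity_gb=100)
for i in range(200):
    buf.record(f"run0001.file{i:04d}", 1.0)
buf.migrate()
print(len(buf.permanent))  # 200
```

The essential property is that migration keeps pace with recording, so the streams never have to stop for a discernible period.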

        Traditionally the "raw data" in the permanent storage (e.g. tapes) is not overwritten during the lifetime of the experiment. It can be considered as a WORM storage class.

        An essential part of the CDR is the monitoring of the performance of the experiment and of the CDR itself. The experiment's performance is usually checked by accessing the raw data that still resides on disk, thus requiring an extended lifetime of this data. The readability of the data on the tapes is also checked by accessing (at least part of) the permanent storage.

        In parallel the independent stream of calibration data has to be stored on disk, with an additional copy to a permanent storage. This data is continuously analysed during the data taking and maybe even after that.

        All these operations are executed by a specialised group of users and can be optimised.


    • Aggregate transfer rate of at least a few gigabytes/second, with tens of streams in the 10-100 MB/s range.
    • Continuous operation (operator attendance < 8 hrs/day)
    • Storage capacity in the 10 - 100 PB range
    • Sequential data access
    • I/O data rate only marginally affected by the software and limited by hardware only
    • Allocation of dedicated resources to selected tasks (e.g. tape drives and hard disks)
    • Access control system to limit the access of the data to certain user groups
    • Raw data tapes in permanent storage can be marked as "READ ONLY"
    • Efficient monitoring of tape drive, disk and network performance and of media reliability
    • Possibility to re-dedicate resources used by other phases as CDR drives (hot spares) dynamically


        In this phase, the files are named in a transparent way. Most of the experiments use a combination of consecutive numbering and time labels. As every file ends up in permanent storage, a simple database is sufficient. An additional need for a database arises from time-dependent parameters, like calibrations and detector configurations. For this, too, a simple database is sufficient.
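A naming scheme of this kind can be sketched as follows; the exact format is invented for illustration and is not any experiment's actual convention:

```python
import itertools
import time

def raw_data_names(run_number, start=0):
    """Generate transparent raw-data file names combining a consecutive
    sequence number with a time label (illustrative format only)."""
    for seq in itertools.count(start):
        label = time.strftime("%Y%m%d%H%M%S", time.gmtime())
        yield f"run{run_number:06d}.seq{seq:05d}.{label}.raw"

names = raw_data_names(run_number=42)
print(next(names))  # e.g. run000042.seq00000.<timestamp>.raw
```

Because the name alone identifies the run, the ordering and the recording time, a simple database mapping names to tape volumes is sufficient for this phase.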

Data Processing

        The raw data in the permanent storage will have to be reprocessed due to improved calibrations and reconstruction software. Therefore the bulk of the data will be read and processed systematically. The resulting data will also end up in permanent storage (e.g. tapes). Every experiment attempts to avoid these reprocessing campaigns, but previous experience shows that one or two of them are likely.

        Same requirements as previous phase


        During the processing, the "data stream" will be broken up: the data will be split into a set of output classes depending on their physics content. The consecutive order will probably be lost and a more sophisticated database will be needed.
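The splitting step can be pictured as a simple demultiplexer; the mapping from trigger tags to output classes below is entirely invented:

```python
from collections import defaultdict

# Illustrative mapping of an event's physics content to an output class.
CLASS_OF = {"dimuon": "B-physics", "jet": "QCD", "electron": "W/Z"}

def split_stream(events):
    """Route each processed event to an output class; events whose tag
    is unknown fall into a catch-all class."""
    streams = defaultdict(list)
    for event in events:
        streams[CLASS_OF.get(event["tag"], "other")].append(event["id"])
    return streams

events = [{"id": 1, "tag": "jet"}, {"id": 2, "tag": "dimuon"},
          {"id": 3, "tag": "jet"}, {"id": 4, "tag": "cosmic"}]
print(dict(split_stream(events)))
# {'QCD': [1, 3], 'B-physics': [2], 'other': [4]}
```

Once events are scattered across such classes, recovering "which file holds event N" is exactly the task that requires a more sophisticated database than the simple sequential one of the recording phase.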

Analysis Development

        In contrast to the previous stages, which are co-ordinated efforts of a few users, this one comprises up to several hundred users who attempt to access data in an uncontrolled way (quotas will be a topic). Each of them probably accesses of the order of ten GB in each job. Here a sophisticated staging system is required. The amount of output is small compared to the amount of input data, but has to be backed up. External participants will probably want to export data (~ nTB/institute, ~100 institutes) to their own computing facilities. This eases the load on the central systems but requires export services.


    • Thousands of simultaneous clients
    • Complex data access including direct access
    • Administrator programmable quota and garbage collection of the staging disk area
    • Administrator configurable resource sharing and user priorities
    • Accounting and quota allocation tools
    • Export procedure for processed and user data
    • Exportable tape format to external institute (not requiring CERN MSS system)
    • Export metadata format should conform to the proposed AIIM C21 standard [AIIM]
    • Import facility for data from the external institutes
    • Interface to OODB
    • User definable backup levels for certain user files
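The requirement for administrator-programmable garbage collection of the staging disk area can be sketched as a watermark policy; the thresholds, the LRU ordering and the file table below are purely illustrative, not a specification of any product:

```python
def garbage_collect(files, capacity, high=0.9, low=0.7):
    """files: dict name -> (size, last_access_time).
    When usage exceeds the high watermark, purge least-recently-accessed
    files until usage drops below the low watermark. Returns purged names."""
    used = sum(size for size, _ in files.values())
    purged = []
    if used <= high * capacity:
        return purged
    # Oldest access time first (LRU order).
    for name in sorted(files, key=lambda n: files[n][1]):
        if used <= low * capacity:
            break
        used -= files[name][0]
        purged.append(name)
    for name in purged:
        del files[name]
    return purged

staged = {"dst1": (30, 100), "dst2": (40, 50), "dst3": (25, 200)}
print(garbage_collect(staged, capacity=100))  # ['dst2']
```

The point is not the particular ordering but that the watermarks and the ranking function should be parameters the administrator can set, per user group if necessary.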


        In addition to the processed data files, the outputs of the various analyses have to be managed by the storage system. It is not determined a priori whether the data produced by the previous stages is maintained by the same storage system; this strongly depends on the actual hardware configuration. Even if the users analyse their data on remote machines, access to the processed data files has to be centrally controlled.

Analysis Production

        In theory, every physics analysis of a user leads to a systematic analysis of a big fraction of the data. In practice, a lot of work is redundant and analyses are only done on data already preselected at the data processing stage. This phase strongly depends on the experiment’s data organisation, physics goals and requirements.

        Same requirements as previous phase


        The organisation of the results of the systematic analysis of all the relevant data is unknown. This strongly depends on the experiment. As these results are the final ones it is likely that the experiment wants to store them centrally.

General Requirements

        There are several requirements that are common to all phases:

    • Not restricted to a single hardware platform
    • Distributed servers
    • Support of available robotics, drives and networks
    • GUI, script and WEB interfaces for administration, operation and monitoring
    • File sizes only limited by the operating system
    • Total number of files (~2^64)
    • Reliable and error free transactions
    • Regular backup of system/metadata

Summary of requirements

        The central data recording and the data processing can be viewed as relatively static environments. In the first case, the most important fact is the uninterrupted storage of data onto a permanent storage medium. The data rate is predictable and quite constant for a long period. The lack of human operators requires a stable and reliable system. Dynamic allocation of resources normally only happens in case of a failure in the system (e.g. tape drive). The data processing is, in principle, quite similar with slightly relaxed requirements on the continuous operation.

        The challenge changes in the analysis phases. The access pattern to the data becomes unpredictable and the need for a sophisticated data and resource management (e.g. disk space, staging) arises. Backup requests, file management and the limited number of resources indicate the need for a full storage system.

Mass Storage Standards

        The IEEE Storage System Standards Working Group (SSSWG) (Project 1244) [IEEE] has developed a Mass Storage Reference Model. Several releases of this model have been issued, the last one being Version 5 in 1994. It is now known as the IEEE Reference Model for Open Storage Systems Interconnection (OSSI - IEEE P1244). This model provides the framework for a series of standards for application and user interfaces to open storage systems:

    • Object Identifier (SOID - 1244.1)
    • Physical Volume Library (PVL - 1244.2)
    • Physical Volume Repository (PVR - 1244.3)
    • Data Mover (MVR - 1244.4)
    • Storage System Management (MGT - 1244.5)
    • Virtual Storage Service (VSS - 1244.6)

        This set of standards is still under discussion and there is today no product that covers the whole of OSSI. Instead, some products have used parts of the standard as a basis for their architecture. The standard has not followed the most recent technical developments, such as SANs. Parts of the standard, such as the data mover, may therefore quickly become obsolete if they are not updated to take these developments into account.

        The evolution of the proposed OSSI standard and its practical influence on the market are also unclear. The standard will probably not be ready before 2000, maybe even later. This leaves very little time to have standard-conforming, or at least standard-influenced, products available by the start of the LHC.

Mass Storage Products

        The most often used commercial products are ADSM/IBM [ADSM], AMASS/Raytheon E-Systems [AMASS], DMF/SGI [DMF], EuroStore [EuroStore], HPSS [HPSS] and SAM-FS/LSC [SAM-FS].

        Their main characteristics are summarised in the Table 1 (a) and (b). A detailed list of features can be found in the Appendix B.

        All these systems except HPSS use real file systems; HPSS uses a name server instead.

        The MSS system delivered by the EuroStore project might result in a commercial product supported by QSW and/or a non-commercial product supported by DESY. These two options are shown in the Table 1 (b).


        The reference standard for mass storage systems is the IEEE Reference Model for Open Storage Systems Interconnection. Its development has been very long and it is evolving very slowly. Several products conform to the model, or parts of it, but none has implemented it completely. In addition, the standard does not specify the interfaces between the various components. Therefore, interoperability between different systems will most probably remain a dream, and conformance to the standard is not a key issue.

        The issue of porting applications to another MSS or another computer platform is therefore critical. Even more dramatic is the issue of moving bulk amounts of data from one system to another: the data recorded by one MSS might not be readable by another one. Given the duration of the LHC project, it is probable that at least one change of MSS will occur during the project's lifetime.

        The market for mass storage systems is relatively limited, and the future of these products and companies often seems unclear. They target needs (backup or dynamic tape space management) that are relatively different from, and more complex than, ours, but some of these products could be, and sometimes are, used for physics data management. The questions of their purchase cost and cost of ownership, their complexity, their portability and their future have to be addressed.

        Given all the previous considerations, various home-made systems are being developed to address the needs of HEP. This is the case for CASTOR [CASTOR] at CERN, ENSTORE [ENSTORE] at Fermilab and the EuroStore MSS [EuroStore] at DESY. These are certainly good alternatives that should be pursued before a decision is taken for the LHC. The questions of their development cost and long-term maintenance should also be addressed.

        Table 1 (a): Comparison of the Mass Storage products.

        Table 1 (b): Comparison of the Mass Storage products.













References

Fibre Alliance

Store X

Appendix A : Distributed file systems products


AFS

    Coding limits

    • 32 bit volume ID -> limited to 2^31 volumes
    • 32 bit vnode ID -> limited to 2^30 files per volume (even vnodes) + 2^30 directories
    • 32 bit fileoffset -> 2^31 bytes per file (2GB)

        Supposedly there is a factor 2 margin in some of those limits (using 32-bit unsigned numbers), obtainable by trivial fixes to the code.

        AFS uses a 1-level lookup to resolve the location of data from the name (volume ID) -> server/partition:

    • every volume has an entry in the volume database -> ~200 bytes/entry. The VLDB itself is managed with 32-bit pointers -> effective limit around 2^23 entries
    • AFS servers are identified by 4-byte IP addresses => < 2^32 servers
    • An AFS server cannot manage more than 255 ‘partitions’ == file systems
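The one-level lookup can be pictured as a single table mapping volume IDs to server/partition pairs; the entries below are invented for illustration (real VLDB entries are ~200 bytes, here only the location information is kept):

```python
# Toy volume location database (VLDB): volume ID -> (server IP, partition).
VLDB = {
    1001: ("137.138.1.10", "/vicepa"),
    1002: ("137.138.1.11", "/vicepb"),
}

def locate(volume_id):
    """Resolve a volume ID to the server and partition holding the data,
    as the AFS cache manager does via the VLDB."""
    return VLDB[volume_id]

print(locate(1001))  # ('137.138.1.10', '/vicepa')
```

The coding limits listed above follow directly from this design: 32-bit volume IDs bound the table's key space, 4-byte IP addresses bound the number of servers, and the 255-partition limit bounds the second element of each entry.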

    Architectural limits

    • File servers and clients use hashing tables to keep track of files. These are probably designed for moderate numbers of files/servers/clients.
    • In case of a system crash the file system must be salvaged before it can be reused, a process which cannot be parallelized beyond the ‘partition’ level. A salvage of a 35 GB partition today takes over an hour (on top of the usual ‘fsck’ performed by the OS).
    • The way the cache manager is designed today makes it impractical to use more than 100000-200000 ‘chunks’ (cached portions of files). Each chunk cannot be greater than 2^30 bytes. Each file uses up an integral number of cache chunks. With today’s default of 64K per chunk, a cache bigger than 6GB is impractical.
    • Call-back scaling issues: file server keeps track of files cached by clients.
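A quick arithmetic check of the quoted cache limit: ~100000 chunks at the default 64K per chunk indeed give a cache of about 6 GB:

```python
# Back-of-the-envelope check of the AFS cache-manager limit quoted above.
chunk_size = 64 * 1024      # bytes: today's default chunk size
max_chunks = 100000         # practical upper bound on cache chunks
cache_bytes = chunk_size * max_chunks
print(cache_bytes / 2**30)  # about 6.1 (GB)
```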

    Management issues

        There are very few built-in administrative tools in AFS. However, similar to its close relative DFS, the system has been designed to be sufficiently flexible to allow operating a service with thousands of users and millions of files on an around-the-clock basis without downtime for reconfigurations. The key concepts here are "mount points", used to build the AFS namespace out of smaller, manageable objects which are referenced by name only, and the "volume" abstraction, which implements the physical handling of data independently of operating system or hardware characteristics.

        The supplied tools only allow use of the offered flexibility at a relatively high, "scriptable" level; a management strategy has to be implemented on top of that. It has thus been possible to automate all basic disk space administration (allocation, balancing) and push the remaining high-level decisions out to the user groups themselves, without the need for extensive training or dedicated administrators.


    Caching

        AFS uses disk caching to improve performance for file accesses. Files are cached in configurable pieces (typically 64KB) which are maintained in a consistent state; modifications to a file are automatically signalled to all clients holding a copy.

    Unix/Windows interoperability

    • Native Windows client


DFS

        DFS is part of the Open Software Foundation’s DCE. However, OSF does not sell any DFS product. DFS for Solaris is available from Transarc, DFS for AIX from IBM. Both are full-featured DFS implementations. HP offers an enhanced version of DFS called "EFS", which also works for Windows/NT. As for Digital (Compaq)’s DCE, it seems that the DFS server part does not provide the full DFS functionality (e.g. ACLs, aggregates).

        The DFS architecture is very similar to AFS; the main difference is that DFS makes use of DCE transport, security and directory functionality instead of providing its own mechanisms.


    Management issues

        The DFS architecture is very similar to AFS, offering similar flexibility to extend the basic system with a suitable management framework.


    Caching

        File caching is implemented similarly to AFS, with extensions allowing for partial file locking.

    Unix/Windows interoperability

    • Native Windows client


NFS

        NFS is a Unix software layer on top of TCP/IP that allows sharing of data volumes between different machines. It is supported on most of the operating systems used in our environment and allows the same volume to be shared between different platforms. NFS is used extensively in our community.

        It is used in the online dataflow of some experiments and has proven to have good performance.

Windows NT 4 (with Microsoft’s Dfs)

        Windows 2000 will replace NT 4; however, W2000 is still largely unknown. Straight extrapolation from NT4 features may not always be valid.

    Architectural limits

    • No more than 23 disks/partitions (drive letters) (??)
    • Path names for data files contain the absolute location of the data; to avoid the resulting operational problems, a big networked NT file server only makes sense with Microsoft’s Dfs.
    • File server keeps track of open files – scaling issues

    Management issues

        Windows networking is mainly a continued development of Microsoft's LAN Manager technology that allows clients to access files on other Windows machines, similar to NFS. In that scheme the path name of a file always contains a reference to the server and share on which it is located.

        Location independence, an essential part of round-the-clock operation, is achieved through a "Dfs" tree, a special "share" that translates path names to absolute references. With a growing number of "shares" this tree threatens, in the absence of efficient caching, to become a performance bottleneck and, since it cannot be replicated, a single point of failure.
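The translation performed by the Dfs tree can be pictured as a single prefix table; the server and share names below are invented, and a real Dfs root would hold many such links:

```python
# Toy Dfs tree: location-independent path prefixes -> \\server\share targets.
DFS_TREE = {
    r"\dfs\users": r"\\ntsrv01\users$",
    r"\dfs\data": r"\\ntsrv02\data$",
}

def resolve(path):
    """Translate a location-independent path into an absolute reference,
    as the Dfs root share does for its clients."""
    for prefix, target in DFS_TREE.items():
        if path.lower().startswith(prefix.lower()):
            return target + path[len(prefix):]
    raise FileNotFoundError(path)

print(resolve(r"\dfs\users\alice\doc.txt"))  # \\ntsrv01\users$\alice\doc.txt
```

Because every open must pass through this one table, and the table cannot be replicated, both the bottleneck and the single-point-of-failure concerns noted above follow directly.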



    Caching

        Windows does not support any caching other than operating-system buffers. Connections to file servers are stateful; the file server keeps track of open files and ensures locking. This has implications: server downtime will affect even "inactive" clients, e.g. those with open applications.

    Unix/Windows interoperability

    • Unix connectivity: NFS
    • Linux/Solaris: Samba


Novell Netware

    Architectural limits

    • "Novell Storage Services allow 8 terabyte file sizes, holds up to 8 trillion files in an NSS volume, and allows you to mount up to 255 NSS volumes on a single server"

    Management issues

Might partially be addressed by Novell Directory Services (NDS).

    Unix/Windows interoperability

    • Unix connectivity: NetWare/NFS

Appendix B : Mass Storage products


    ADSM/IBM

    • Automatic migration/recall
    • When threshold is reached, the selection of files to be migrated is based on size and time last accessed
    • No separation between control and data
    • Storage capacity: no architectural limit (100s of TB per server achieved)
    • File size limit: 2**63
    • No software striping done by ADSM itself
    • Supported on AIX, HP/UX, Sun Solaris and Windows/NT
    • Supported drives: DAT, Exabytes, AIT, DLT, IBM3570, IBM3590, STK Redwood
    • Supported robots: IBM3494, STK9710, STK9714, STK9740, STK9310(PowderHorn)
    • ADSM responsibility transferred from IBM SSD to TIVOLI
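Threshold-triggered migration of the kind listed above (selection based on size and time of last access) can be sketched as follows; the size-times-age ranking and all parameters are invented for illustration and are not ADSM's actual policy:

```python
def select_for_migration(files, used, capacity, threshold=0.8, target=0.6):
    """files: list of (name, size_gb, days_since_access).
    When disk usage exceeds the threshold occupancy, return the names of
    files to migrate until usage would drop below the target occupancy."""
    if used < threshold * capacity:
        return []
    # Prefer large files that have not been accessed for a long time.
    ranked = sorted(files, key=lambda f: f[1] * f[2], reverse=True)
    chosen = []
    for name, size, _ in ranked:
        if used <= target * capacity:
            break
        chosen.append(name)
        used -= size
    return chosen

files = [("a.dat", 10, 100), ("b.dat", 50, 1), ("c.dat", 30, 50)]
print(select_for_migration(files, used=95, capacity=100))
# ['c.dat', 'a.dat']
```

Migrating large, long-unreferenced files first frees the most space while minimising the chance of an immediate recall; the same pattern (with different ranking functions) recurs in most of the HSM products listed in this appendix.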

    AMASS/Raytheon E-Systems

    • Centralised server: FileServ
    • Distributed client: DataMgr
    • AMASS Index (no stub): there is an automatic backup of the Index and a fully synchronised journal
    • Migration types: explicit, periodic, demand (threshold)
    • File selection based on age, file size, … (files and directories can be locked against migration)
    • File contents readable during reload
    • Reloads directly from any level in the hierarchy
    • File replication (max, 4 copies) with automatic fail over
    • Library volumes can be exported
    • Complies with IEEE Mass Storage Reference Model
    • No architectural limit on file system size
    • Maximum file size limited by OS only
    • Motif interface for administration
    • Supported platforms: HP HP-UX 10.10, IBM AIX 4.2.1, SGI Irix 6, Sun Solaris 2.5


    DMF/SGI

    • Automatic and explicit migration/recall
    • Redundant copies
    • Volume merging (to reclaim space on tapes when files are deleted)
    • DMAPI interface (Data Migration API): this is a standard proposed by the Data Management Interface Group (DMIG)
    • Maximum number of files: 2**64 (tested with 20 million files per file system)
    • Maximum user volume of data today: 300 TB
    • Maximum user throughput observed today: 2.5TB/day
    • File size limited only by OS (9TB on SGI today)


    SAM-FS/LSC

    • Automatic and explicit migration/recall
    • Migration policies based on file size, name, age, frequency of use…
    • Migration can be done at regular intervals or when a threshold is reached
    • Multi-threaded: high performance I/O, parallel migrate/recall
    • Software disk striping
    • Fast access to first bytes
    • File grouping (file families)
    • Redundant copies (up to 4)
    • Unlimited number of file systems
    • Unlimited number of files (sites with more than 40 million files)
    • File size limit: 2**64
    • Unlimited number of tapes
    • Java based GUI interface
    • Access to data migrated in a proprietary format (using the Migration Toolkit)
    • Standard Tar tape format
    • HSM service extended to remote servers (using SAM Remote)
    • This product is also sold by STK under the name ASM


    HPSS

    • High Performance Storage System Product Offering
    • Depends on DCE and Encina (Transarc)
    • Automatic migration/recall
    • Migration policy based on last access date
    • Migration is done at regular intervals
    • Multi-threaded: high performance I/O, parallel migrate/recall
    • Software disk striping
    • Redundant copies (up to 4)
    • Maximum 2**64 files
    • File size limit: 2**64
    • GUI interface using Sammi
    • Proprietary tape format


    OpenVault

    • Open media manager for UNIX and Windows/NT
    • Developed by SGI and becoming a de facto standard
    • IEEE 1244 Standard for Media Management Systems
    • Components: Abstract Library Interface, Abstract Drive Interface, Library Control Program, Drive Control Program, Media Library Manager (server), Administrative and Client API
    • Single instance of a multi-threaded MLM
    • OpenVault is used by DMF
    • Legato has done the port to Solaris