Scientific Computing – Data Services Group

The Data Services Group runs datastores for storing, archiving, preserving, analysing, and backing up scientific data, with a nominal capacity of well over 100 petabytes. Most of the data comes from the Large Hadron Collider; climate modelling is currently the second-largest source by volume, and data from the Science and Technology Facilities Council’s (STFC) own facilities is also growing in volume.

As many of us are scientists, we also take part in projects and other research into high-end data management. The aims are to increase knowledge and capability in data management supporting research globally, to improve the services we run by applying that research, and to increase the economic and social impact of our data by providing expertise and facilities for using open data.

The data services include:

tape-backed storage, with one, two, or three copies on tape as required – the most paranoid users have one copy in a tape robot, one in a fire safe, and one off site. Most of the tapestore capacity is based on CASTOR, the same storage system run by CERN, but we also run DMF and our own in-house data service
database services, mainly based on Oracle and MySQL: we run 13 Oracle RAC databases in production over about 38 nodes, serving over 10,000 calls per second on average
preservation services – we are one of the first science users of Preservica Enterprise edition, which we run for long-term preservation of science data from ISIS; the service is also available to other science customers.

Our datastores provide a range of interfaces to enable data to be deposited and read back:

storage resource manager (SRM) serves the LHC and GridPP in particular, and other global grid communities, driving data with GridFTP
xroot is also used to move data internally: together, the xroot, GridFTP, and CASTOR’s native RFIO protocols (and, eventually, WebDAV) routinely deliver up to ten gigabytes per second for the LHC alone, most of it going into our own LCG Tier 1 clusters, with the rest copied to Tier 2s or other Tier 1s across the world. As moving data is critically important to these services, we spend a lot of time fine-tuning transfer parameters to optimise the transfer rate
we run both SRB and iRODS services: we have run SRB for many years, and iRODS since it was developed. iRODS currently serves mainly the EUDAT project, where it is seeing increasing production use
we provide GlobusOnline endpoints, currently to disk-only storage
some of our data is available over the web, via either dedicated web server endpoints or data portals
we have internal interfaces to the datastore such as NFS.
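
One transfer parameter that is commonly tuned for wide-area TCP transfers such as GridFTP is the buffer size, which is usually sized from the bandwidth-delay product of the path. The following is a minimal sketch of that arithmetic only, not a description of our production settings: the link speed, round-trip time, and stream count are assumed values for illustration, and the function names are hypothetical.

```python
def bandwidth_delay_product_bytes(link_gbit_per_s: float, rtt_ms: float) -> float:
    """Return the number of bytes that must be in flight to keep the link full."""
    bits_per_second = link_gbit_per_s * 1e9
    rtt_seconds = rtt_ms / 1000.0
    return bits_per_second * rtt_seconds / 8.0


def per_stream_buffer_bytes(link_gbit_per_s: float, rtt_ms: float, streams: int) -> float:
    """Split the bandwidth-delay product evenly across parallel TCP streams."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    return bandwidth_delay_product_bytes(link_gbit_per_s, rtt_ms) / streams


if __name__ == "__main__":
    # Assumed example path: 10 Gbit/s with a 20 ms round-trip time.
    bdp = bandwidth_delay_product_bytes(10, 20)
    per_stream = per_stream_buffer_bytes(10, 20, 4)
    print(f"bandwidth-delay product: {bdp / 1e6:.1f} MB")
    print(f"buffer per stream with 4 streams: {per_stream / 1e6:.2f} MB")
```

Under these assumed values, roughly 25 MB must be in flight to fill the path, so a single stream with a small default buffer would leave most of the link idle. This is why buffer size and the number of parallel streams are usually tuned together.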

Expertise

The group’s expertise lies in providing high-end scientific data services to support research, and in the research that underpins new data services. We are often involved in projects, where we provide expertise in high-end data management and data security. The group’s expertise includes:

big (volume) data for research – working repository, archiving, preservation
high availability services
data security – while data integrity is our main security concern, we have considerable expertise in practical data security, including single sign-on
data for specific areas of research, including high energy physics, astronomy, and fusion
scaling and scalability testing.

Projects

EUDAT – delivering a shared data e-infrastructure for a diverse range of user communities.
