Object Storage

Brent Welch

CTO

Object storage is a generic term that describes an abstract data container and related metadata. The data can be an image, a file, a sensor reading, or whatever an application needs to store. The metadata describes the data, including a useful name for that data, where it came from, when it was created, how long it is valid, and whatever else may be important to the application that creates and uses the data object.

Objects are explicitly created and deleted, which is unlike storage on a block device. In a traditional block storage system, a file system or database layer needs to manage block allocation so that writes go to free space and reads can find those writes later. A block storage device doesn’t understand this issue, and will write and read any block you ask it to. An object storage device (OSD), in contrast, requires that you create an object before you write data, and that you read data relative to an object and not the device. When a file system or database is layered over an object storage device, the details of block allocation and space management are delegated to the device.

To put this in perspective, a 500GB disk drive has roughly 1 billion sectors, each 512 bytes. The disk has a single, simple, flat address space for sectors. You can read and write any sector you want within the device address range. In contrast, an object storage device with the same capacity could store millions of variable-sized objects, and the write and read operations to those objects would be byte-granular and byte-aligned. The tedious chore of mapping a data object to the underlying sectors is performed by the device itself rather than by the file system or database. The objects also have attributes, or metadata, that the device stores with each data object. In short, the object storage device converts an undifferentiated set of sectors into a large set of variable-sized data containers that have attributes.
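As a rough check on the sector arithmetic, and to illustrate the bookkeeping a block device leaves to its caller, here is a minimal Python sketch. It assumes decimal gigabytes (10^9 bytes) and 512-byte sectors. The helper name sectors_for_range is hypothetical and exists only for this illustration.

```python
SECTOR_BYTES = 512
CAPACITY_BYTES = 500 * 10**9  # assumption: decimal gigabytes

# Number of addressable sectors on the drive.
total_sectors = CAPACITY_BYTES // SECTOR_BYTES
print(total_sectors)  # 976562500, i.e. roughly 1 billion


def sectors_for_range(offset, length):
    """Return (first_sector, sector_count) covering a byte range.

    This is the translation a file system or database must do itself
    when it sits directly on a block device.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // SECTOR_BYTES
    last = (offset + length - 1) // SECTOR_BYTES
    return first, last - first + 1


# A 100-byte write at byte offset 1000 straddles sectors 1 and 2.
print(sectors_for_range(1000, 100))  # (1, 2)
```

With an object storage device, the caller would simply name the object and the byte range, and the device would do this translation internally.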

The idea of object storage dates back to the Network Attached Secure Disk (NASD) work done by Panasas founder and chief scientist, Dr. Garth Gibson, in the 1990s. In fact, many storage systems are built around a data container abstraction. The NASD work led to a standardization effort to create a command set for SCSI that can be used for objects. In 2005, the OSD-1 command set was ratified, and it was updated to OSD-2 in 2008. The OSD command set for SCSI defines commands for object create, delete, read, write, set attributes, and get attributes. There are partitions to help manage objects for different systems that share a device. OSD-2 added support for snapshots and clones.
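The six operations listed above can be pictured with a small in-memory model. The following Python sketch is illustrative only: the class name ToyOSD and its method signatures are hypothetical, and it does not reproduce the SCSI OSD command encoding, partitions, or security model.

```python
import itertools


class ToyOSD:
    """In-memory stand-in for an object storage device (illustration only)."""

    def __init__(self):
        self._data = {}                 # object id -> bytearray
        self._attrs = {}                # object id -> attribute dict
        self._ids = itertools.count(1)  # the device hands out object ids

    def create(self):
        oid = next(self._ids)
        self._data[oid] = bytearray()
        self._attrs[oid] = {}
        return oid

    def delete(self, oid):
        del self._data[oid]
        del self._attrs[oid]

    def write(self, oid, offset, payload):
        buf = self._data[oid]  # KeyError if the object was never created
        end = offset + len(payload)
        if end > len(buf):
            buf.extend(b"\x00" * (end - len(buf)))  # grow the object
        buf[offset:end] = payload

    def read(self, oid, offset, length):
        return bytes(self._data[oid][offset:offset + length])

    def set_attr(self, oid, name, value):
        self._attrs[oid][name] = value

    def get_attr(self, oid, name):
        return self._attrs[oid][name]


osd = ToyOSD()
oid = osd.create()                    # objects must exist before any write
osd.write(oid, 3, b"hello")           # byte-aligned write at offset 3
osd.set_attr(oid, "source", "sensor-7")
print(osd.read(oid, 0, 8))            # b'\x00\x00\x00hello'
```

Note that the caller never sees sectors: reads and writes are relative to an object, and space allocation stays inside the device model.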

The difference between a “data object” and a “file” is subtle. In simple applications there is one data object per file. In more complex (and realistic) scenarios, there can be multiple objects associated with one file, and these objects may be stored on different object storage devices for performance and redundancy purposes. Files also have semantics that may not be directly supported by the underlying object storage, and are added by a file service that is layered over the object storage.

Object storage is used in conjunction with a metadata service that organizes objects so they are useful to applications. For example, a web site that stores images will have a database that tracks photos stored by a particular user, and records the object identifiers for each photo. As a different example, a POSIX file system can organize objects with a hierarchical system of directories and pathnames, with standard user and group ownership and access control semantics expected by a POSIX application.
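The photo web site example can be sketched as a thin index layered over a flat object store. This is a minimal Python illustration under stated assumptions: the PhotoIndex class, its methods, and the use of a plain dict as the object store are all hypothetical, and a real service would use a database and a real object storage back end.

```python
import uuid


class PhotoIndex:
    """Toy metadata service: maps (user, photo name) to object identifiers."""

    def __init__(self, object_store):
        self._objects = object_store  # object id -> bytes (stand-in store)
        self._by_user = {}            # user -> {photo name -> object id}

    def add_photo(self, user, name, data):
        oid = uuid.uuid4().hex        # opaque object identifier
        self._objects[oid] = data
        self._by_user.setdefault(user, {})[name] = oid
        return oid

    def list_photos(self, user):
        return sorted(self._by_user.get(user, {}))

    def fetch(self, user, name):
        return self._objects[self._by_user[user][name]]


index = PhotoIndex(object_store={})
index.add_photo("alice", "beach.jpg", b"...image bytes...")
index.add_photo("alice", "alps.jpg", b"...more bytes...")
print(index.list_photos("alice"))  # ['alps.jpg', 'beach.jpg']
```

The object store itself knows nothing about users or names; all of the organization that makes objects useful to the application lives in the metadata layer.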

The Panasas file system, PanFS, is a high-performance parallel file system built from a collection of Object Storage Devices (OSDs). In our product, Panasas ActiveStor, the OSD is a storage blade that has two high-capacity SATA drives, an SSD, an x86 processor, memory, network interfaces, and a software stack. Eleven of these blades are housed in a 4U chassis, and large numbers of blades (more than 1,000) can be aggregated together to create a very large, very high-performance file system.

Web Object Storage provides the same data container abstraction, but uses the HTTP web protocol (i.e., a REST API) to access the data objects. The universality of data objects is evident in their parallel invention: once using the SCSI interface for HPC file systems, and once using the HTTP interface for web-based services. In practice, the data objects are very similar in both worlds: a data object is a variable-sized sequence of bytes with metadata and an access control protocol.

Web objects are very simple: they are written in their entirety with a single PUT operation. A nice property of this interface is that the client that creates the object can compute an MD5 checksum over the data, and store that checksum as metadata along with the data. When the object is read back later, the checksum can be re-computed and compared with the metadata to ensure that the data is still valid. Obviously, a drawback of this model is that in-place updates are not allowed, so this simple storage container would be unsuitable for a database or a POSIX file system. Objects can have other attributes such as retention times or expiration times such that they are automatically deleted after they expire.
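The checksum-on-write, verify-on-read pattern described above can be sketched in a few lines of Python using the standard hashlib module. The functions put_object and get_object and the dict-based store are hypothetical stand-ins for an HTTP PUT and GET against a real web object service; no particular vendor's API is implied.

```python
import hashlib


def put_object(store, key, data):
    """Write the whole object in one step and record its MD5 as metadata."""
    store[key] = {
        "data": bytes(data),
        "md5": hashlib.md5(data).hexdigest(),
    }


def get_object(store, key):
    """Read the object back, re-computing the checksum before returning it."""
    obj = store[key]
    if hashlib.md5(obj["data"]).hexdigest() != obj["md5"]:
        raise ValueError(f"checksum mismatch for object {key!r}")
    return obj["data"]


store = {}
put_object(store, "reading-001", b"temperature=21.5")
print(get_object(store, "reading-001"))  # b'temperature=21.5'

# Simulate silent corruption of the stored bytes.
store["reading-001"]["data"] = b"temperature=91.5"
try:
    get_object(store, "reading-001")
except ValueError as err:
    print("detected:", err)
```

Because the object is always written in its entirety, a single checksum over the full payload is enough; an interface that allowed in-place updates would need to recompute or partition the checksum on every partial write.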

Objects can be replicated for redundancy, or applications can use erasure codes to store their data in multiple objects with a more space-efficient scheme. The PanFS file system uses multiple objects to store each file. File data is striped across objects, and erasure codes are used to compute redundant data for that file, which is stored in additional objects. The advantage of erasure codes over replication is space efficiency. For example, a RAID 5 encoding with 8 data components and 1 parity component protects against one failure with a 12.5% capacity overhead. In contrast, a simple mirror protects against one failure with a 100% capacity overhead. Many web object systems use triplication to protect against two failures, which has a 200% capacity overhead. RAID 6 and higher levels of redundancy, by comparison, protect against multiple failures at a far smaller cost: 2 extra components to protect 8 is a 25% overhead.
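The overhead figures above follow from a single ratio: redundant components divided by data components. The Python sketch below computes them and also shows, for the single-parity case only, how a lost component can be rebuilt with XOR. The function names are hypothetical, and this is not a description of how PanFS implements its erasure codes; multi-failure codes such as RAID 6 need more than plain XOR and are not shown.

```python
from functools import reduce


def overhead(data_components, redundant_components):
    """Capacity overhead as a fraction of the data stored."""
    return redundant_components / data_components


def xor_parity(chunks):
    """Byte-wise XOR of equal-length chunks (single-parity, RAID 5 style)."""
    if len({len(c) for c in chunks}) != 1:
        raise ValueError("all chunks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


print(f"8+1 parity:   {overhead(8, 1):.1%}")  # 12.5%
print(f"mirror:       {overhead(1, 1):.0%}")  # 100%
print(f"triplication: {overhead(1, 2):.0%}")  # 200%
print(f"8+2 parity:   {overhead(8, 2):.0%}")  # 25%

# Stripe a small "file" across three data objects plus one parity object.
data_objects = [b"AAAA", b"BBBB", b"CCCC"]
parity_object = xor_parity(data_objects)

# Lose the middle object, then rebuild it from the survivors and the parity.
survivors = [data_objects[0], data_objects[2], parity_object]
rebuilt = xor_parity(survivors)
print(rebuilt == data_objects[1])  # True
```

The rebuild works because XOR is its own inverse: combining the parity with every surviving data chunk cancels them out and leaves only the missing one.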

Web objects are often associated with other services related to virtual machine management. For example, Amazon Web Services (AWS) provides a Simple Storage Service (S3) that can be used from its Elastic Compute Cloud (EC2) service. Data stored in S3 can be cheaply and, presumably, efficiently accessed from jobs running in the EC2 environment. One important issue that remains to be resolved is compatibility between the various web object interfaces. Applications written to use S3, Microsoft Windows Azure, or OpenStack may have similar storage requirements, but the syntax and details of using the storage API will be different. The SNIA CDMI standard for cloud storage provides a standard way for web-based applications to use object storage. It appears that web storage for objects is still in an early phase and will go through a period of standardization so that applications can be portable across different cloud infrastructures.

For more information on how Panasas implements and leverages object storage, visit our PanFS pages on our website.