Atomizer Pattern

from Riehle et al. (PLoP'96)

Synopsis

Read arbitrarily complex object structures from, and write them to, backends built on different data structures. Store and retrieve objects efficiently across backends such as flat files, relational databases, and RPC buffers.

Context

You want to copy, print, or store an arbitrarily complex object structure.

Forces

In general, objects that the structure contains should be copied into the output, acquaintances (objects it merely refers to) should remain acquaintances of the output, and temporaries, such as caches, should not appear in the output.
Objects may be shared in the structure, i.e. pointed to multiple times. Sharing should be preserved in the output.
The structure may contain loops, e.g. self-reference, which should also be preserved.
The same code should work for multiple backends: memory copies, the screen, files, databases, RPC buffers, etc.

Solution

Objects load or store themselves using an Atomizer object with a stylized interface. The Atomizer converts between the internal format and the backend. The object decides containment vs. acquaintance vs. temporary and the Atomizer handles the rest.

An object is represented as a class ID and a sequence of fields. Each field contains a value and an optional name. Omitting names saves storage but reduces flexibility. A value is one of the following:

Primitives, like integers, floats, and strings.
Embedded objects, i.e. nested sequences of fields.
Object references, for pointing to external objects. They can be global names or embedded Proxy objects.
Object tokens, for referring to previously stored objects.

The Atomizer supports methods for writing primitives, embedded objects, and object references. The object calls these to serialize itself. To write an embedded object, the Atomizer tells the embedded object to serialize itself using the Atomizer, i.e. it is a recursive call. Eventually, every object in the structure will be represented using only primitives. The object can choose to omit temporaries, or the Atomizer can have a method for writing a temporary to be stored at the Atomizer's whim.
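
The writing side described above can be sketched in Python. This is only an illustration under stated assumptions: the names ListAtomizer, write_primitive, write_object, write_reference, write_to, and atomize are hypothetical, and a nested Python list stands in for the backend. The object decides what is contained, what is an acquaintance, and what is a temporary; the Atomizer does the recursion.

```python
class ListAtomizer:
    """Hypothetical write-side Atomizer; the backend is a nested Python list."""

    def __init__(self):
        self.fields = []

    def write_primitive(self, value):
        # Primitives (integers, floats, strings) are stored as-is.
        self.fields.append(value)

    def write_object(self, obj):
        # Embedded object: recurse, letting the object serialize itself
        # into a nested field sequence headed by its class ID.
        nested = ListAtomizer()
        obj.write_to(nested)
        self.fields.append([type(obj).__name__] + nested.fields)

    def write_reference(self, global_name):
        # Acquaintance: store only a name, never the object itself.
        self.fields.append(("ref", global_name))


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def write_to(self, atomizer):
        atomizer.write_primitive(self.x)
        atomizer.write_primitive(self.y)


class Line:
    def __init__(self, start, end, layer_name):
        self.start, self.end = start, end   # contained objects
        self.layer_name = layer_name        # acquaintance, known by global name
        self.cached_length = None           # temporary, deliberately not written

    def write_to(self, atomizer):
        atomizer.write_object(self.start)
        atomizer.write_object(self.end)
        atomizer.write_reference(self.layer_name)


def atomize(obj):
    """Serialize obj and return its record (class ID followed by fields)."""
    top = ListAtomizer()
    top.write_object(obj)
    return top.fields[0]


record = atomize(Line(Point(0, 0), Point(3, 4), "layers/background"))
# record == ["Line", ["Point", 0, 0], ["Point", 3, 4], ("ref", "layers/background")]
```

After the recursion finishes, the record holds only primitives, nested records, and reference names; the cache never reaches the output.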

To resurrect a stored object, the class is instantiated from its ID, then the object initializes itself using the Atomizer. On the reading side, the Atomizer supports only two operations: reading a primitive and reading an object. Everything else, such as resolving tokens, is handled automatically. Fields can be read in order or by name, depending on how the object was stored.
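
A minimal sketch of resurrection, assuming records are nested lists of the form [class ID, field, field, ...] and fields are read in order. The names ListReader, read_primitive, read_object, read_from, resurrect, and REGISTRY are hypothetical. The class is looked up by its ID, instantiated without running its constructor, and then asked to initialize itself from the Atomizer.

```python
class ListReader:
    """Hypothetical read-side Atomizer over a nested-list record."""

    def __init__(self, record, registry):
        self.class_id = record[0]
        self.fields = iter(record[1:])   # fields are consumed in stored order
        self.registry = registry         # maps class ID -> class

    def read_primitive(self):
        return next(self.fields)

    def read_object(self):
        # An embedded object is itself a record; resurrect it recursively.
        return ListReader(next(self.fields), self.registry).resurrect()

    def resurrect(self):
        cls = self.registry[self.class_id]
        obj = cls.__new__(cls)   # instantiate from the ID, skipping __init__
        obj.read_from(self)      # the object initializes itself
        return obj


class Point:
    def read_from(self, atomizer):
        self.x = atomizer.read_primitive()
        self.y = atomizer.read_primitive()


class Line:
    def read_from(self, atomizer):
        self.start = atomizer.read_object()
        self.end = atomizer.read_object()


REGISTRY = {"Point": Point, "Line": Line}

stored = ["Line", ["Point", 0, 0], ["Point", 3, 4]]
line = ListReader(stored, REGISTRY).resurrect()
```

The reader exposes only the two operations named above; which class to build and where each nested record ends are handled inside the Atomizer.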

To handle both sharing and circular references, the Atomizer keeps a table of objects which have already been written. If an object is to be written a second time, a token is written instead, referring back to the first occurrence. On reading, the Atomizer automatically substitutes the previously-read object for the token.
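
The table of already-written objects might look like the following sketch. It assumes nested-list records and tokens encoded as ("token", n), where n is the object's position in order of first occurrence; Writer, Reader, and Node are hypothetical names. The key detail on both sides is that an object is registered before its fields are processed, otherwise a loop would recurse forever.

```python
class Writer:
    def __init__(self):
        self.seen = {}   # id(obj) -> index of the object's first occurrence
        self.out = []    # field list currently being filled

    def write_primitive(self, value):
        self.out.append(value)

    def write_object(self, obj):
        if id(obj) in self.seen:
            # Second or later occurrence: write a token, not the object.
            self.out.append(("token", self.seen[id(obj)]))
            return
        self.seen[id(obj)] = len(self.seen)   # register BEFORE recursing
        parent, self.out = self.out, [type(obj).__name__]
        obj.write_to(self)
        parent.append(self.out)
        self.out = parent


class Reader:
    def __init__(self, record, registry):
        self.registry = registry
        self.objects = []              # objects in order of first occurrence
        self.fields = iter([record])

    def read_primitive(self):
        return next(self.fields)

    def read_object(self):
        field = next(self.fields)
        if isinstance(field, tuple):   # ("token", n): reuse the earlier object
            return self.objects[field[1]]
        cls = self.registry[field[0]]
        obj = cls.__new__(cls)
        self.objects.append(obj)       # register BEFORE initializing
        outer, self.fields = self.fields, iter(field[1:])
        obj.read_from(self)
        self.fields = outer
        return obj


class Node:
    def __init__(self, label):
        self.label = label
        self.first = self.second = self   # starts out as a self-reference

    def write_to(self, a):
        a.write_primitive(self.label)
        a.write_object(self.first)
        a.write_object(self.second)

    def read_from(self, a):
        self.label = a.read_primitive()
        self.first = a.read_object()
        self.second = a.read_object()


a, b = Node("a"), Node("b")
a.first = a.second = b   # b is shared: pointed to twice
b.first = a              # loop back to a; b.second stays a self-reference

w = Writer()
w.write_object(a)
record = w.out[0]
copy = Reader(record, {"Node": Node}).read_object()
```

In the resurrected structure, copy.first and copy.second are the same object, and copy.first.first is copy, so both the sharing and the loop survive the round trip.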

If the eventual reader has access to the class definition, then the object's methods can be omitted. Otherwise, the methods must be sent as fields, using code as the value. See Implementation for more on this issue.

Consequences

Since the Atomizer's interface is abstract, it is easy to add a new backend (representation format).
The Atomizer automatically handles sharing and circular references. However, it does not automatically determine which fields denote containment.
Storing acquaintances requires a persistent naming mechanism, such as that provided by CORBA.
For copying, the Atomizer can simply store to a memory buffer, from which another object reads. In this case, it might be advantageous to include temporaries, since the communication is cheap.
The Atomizer can be used for reflection, e.g. a field inspector or object structure browser. Fields should be named in this case. Temporaries should probably be preserved.
System resources, e.g. file handles, held by the stored object may need to be re-acquired when the object is resurrected.
The Atomizer pattern only makes a physical copy of the object. A shared memory mechanism must be added for actually migrating the object.

Implementation

Object tokens can simply be integers, counting back over the number of fields written since the first occurrence (relative indexing), or they can count from the beginning of the record (absolute indexing).
Using named fields allows partial representation of the object, e.g. a change notification or an incremental backup. The receiver can read just the named fields.
The Atomizer can utilize hints about the serialized object's destination. For example, if the object is being copied to somewhere in the same process, local names (pointers) can be used instead of global names.
Storing methods is difficult in C++. You can use machine code, but this isn't portable. You can use a library name, but this requires a shared filesystem. You can use source code, but this requires compilation on resurrection. The best solution is probably to send intermediate-level code, e.g. Python or Java bytecode.
The Atomizer's methods for reading and writing can be combined. For example, the object could simply call DoInteger on the Atomizer with a pointer to the integer variable to be read or written. The Atomizer can read from the variable or write to the variable, depending on what type of Atomizer it is.
Another way to implement reading is for all objects to support the Atomizer's writing interface. Thus the stored data is written to the memory object the same way the memory object writes to storage. This integrates the Atomizer pattern with the Builder pattern from Design Patterns. Note that copying can now be done without an intermediate buffer at all. This technique is particularly effective with named fields.
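
The combined read/write method mentioned in this list (the DoInteger idea) can be sketched as follows. Python has no pointer to an integer variable, so this sketch passes the owning object and an attribute name instead; that substitution, and the names WritingAtomizer, ReadingAtomizer, do_int, and atomize, are assumptions made for illustration.

```python
class WritingAtomizer:
    def __init__(self):
        self.buffer = []

    def do_int(self, obj, attr):
        # Writing direction: copy the variable into the buffer.
        self.buffer.append(int(getattr(obj, attr)))


class ReadingAtomizer:
    def __init__(self, buffer):
        self.values = iter(buffer)

    def do_int(self, obj, attr):
        # Reading direction: copy the next stored value into the variable.
        setattr(obj, attr, int(next(self.values)))


class Point:
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

    def atomize(self, atomizer):
        # One routine serves both directions, so the field order is
        # stated exactly once and cannot drift between load and store.
        atomizer.do_int(self, "x")
        atomizer.do_int(self, "y")


writer = WritingAtomizer()
Point(3, 4).atomize(writer)

restored = Point()
restored.atomize(ReadingAtomizer(writer.buffer))
```

Whether atomize loads or stores depends only on which kind of Atomizer it receives.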

For more information:

Riehle et al.'s paper (PLoP'96)
The book Object Persistence by Roger Sessions
The CORBA externalization standard
The eXternalization template library

Known Uses

ET++ uses the Atomizer pattern to copy, store, and transmit object data to other applications. Shared libraries are used for the object methods.
The CORBA externalization service uses an Atomizer for object persistence. There is currently no support for storing methods, though the Java language has been proposed for it. The CORBA relationship service is used to automatically determine containment vs. acquaintance.
JavaBeans uses Java's reflective abilities to provide automatic serialization for object data and methods. Data fields can be tagged as temporary via the transient keyword in the class declaration. However, there is no way to denote acquaintance except by writing a custom serialization routine.
XDR, used in Sun RPC, uses the Atomizer pattern to serialize arguments to remote procedures. It uses the combined-method trick.

Thomas Minka

Last modified: Fri Sep 02 17:10:03 GMT 2005