Formatters

A formatter can be made by implementing the IFormatter interface. A completely custom formatter that does not use the pre-made base class Formatter still has to honor the FormatterContext interface to stay compatible with user objects that adhere to the Serializable interface.

The base class Formatter was written as the boilerplate for all of my intents and purposes, so everything below is written from the perspective of inheriting from Formatter.

Basic Anatomy

The basic formatter consists of a subclass of Formatter, one or more FormatterContext objects and a FrameStackContext. The formatter is meant to be used multiple times after instantiation for different operations, and it allows the user to manage default values for the processes it manages. The Processors are meant to be used once and then disposed of, since they are objects that only exist to do the work of serialize() and deserialize(). The FormatterContext and FrameStackContext are managed by the Processor and are created along with it. The Processors and the objects they manage can be used to hold state throughout a processing job. This is safe because they are not designed to be shared between threads or reused.
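The split between a long-lived formatter and single-use processors can be sketched roughly as follows. This is only an illustration of the lifecycle described above, not the library's code: SketchFormatter, SketchProcessor and the sort_keys default are hypothetical names invented for the example.

```python
class SketchProcessor:
    """Hypothetical single-use worker that owns all per-job state."""

    def __init__(self, sort_keys):
        self.sort_keys = sort_keys
        self.visited = 0      # per-job state lives here, not on the formatter
        self._used = False

    def serialize(self, obj):
        if self._used:
            raise RuntimeError("a processor is meant to be used once")
        self._used = True
        return self._walk(obj)

    def _walk(self, obj):
        # Plain recursion over the object hierarchy.
        self.visited += 1
        if isinstance(obj, dict):
            keys = sorted(obj) if self.sort_keys else list(obj)
            return {key: self._walk(obj[key]) for key in keys}
        if isinstance(obj, (list, tuple)):
            return [self._walk(item) for item in obj]
        return obj


class SketchFormatter:
    """Hypothetical reusable front end that only stores user defaults."""

    def __init__(self, sort_keys=False):
        self.sort_keys = sort_keys

    def serialize(self, obj):
        # A fresh processor per job, discarded when the job is done.
        return SketchProcessor(self.sort_keys).serialize(obj)


formatter = SketchFormatter(sort_keys=True)
first = formatter.serialize({"b": 1, "a": [1, 2]})
second = formatter.serialize({"z": 0, "y": 0})
```

Because every job gets its own processor, the formatter itself never holds job state and can be reused freely.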

The design of the Processor is a little bit confusing because it uses recursion to walk through the object hierarchy, but it also uses multiple stacks (in the form of the FrameStackContext and the FormatterContext). This is for two reasons. The first is that the user objects adhering to Serializable need to be exposed to some of the information in the process but not all of it, and the FormatterContext has the responsibility of exposing these features to them. The second is that the custom-managed stacks simply do not have exactly the same timing and organisational needs as the function call stack.

Role of context managers

To push and pop from the contexts mentioned above, they implement the context manager interface. The FormatterContext is typically meant to be called with a “path key” when it is entered. The FormatterContext keeps track of the logical path of a process. By default, a path key is the object that would be passed to __getitem__ to retrieve the current object from its parent (this is not a strict definition). The path is used for referencing objects in different sections of the hierarchy, navigating the hierarchy and retrieving objects by a unique path.
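A context that is called with a path key and then entered could look something like the sketch below. SketchPathContext, walk() and get_by_path() are hypothetical names for illustration only; the real FormatterContext exposes more than a path.

```python
class SketchPathContext:
    """Hypothetical context that tracks the logical path of the current object."""

    def __init__(self):
        self.path = []        # keys from the root down to the current object
        self._pending = []    # keys supplied by __call__ but not yet entered

    def __call__(self, key):
        self._pending.append(key)
        return self

    def __enter__(self):
        self.path.append(self._pending.pop())
        return self

    def __exit__(self, exc_type, exc, tb):
        self.path.pop()
        return False          # never swallow exceptions


def walk(obj, ctx, found):
    """Record every leaf value under the path that leads to it."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        found[tuple(ctx.path)] = obj
        return
    for key, child in items:
        with ctx(key):        # push on enter, pop on exit
            walk(child, ctx, found)


def get_by_path(root, path):
    """Follow a recorded path with __getitem__ to reach the object again."""
    for key in path:
        root = root[key]
    return root


data = {"a": [10, {"b": 20}]}
ctx = SketchPathContext()
found = {}
walk(data, ctx, found)
```

Each recorded path is exactly the sequence of keys that get_by_path() needs to reach the same object from the root.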

Even though an instance of FrameStackContext is a member of the FormatterContext, the FormatterContext does not manage the FrameStackContext. It is there so that user objects can interact with it. The Processor is responsible for using the FrameStackContext like a context manager, and the reference is held on the Processor in the attribute semantics. The stack on the FrameStackContext holds collections of handlers (DeSerializationHandler or SerializationHandler). This encourages handlers and semantics to propagate downstream but not upstream.
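The downstream-only propagation can be pictured with a small stack of frames, where each new frame starts as a copy of its parent and is thrown away on exit. This is a simplified sketch under my own assumptions: SketchFrameStack and its frame() method are hypothetical, and plain callables stand in for the handler classes.

```python
from contextlib import contextmanager


class SketchFrameStack:
    """Hypothetical stack of handler collections, one frame per nesting level."""

    def __init__(self, **base_handlers):
        self._frames = [dict(base_handlers)]

    @property
    def top(self):
        return self._frames[-1]

    @contextmanager
    def frame(self, **extra_handlers):
        # The new frame inherits everything from its parent (downstream).
        self._frames.append({**self.top, **extra_handlers})
        try:
            yield self.top
        finally:
            # Dropping the frame means nothing leaks back to the parent (upstream).
            self._frames.pop()


stack = SketchFrameStack(text=str.upper)
with stack.frame(number=hex) as outer:
    outer_names = sorted(outer)
    with stack.frame(text=str.lower) as inner:
        inner_result = inner["text"]("MiXed")
    restored_result = stack.top["text"]("MiXed")
names_after = sorted(stack.top)
```

The inner frame sees both the inherited and the overriding handler, while the outer frames are untouched once the inner one exits.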

Concessions for references

The implementation of Formatter would be fairly straightforward if it were not for two issues, one affecting serialization and the other deserialization. Both issues involve preserving object references.

Serialization & references

The serialization process has to keep track of the ids of all objects that may be referenced more than once, but it does not know where an object “comes from.” By this I mean it has no knowledge of whether a “dict” object it observes was created in a handler or a to_dict() method (during the serialization process) or whether it is referenced by the original object hierarchy. I will explain why it is highly encouraged to treat these two kinds of object differently. We determine uniqueness by reading object ids with the built-in id() function. The problem is that if we dereference an object after we have cached its id, a newly created object will probably take the id of the object that was just dereferenced, causing incorrect connections between referenced objects. This is mitigated by making sure that all objects whose ids are cached ultimately have the same “lifecycle” as all other cached objects. This is easy enough: we just add them to a set (this can be turned off since it shouldn’t be necessary; see EnforceReferenceLifecycle).
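The id problem and its mitigation can be shown with a small tracker that holds a strong reference to every object whose id it caches. This is a sketch, not the library's implementation: SketchReferenceTracker and its method names are hypothetical, and it keeps the objects in a list rather than a set so that unhashable objects such as dicts can be held directly.

```python
class SketchReferenceTracker:
    """Hypothetical id cache that keeps cached objects alive for the whole job."""

    def __init__(self, enforce_lifecycle=True):
        self.enforce_lifecycle = enforce_lifecycle
        self.seen = {}          # id(obj) -> number of times the object was seen
        self._keep_alive = []   # strong references so ids cannot be recycled

    def check_in(self, obj):
        key = id(obj)
        first_time = key not in self.seen
        self.seen[key] = self.seen.get(key, 0) + 1
        if first_time and self.enforce_lifecycle:
            self._keep_alive.append(obj)
        return key

    def shared_ids(self):
        """Ids that were seen more than once and therefore need preserving."""
        return {key for key, count in self.seen.items() if count > 1}


tracker = SketchReferenceTracker()
shared = {"x": 1}
tracker.check_in(shared)
tracker.check_in(shared)

# Short-lived dicts, like the ones a to_dict() method would hand back.
# Without the keep-alive list each one could be freed right away and a
# later dict could reuse its id, which would look like a shared reference.
for i in range(100):
    tracker.check_in({"tmp": i})
```

With the keep-alive list in place, every cached id is guaranteed to belong to exactly one live object for the duration of the job. That guarantee is paid for in memory, which is the problem the next paragraph picks up.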

Herein lies the next problem. If we add all the objects that are processed to a collection, thus maintaining their lifecycle, then we are pretty much doubling our memory footprint, at least for objects that would otherwise have been dereferenced. Why are objects being dereferenced mid-process anyway? This is because some objects are made by functions, like handlers, that create objects purely to communicate structure to the formatter. If a data structure exists as a Python object but communicates its state as a dict, the dict and the original object are not the same, and the generated dict is only going to live as long as the stack frame that asked for the state object to be created.

If we can differentiate between these communicative objects (henceforth called Temporary objects) and user-referenced objects (like dicts that are directly referenced by user objects), we can avoid caching them. We can also do destructive things to these “temporary” objects, because it can be assumed that once they are given to the formatter they “belong” to the formatter, since they don’t belong to anyone else at that point. We can save even more hassle by overwriting the container objects in place instead of creating yet another copy of the structure. The default SerializationHandler automatically wraps the return value of handlers in Temporary objects; see this link for more info. From the info mentioned in the link we can see that the way to communicate to the formatter that an object is temporary is to wrap it in a Temporary object.
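The difference in treatment can be sketched like this. Everything here is a hypothetical stand-in: SketchTemporary plays the role of a Temporary wrapper, Point is an invented user class, and serialize() and serialize_in_place() are simplified functions, not the methods of the real Serializer.

```python
class SketchTemporary:
    """Hypothetical wrapper marking a container as owned by the formatter."""

    def __init__(self, value):
        self.value = value


class Point:
    """Invented user object that communicates its state as a dict."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_dict(self):
        # The dict exists only to describe structure, so it is wrapped.
        return SketchTemporary({"x": self.x, "y": self.y})


def serialize(obj, cached_ids):
    if isinstance(obj, SketchTemporary):
        # Temporary: no id caching, and the container is overwritten in place.
        return serialize_in_place(obj.value, cached_ids)
    if isinstance(obj, Point):
        cached_ids.add(id(obj))
        return serialize(obj.to_dict(), cached_ids)
    if isinstance(obj, dict):
        # User-referenced container: cache its id and build a new copy.
        cached_ids.add(id(obj))
        return {key: serialize(value, cached_ids) for key, value in obj.items()}
    if isinstance(obj, list):
        cached_ids.add(id(obj))
        return [serialize(value, cached_ids) for value in obj]
    return obj


def serialize_in_place(container, cached_ids):
    if isinstance(container, dict):
        for key in list(container):
            container[key] = serialize(container[key], cached_ids)
    elif isinstance(container, list):
        for index, value in enumerate(container):
            container[index] = serialize(value, cached_ids)
    return container


ids = set()
user_dict = {"p": Point(1, 2)}
copied = serialize(user_dict, ids)
ids_after_user_dict = len(ids)

scratch = {"a": 1}
reused = serialize(SketchTemporary(scratch), ids)
```

The user dict is copied and has its id cached, while the wrapped dict comes back as the very same object and never enters the id cache.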

The Serializer Processor has special methods for dealing with Temporary objects. They are handle_serialize_list_in_place(), handle_serialize_dict_in_place() and handle_temporary(). The latter method is added to the Serializer handler (not to be confused with the check_in_object on non-temporary objects and the in-place logic runs immediately on them. The only other non-standard handler on the Serializer is handle_add_semantics() which is partially explained here. You can look at the __init__ method for the handler setup.

DeSerialization & references

When the deserialization process encounters a preserved reference, it is ideal to have that object already deserialized and sitting in the cache, but for a couple of reasons this may not be the case. The first case is that it simply has not reached the object yet, but if things were just in a different order it could have been prepared already. The second case is that it is not possible for the object to have been prepared already, regardless of the order, because the object that is being referenced is currently being deserialized as a parent of the current object (this is a circular reference). These cases are handled separately by DeSerializer.

In the case of a circular reference, the PreservedReference is simply given to the object, and the object is responsible for sorting it out using the NotifyFinalizedMethodName semantic and/or the FormatterContext's finalize() event handler.

In the case of a non-circular reference, the process “jumps” to the location of the reference, deserializes it, replaces it with a PreservedReference linked to the return key path, and then returns to the return key path and gives it the fully deserialized object. The effect is that once the process reaches the PreservedReference that was left during the jump, it is guaranteed to retrieve the object from the cache and proceed normally. This process, as well as several other conveniences, is accomplished by the DeSerializer having a two-stage handling process. First an object is handled by the handler attribute, then by the secondary_handler.
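Both cases can be illustrated with a toy deserializer. This is a simplified sketch under my own assumptions and not how DeSerializer is written: Ref stands in for PreservedReference, SketchDeserializer is a hypothetical class, dicts are turned into SimpleNamespace objects, and instead of leaving a reference behind at the jump target the sketch simply checks a cache keyed by path before loading anything.

```python
from types import SimpleNamespace


class Ref:
    """Hypothetical placeholder that points at the key path of the real object."""

    def __init__(self, *path):
        self.path = tuple(path)


class SketchDeserializer:
    def __init__(self, root):
        self.root = root
        self.cache = {}            # key path -> fully deserialized object
        self.in_progress = set()   # key paths of objects currently being built

    def run(self):
        return self._load(self.root, ())

    def _raw_at(self, path):
        node = self.root
        for key in path:
            node = node[key]
        return node

    def _load(self, raw, path):
        if path in self.cache:
            # Already built during an earlier jump: reuse it.
            return self.cache[path]
        if isinstance(raw, Ref):
            return self._resolve(raw)
        if isinstance(raw, dict):
            self.in_progress.add(path)
            fields = {key: self._load(value, path + (key,))
                      for key, value in raw.items()}
            self.in_progress.discard(path)
            obj = SimpleNamespace(**fields)
            self.cache[path] = obj
            return obj
        return raw

    def _resolve(self, ref):
        if ref.path in self.cache:
            return self.cache[ref.path]
        if ref.path in self.in_progress:
            # Circular: the target is a parent that is still being built,
            # so the reference itself is handed to the object.
            return ref
        # Non-circular: jump to the target, build it now, then come back.
        return self._load(self._raw_at(ref.path), ref.path)


data = {
    "first": {"friend": Ref("second")},             # forward reference
    "second": {"name": "b", "parent": Ref()},       # reference to the root
}
result = SketchDeserializer(data).run()
```

Here result.first.friend and result.second end up being the same object, while result.second.parent is still a Ref that the object would have to resolve once the root is finalized.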