Ruby Marshalling from A to Z

Jan 26, 2016    #ruby   #marshalling   #serialization   #tlv  

What is Marshalling

Marshalling is a serialization process when you convert an object to a binary string. Ruby has a standard class Marshal that does all the job for serialization and deserialization. To serialize an object, use Marshal.dump, to deserialize - Marshal.load or Marshal.restore.

marshalled = Marshal.dump([1, 2, 'string', Object.new])
# => "\x04\b[\ti\x06i\aI\"\vstring\x06:\x06ETo:\vObject\x00"

Marshal.load(marshalled)
# => [1, 2, "string", #<Object:0x00000002643000>]

This article explains the format of marshalling and shows how to write a pure Ruby marshalling library compatible with the standard Ruby implementation.

The gem

Let’s try to make a pure Ruby gem that is compatible with standard Marshal class.

$ bundle gem pure_ruby_marshal
$ tree pure_ruby_marshal
pure_ruby_marshal
├── bin
│ ├── console
│ └── setup
├── CODE_OF_CONDUCT.md
├── Gemfile
├── lib
│ ├── pure_ruby_marshal
│ │ └── version.rb
│ └── pure_ruby_marshal.rb
├── LICENSE.txt
├── pure_ruby_marshal.gemspec
├── Rakefile
└── README.md

Our main module PureRubyMarshal has the following interface:

module PureRubyMarshal
  extend self

  def dump(object)
    # ...
  end

  def load(data)
    # ...
  end
end

Also I’d like to extract all logic related into reading/writing encoded data to separate classes - ReadBuffer for reading, WriteBuffer for writing.

module PureRubyMarshal
  extend self

  autoload :ReadBuffer,  'pure_ruby_marshal/read_buffer'
  autoload :WriteBuffer, 'pure_ruby_marshal/write_buffer'

  def dump(object)
    WriteBuffer.new(object).write
  end

  def load(data)
    ReadBuffer.new(data).read
  end
end

Reading

As we decided before, ReadBuffer is our class responsible for reading an object from marshalled data. Here is how it should look:

class PureRubyMarshal::ReadBuffer
  attr_reader :data

  def initialize(data)
    @data = data
  end

  def read
    # ...
  end
end

Decompressing

First let’s take a look at Marshal.dump.

marshalled = Marshal.dump([1, 2, 'string', Object.new])
data = marshalled.chars
# => [a long array of characters]

The first two characters represent a version of library that was used for marshalling:

data[0].ord
# => 4
data[1].ord
# => 8

Which represents a current version of Marshal library - 4.8 (link to the source ). This values are constants for any marshalled data and they are stored in Marshal::MAJOR_VERSION and Marshal::MINOR_VERSION:

 :001 > Marshal::MAJOR_VERSION
 => 4
 :002 > Marshal::MINOR_VERSION
 => 8

Our PureRubyMarshal should have same constants:

module PureRubyMarshal
  MAJOR_VERSION = 4
  MINOR_VERSION = 8
end

The rest of the array is actually a marshalled object. So, for now marshalled data looks like ['4.8', some unknown characters]

To read the version of Marshal library we can modify ReadBuffer#initialize:

attr_reader :minor_version, :major_version

def initialize(data)
  @data = data.chars
  @major_version = read_byte
  @minor_version = read_byte
end

def read_char
  data.shift
end

def read_byte
  read_char.ord
end

# In irb:
marshalled = Marshal.dump(123)
read_buffer = PureRubyMarshal::ReadBuffer.new(marshalled)
read_buffer.major_version
 => 4
read_buffer.minor_version
 => 8

Getting objects from the raw data

When Marshal converts your object to a string, it uses very simple rules:

NilClass, TrueClass, FalseClass

ReadBuffer.new(Marshal.dump(nil)).data
 => ["0"]

Character 0 means that encoded object is nil. To handle it, we can write something like:

class PureRubyMarshal::ReadBuffer
  # ...

  def read
    char = read_char
    case char
    when '0' then nil
    else
      raise NotImplementedError, "Unknown object type #{char}"
    end
  end
end

 :001 > ReadBuffer.new(Marshal.dump(nil)).read
 => nil
 :001 > ReadBuffer.new(Marshal.dump(true)).data
 => ["T"]
 :002 > ReadBuffer.new(Marshal.dump(false)).data
 => ["F"]

true converts to T, false converts to F, nothing complex for now.

Let’s extend our case statement:

when 'T' then true
when 'F' then false

Tests

Of course, we can’t develop without tests,

# fixtures.rb

FIXTURES = {
  'nil' => nil,
  'true' => true,
  'false' => false
}

# pure_ruby_marshal_spec.rb

require 'spec_helper'

describe PureRubyMarshal do
  describe '.load' do
    FIXTURES.each do |fixture_name, fixture_value|
      it "loads marshalled #{fixture_name}" do
        result = PureRubyMarshal.load(Marshal.dump(fixture_value))
        expect(result).to eq(fixture_value)
      end
    end
  end
end

# rspec --format documentation

PureRubyMarshal
  .load
    loads marshalled nil
    loads marshalled true
    loads marshalled false

3 examples, 0 failures

Integer

All encoded integers are prepended with an "i" symbol. Added one more when to our case:

when 'i' then read_integer

The next byte defines a strategy of decoding the rest of the data. Then comes a bytes sequence representing a number.

Let’s say that byte = (first_char_code_after_i ^ 128) - 128 (it ranges from -128 .. 128).

There are 5 different cases depending on the byte value:

Let’s write a utility lambda to simplify examples:

get_sequence = lambda do |n|
  buffer = ReadBuffer.new(Marshal.dump(n))
  buffer.read_char
  buffer.data.map(&:ord)
end

It:

The first case is the simplest one:

get_sequence[0]
 => [0]

Zero is actually encoded as zero.

The second case:

get_sequence[7]
 => [12]
get_sequence[8]
 => [13]
get_sequence[120]
 => [125]

In this case result = byte - 5 and it covers small positive numbers.

The third case:

get_sequence[99999]
 => [3, 159, 134, 1]

3 means three numbers representing a single number, they should be merged with binary OR and byte shifting:

(159 << (8*0)) | (134 << (8*1)) | (1 << (8*2))
 => 99999

The fourth and the fifth examples are just like 2nd and 3rd, but for negative numbers.

Let’s implement our read_integer method!

def read_integer
    # c is our first byte
    c = (read_byte ^ 128) - 128

    case c
    when 0
      # 0 means 0
      0
    when (4..127)
      # case for small positive numbers
      c - 5
    when (1..3)
      # c next bytes is our big positive number
      c.
        times.
        map { |i| [i, read_byte] }.
        inject(0) { |result, (i, byte)| result | (byte << (8*i))  }
    when (-128..-6)
      # case for small negative numbers
      c + 5
    when (-5..-1)
      # (-c) next bytes is our number
      (-c).
        times.
        map { |i| [i, read_byte] }.
        inject(-1) do |result, (i, byte)|
          a = ~(0xff << (8*i))
          b = byte << (8*i)
          (result & a) | b
        end
    end
  end

To add tests for integers, we can just add more key-value pairs to our FIXTURES constant.

Complex data types like

and others include numbers into their encoded structure to represent their length.

Symbol

ReadBuffer.new(Marshal.dump(:a_symbol)).data
 => [":", "\r", "a", "_", "s", "y", "m", "b", "o", "l"]

Symbols are encoded using:

# one more "when" statement
when ':' then read_symbol

# and implementation
def read_symbol
  read_integer.times.map { read_char }.join.to_sym
end

String

ReadBuffer.new(Marshal.dump("a string")).data
 => ["\"", "\r", "a", " ", "s", "t", "r", "i", "n", "g"]

Where:

To support loading marshalled string, we can use the following code:

# ...
when '"' then read_string
# ...
def read_string
  read_integer.times.map { read_char }.join
end

Array

ReadBuffer.new(Marshal.dump([1,2,3])).data
 => ["[", "\b", "i", "\x06", "i", "\a", "i", "\b"]

Array items are encoded as separate objects, every item is prepended by its own service character:

# ...
when '[' then read_array
# ...
def read_array
  read_integer.times.map { read }
end

Hash

Hash is encoded as an array of key-value pairs:

ReadBuffer.new(Marshal.dump({ 15 => 5 })).data
 => ["{", "\x06", "i", "\x14", "i", "\n"]
# ...
when '{' then read_hash
# ...
def read_hash
  pairs = read_integer.times.map { [read, read] }
  Hash[pairs]
end

Float

Float is encoded as its string representation

ReadBuffer.new(Marshal.dump(1.5)).data
 => ["f", "\b", "1", ".", "5"]

Floats are encoded using #to_s method:

To get it back, we can use String#to_f method:

# ...
when 'f' then read_float
# ...
def read_float
  read_string.to_f
end

Class

ReadBuffer.new(Marshal.dump(Array)).data
 => ["c", "\n", "A", "r", "r", "a", "y"]

Classes are represented by their names:

After reading the name we can do Object.const_get(const_name) to retrieve the actual class. The only remark here is that when the class does not exist anymore, Marshal re-raises ArgumentError, "undefined class/module #{const_name}" instead of a NameError. Moreover, if the constant returned by Object.const_get is not a class, Marshal raises ArgumentError, "#{const_name} does not refer to a Class". So, the constant must exist and it must be a Class

# ...
when 'c' then read_class
# ...
def marshal_const_get(const_name)
  Object.const_get(const_name)
rescue NameError
  raise ArgumentError, "undefined class/module #{const_name}"
end

def read_class
  const_name = read_string
  klass = marshal_const_get(const_name)
  unless klass.instance_of?(Class)
    raise ArgumentError, "#{const_name} does not refer to a Class"
  end
  klass
end

I have extracted marshal_const_get to a separate method to use it later for reading a Module from the marshalled data.

Module

Modules are similar to Classes, but the “magical” character is "m" instead of "c" (and, of course, messages of exceptions are about modules).

# ...
when 'm' then read_module
# ...
def read_module
  const_name = read_string
  klass = marshal_const_get(const_name)
  unless klass.instance_of?(Module)
    raise ArgumentError, "#{const_name} does not refer to a Module"
  end
  klass
end

Struct

Struct classes are encoded by their class names + their data:

Point = Struct.new(:x, :y)
a = Point.new(3, 7)
ReadBuffer.new(Marshal.dump(a)).data
 => ["S", ":", "\n", "P", "o", "i", "n", "t", "\a", ":", "\x06", "x", "i", "\b", ":", "\x06", "y", "i", "\f"]

This output can be split to:

So, we have a Struct defined with its class name and the Hash containing object’s data

# ...
when 'S' then read_struct
# ...
def read_struct
  klass = marshal_const_get(read) # Point
  attributes = read_hash # { x: 3, y: 7 }
  values = attributes.values_at(*klass.members) # [3, 7]
  klass.new(*values)
end

Why is class name encoded as a Symbol, not a String? See section ‘Symbol link’

Regexp

ReadBuffer.new(Marshal.dump(/a_regexp/)).data
 => ["/", "\r", "a", "_", "r", "e", "g", "e", "x", "p", "\x00"]

Alright, there are:

# ...
when '/' then read_regexp
# ...
def read_regexp
  string = read_string
  kcode = read_byte
  Regexp.new(string, kcode)
end

Abstract object

class Point2
  attr_reader :x, :y

  def initialize(x, y)
    @x, @y = x, y
  end

  def ==(other)
    other.is_a?(Point2) &&
      other.x == self.x &&
      other.y == self.y
  end
end

point = Point2.new(5, 10)
ReadBuffer.new(Marshal.dump(point)).data
 => ["o", ":", "\v", "P", "o", "i", "n", "t", "2", "\a", ":", "\a", "@", "x", "i", "\n", ":", "\a", "@", "y", "i", "\x0F"]
# ...
when 'o' then read_object
# ...
def read_object
  klass = marshal_const_get(read) # Point2
  ivars_data = read_hash # { :@x => 5, :@y = 10 }
  object = klass.allocate # #<Point >
  ivars_data.each do |ivar_name, value|
    object.instance_variable_set(ivar_name, value)
  end
  object
end

User class

User classes are classes inherited from default classes:

class MyArray < Array
end
my_array = MyArray[1,2,3]

ReadBuffer.new(Marshal.dump(my_array)).data
 => ["C", ":", "\f", "M", "y", "A", "r", "r", "a", "y", "[", "\b", "i", "\x06", "i", "\a", "i", "\b"]

Objects of these classes are encoded in the following way:

# ...
when 'C' then read_userclass
# ...
def read_userclass
  klass = marshal_const_get(read)
  data = read
  klass.new(data)
end

Extended object

Sometimes your object is very, very complex. Like an object extended with some modules:

MyModule = Module.new
obj = []
obj.extend(MyModule)

In this case Marshal prepends you object structure with the list of extended modules:

ReadBuffer.new(Marshal.dump(obj)).data
 => ["e", ":", "\r", "M", "y", "M", "o", "d", "u", "l", "e", "[", "\x00"]

Where:

# ...
when 'e' then read_extended_object
# ...
def read_extended_object
  mod = marshal_const_get(read) # MyModule
  object = read # []
  object.extend(mod)
end

I guess this idea was originally created for compressing marshalled output. Consider the following situation: you have a collection of objects of the same class (the simplest example is a result of calling YourActiveRecordModel.all). When you pass this collection to Marshal.dump, it converts objects one by one, writing them to an output stream. But an abstract object is represented by its class name and a hash of instance variables. Symbol links save you from writing the same class.name again and again to the output. Instead, it remembers all symbols that have been written, and when the symbol appears twice in the sequence, it writes its sequence number.

a1 = [:symbol1, :symbol2]
a2 = [:symbol1, :symbol1]

dumped1 = Marshal.dump(a1)
 => "\x04\b[\a:\fsymbol1:\fsymbol2"
dumped1.length
 => 22

dumped2 = Marshal.dump(a2)
 => "\x04\b[\a:\fsymbol1;\x00"
dumped2.length
 => 15

For this example it saves us 31% of the initial size. Let’s see how it works:

ReadBuffer.new(Marshal.dump(a2)).data
 => ["[", "\a", ":", "\f", "s", "y", "m", "b", "o", "l", "1", ";", "\x00"]

So we have an array of two items:

The algorithm for parsing Symbol should be changed a little bit to save all symbols to an internal array.

def initialize
  # ...
  @symbols_cache = []
  # ...
end

# ...

def read_symbol
  symbol = read_integer.times.map { read_char }.join.to_sym # no changes here
  @symbols_cache << symbol # save a symbol
  symbol
end

# ...

when ';' then read_symbol_link

# ...

def read_symbol_link
  @symbols_cache[read_integer]
end

I’ll return to symbol links and my thoughts about how it can be used to compress marshalled output in “Optimizations” section.

Same story here, when you have an object that appears multiple times in your data, Marshal will serialize it only once:

obj1 = Object.new
obj2 = Object.new

a1 = [obj1, obj2]
a2 = [obj1, obj1]

dumped1 = Marshal.dump(a1)
dumped1.length
 => 18

dumped2 = Marshal.dump(a2)
dumped2.length
 => 16

A symbol that indicates a beginning of an object link is "@". However the criteria for dumping an object link instead of an object itself is objects equality (equal?). Here’s the problem: if you dump an array [{}, {}], Marshal will dump both objects without any object links, because these objects are not equal.

Also Marshal does not cache:

Which is mostly correct.

The code for object links is quite big to paste it here, you can find it here

Other cases

To be honest, there are a few more cases, but they are too complex for implementing and pasting it here. I’m not going to cover here:

Writing

If you have read the previous part, I suppose it should be clear for you how to write it yourself :)

Optimizations

Let’s try on the real-world examples. Stuff that usually gets serialized is your data. And I can remember only one example where Marshal is used - model caching.

Imagine I have a model User with the following columns:

Marshal.dump(User.first).length
 => 891

That’s a lot… The easiest solution is to dump only attributes hash:

class User < ActiveRecord::Base
  def marshal_dump(*)
    attributes
  end

  def marshal_load(attributes)
    initialize(attributes)
    @new_record = id.blank?
  end
end

Marshal.load Marshal.dump(User.first)
 => #<User id: 131 ...>
Marshal.dump(User.first).length
 => 205

That’s better, let’s see what else we can improve.

By default ActiveRecord::Base#attributes returns a hash where keys are String. That sounds like it can be optimized by calling attributes.symbolize_keys in marshal_dump to use symbol links when we dump a collection. But that’s not true!

User.all.map(&:attributes).map(&:keys).flatten.map(&:object_id).uniq.length
 => 4 (for 4 columns)

This is an amazing example of optimization!

User::AttrNames.constants
 => [:ATTR_9646, :ATTR_56d61696c6, :ATTR_36275616475646f51647, :ATTR_57074616475646f51647]

User::AttrNames.constants.map do |const_name|
  User::AttrNames.const_get(const_name)
end
 => ["id", "email", "created_at", "updated_at"]

Keys in the attributes hash are from these constants, so they are always the same objects, so Marshal cashes them using object links.

By the way, why does ActiveRecord save attribute names as String? The answer is simple, Symbol has no encoding.

Well, currently our record is represented by:

Let’s see what happens when we serialize AS::TimeWithZone:

Marshal.dump(Time.zone.now).length
 => 120

So, from 205 characters of our record 120 is taken for a time with zone. AS::TimeWithZone has its custom implementation of marshal_load that converts it to [utc, zone, time]. If you always use your application time zone, then you can store it as epoch time:

class ActiveSupport::TimeWithZone
  def marshal_dump(*)
    to_i
  end

  def marshal_load(epoch)
    utc = Time.at(epoch)
    zone = Rails.application.config.time_zone
    local = utc.in_time_zone(zone)
    initialize(utc.utc, ::Time.find_zone(zone), local)
  end
end
Marshal.dump(User.first).length
 => 139
Marshal.dump(User.first(3)).length
 => 261

Personally I’m not sure that the next optimization is a good idea, but you can try:

class User < ActiveRecord::Base
  def self.marshallable_attributes
    @marshallable_attributes ||= %w(
      id
      email
      created_at
      updated_at
    )
  end

  def marshal_dump(*)
    self.class.marshallable_attributes.map do |attribute|
      attributes[attribute]
    end
  end

  def marshal_load(attributes)
    attributes = self.class.marshallable_attributes.zip(attributes)
    attributes = Hash[attributes]
    initialize(attributes)
    @new_record = id.blank?
  end
end

The idea is to store a hard-coded sequence of attribute names and using it serialize only values of attributes. It allows us to not serialize object links of attribute names.

Why can’t we just get attributes.keys.sort? Because you can add one more column that may come to the middle of that array. As a result, your attributes may become shuffled. But probably you can avoid it by changing a cache key right after adding a column.

You can extend this solution to something similar to ActiveModel::Serializer where you have a separate serializer class for every model class.

Let’s see what this optimization gives us:

Marshal.dump(User.first).length
 => 84
Marshal.dump(User.first(3).to_a).length
 => 190

That’s 10 times less then initial size, good job!

Of course, if you have columns with type text there’s nothing to optimize, this is my example and my experience.

What is it for?

I have been working for about 3 weeks on implementing Marshal module for opal . It’s almost compatible with MRI implementation. I believe Marshal is the thing that can bring real isomorphism to opal applications. If you have an object that can be dumped on the server and its class was compiled on the client, you can pass it directly from the server without any serialization on the client/server.

GitHub repository with PureRubyMarshal

Possible Opal implementation of Marshal