Vijithkumar V

Creating a Custom YAML Dumper and Representer Function in Python



Contemplating the Art of YAML Serialization for Python's Complex Data Structures

The complex data that PyYAML finds harder to serialize


Here is the complex data:

data = [{0.627: -47.57142857142857, 0.66: -35.76190476190476, 0.6930000000000001: -40.61904761904761, 0.726: -50.33333333333332, 0.759: -61.66666666666664, 0.792: -71.38095238095235, 0.8250000000000001: -76.23809523809521, 0.8580000000000001: -72.99999999999997, 0.891: -73.19047619047616, 0.924: -72.90476190476188, 0.9570000000000001: -72.33333333333331, 0.99: -71.66666666666664, 1.0230000000000001: -71.09523809523807, 1.056: -70.8095238095238, 1.089: -70.99999999999999, 1.122: -71.47619047619047, 1.155: -70.76190476190474, 1.1880000000000002: -69.33333333333331, 1.221: -67.66666666666664, 1.254: -66.23809523809523, 1.2870000000000001: -65.5238095238095, 1.32: -66.47619047619045, 1.353: -65.76190476190474, 1.3860000000000001: -64.33333333333331, 1.419: -62.66666666666665, 1.452: -61.23809523809522, 1.485: -60.52380952380951, 1.518: -60.99999999999998, 1.5510000000000002: -61.95238095238093, 1.584: -60.5238095238095, 1.617: -57.66666666666665, 1.6500000000000001: -54.33333333333332, 1.683: -51.47619047619046, 1.7160000000000002: -50.04761904761903, 1.749: -50.99999999999999, 1.782: -52.14285714285714, 1.8150000000000002: -50.42857142857143, 1.848: -47.0, 1.881: -43.0}, {5.577: -43.99999999999998, 5.61: -66.99999999999997, 5.643000000000001: -86.71428571428568, 5.676: -96.57142857142853, 5.7090000000000005: -89.99999999999997, 5.742: -89.99999999999997, 5.775: -89.99999999999997, 5.808: -89.99999999999997, 5.841: -89.99999999999997, 5.8740000000000006: -89.99999999999997, 5.907: -89.99999999999997, 5.94: -90.19047619047618, 5.973: -89.9047619047619, 6.006: -89.33333333333331, 6.039000000000001: -88.66666666666666, 6.072: -88.09523809523807, 6.105: -87.8095238095238, 6.138: -88.0, 6.171: -88.0, 6.204000000000001: -88.0, 6.237: -88.0, 6.2700000000000005: -88.0, 6.303: -88.0, 6.336: -88.0, 6.369000000000001: -88.0, 6.402: -88.0, 6.4350000000000005: -88.0, 6.468: -88.0, 6.501: -88.0, 6.534000000000001: -88.0, 6.567: -88.0, 6.6000000000000005: -88.19047619047618, 6.633: 
-87.9047619047619, 6.666: -87.33333333333331, 6.699000000000001: -86.66666666666666, 6.732: -86.09523809523807, 6.765000000000001: -85.8095238095238, 6.798: -86.0, 6.831: -87.8095238095238, 6.864000000000001: -85.09523809523809, 6.897: -79.66666666666666, 6.930000000000001: -73.33333333333331, 6.963: -67.9047619047619, 6.996: -65.19047619047618, 7.029: -66.99999999999999, 7.062: -74.8095238095238, 7.095000000000001: -63.09523809523809}]

This is a list containing two dictionaries, each of which maps float keys to float values.

We need to serialize this complex data by writing it to a YAML file. If you have gone through my previous tutorial on YAML, you have already seen the solution:

import yaml

with open(r"dataSerialized.yaml", "w") as wFile:

    yaml.dump(data, wFile, default_flow_style=False)

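Here, default_flow_style can be set to True or False, depending on the layout you want in the output. If it is set to False, the data is written in block style (one entry per line, with nesting shown by indentation); if it is set to True, the data is written in flow style (inline, using braces and brackets).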
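To make the difference concrete, here is a minimal sketch that dumps the same small, made-up dictionary in both styles. The name sample and its contents are assumptions used only for illustration, and the result is returned as a string instead of being written to a file.

```python
import yaml

# A small, hypothetical structure used only for illustration.
sample = {"name": "demo", "values": [1, 2, 3]}

# Block style: one entry per line, nesting shown by indentation.
block_text = yaml.dump(sample, default_flow_style=False)

# Flow style: inline notation with braces and brackets.
flow_text = yaml.dump(sample, default_flow_style=True)

print(block_text)
print(flow_text)
```

Block style is usually easier to read and to compare between versions for large structures like the one in this article; flow style is more compact.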


Let's look at this complex data a bit more closely.

import yaml

listVar = [{0.627: -47.57142857142857, 0.66: -35.76190476190476, 0.6930000000000001: -40.61904761904761, 0.726: -50.33333333333332, 0.759: -61.66666666666664, 0.792: -71.38095238095235, 0.8250000000000001: -76.23809523809521, 0.8580000000000001: -72.99999999999997, 0.891: -73.19047619047616, 0.924: -72.90476190476188, 0.9570000000000001: -72.33333333333331, 0.99: -71.66666666666664, 1.0230000000000001: -71.09523809523807, 1.056: -70.8095238095238, 1.089: -70.99999999999999, 1.122: -71.47619047619047, 1.155: -70.76190476190474, 1.1880000000000002: -69.33333333333331, 1.221: -67.66666666666664, 1.254: -66.23809523809523, 1.2870000000000001: -65.5238095238095, 1.32: -66.47619047619045, 1.353: -65.76190476190474, 1.3860000000000001: -64.33333333333331, 1.419: -62.66666666666665, 1.452: -61.23809523809522, 1.485: -60.52380952380951, 1.518: -60.99999999999998, 1.5510000000000002: -61.95238095238093, 1.584: -60.5238095238095, 1.617: -57.66666666666665, 1.6500000000000001: -54.33333333333332, 1.683: -51.47619047619046, 1.7160000000000002: -50.04761904761903, 1.749: -50.99999999999999, 1.782: -52.14285714285714, 1.8150000000000002: -50.42857142857143, 1.848: -47.0, 1.881: -43.0}, {5.577: -43.99999999999998, 5.61: -66.99999999999997, 5.643000000000001: -86.71428571428568, 5.676: -96.57142857142853, 5.7090000000000005: -89.99999999999997, 5.742: -89.99999999999997, 5.775: -89.99999999999997, 5.808: -89.99999999999997, 5.841: -89.99999999999997, 5.8740000000000006: -89.99999999999997, 5.907: -89.99999999999997, 5.94: -90.19047619047618, 5.973: -89.9047619047619, 6.006: -89.33333333333331, 6.039000000000001: -88.66666666666666, 6.072: -88.09523809523807, 6.105: -87.8095238095238, 6.138: -88.0, 6.171: -88.0, 6.204000000000001: -88.0, 6.237: -88.0, 6.2700000000000005: -88.0, 6.303: -88.0, 6.336: -88.0, 6.369000000000001: -88.0, 6.402: -88.0, 6.4350000000000005: -88.0, 6.468: -88.0, 6.501: -88.0, 6.534000000000001: -88.0, 6.567: -88.0, 6.6000000000000005: -88.19047619047618, 
6.633: -87.9047619047619, 6.666: -87.33333333333331, 6.699000000000001: -86.66666666666666, 6.732: -86.09523809523807, 6.765000000000001: -85.8095238095238, 6.798: -86.0, 6.831: -87.8095238095238, 6.864000000000001: -85.09523809523809, 6.897: -79.66666666666666, 6.930000000000001: -73.33333333333331, 6.963: -67.9047619047619, 6.996: -65.19047619047618, 7.029: -66.99999999999999, 7.062: -74.8095238095238, 7.095000000000001: -63.09523809523809}]

#Let us check the data type

print(f"\nThis is a {type(listVar)} data type")

dictItem = listVar[0]

#Let us check what is inside the list
print(f"\nThis is a {type(listVar)} data type that contains {type(dictItem)} data type")

numItem = list(dictItem.keys())[0]

valueItem = list(dictItem.values())[0]

print(f"""\nThis is a {type(listVar)} data type that contains {type(dictItem)}
data type and the keys are {type(numItem)} data type""")
print(f"""\nThis is a {type(listVar)} data type that contains {type(dictItem)}
data type and the values are {type(valueItem)} data type""")

It shows the following output.

This is a <class 'list'> data type
This is a <class 'list'> data type that contains <class 'dict'> data type
This is a <class 'list'> data type that contains <class 'dict'> data type and the keys are <class 'float'> data type
This is a <class 'list'> data type that contains <class 'dict'> data type and the values are <class 'float'> data type

Now let’s write the list (the complex data) to a yaml file.

with open(r"D:\website\wixSite\articles\yaml\testYaml.yaml", "w") as wFile:
    yaml.dump(listVar, wFile)

The output is as you see below.

- 0.627: -47.57142857142857
  0.66: -35.76190476190476
  0.6930000000000001: -40.61904761904761
  0.726: -50.33333333333332
  0.759: -61.66666666666664
  0.792: -71.38095238095235
  0.8250000000000001: -76.23809523809521
  0.8580000000000001: -72.99999999999997
  0.891: -73.19047619047616
  0.924: -72.90476190476188
  0.9570000000000001: -72.33333333333331
  0.99: -71.66666666666664
  1.0230000000000001: -71.09523809523807
  1.056: -70.8095238095238
  1.089: -70.99999999999999
  1.122: -71.47619047619047
  1.155: -70.76190476190474
  1.1880000000000002: -69.33333333333331
..............................
...............................
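One way to confirm that plain floats survive serialization is to load the YAML text back and compare it with the original. The sketch below uses a shortened, two-entry stand-in for listVar (an assumption to keep the example small) and yaml.safe_load, which only accepts standard YAML types.

```python
import yaml

# Shortened stand-in for listVar; the real data has many more entries.
small_list = [{0.627: -47.57142857142857, 0.66: -35.76190476190476}]

# With no stream argument, yaml.dump returns the YAML text as a string.
yaml_text = yaml.dump(small_list)
restored = yaml.safe_load(yaml_text)

# Built-in floats round-trip through YAML, keys included.
print(restored == small_list)
```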


What if the complex data contains data types that yaml can’t support?


Let’s tweak the above complex data for the purposes of this tutorial. For example, let’s change the type of the keys and values from “float” to “np.float64”.

We have written the following code.

import yaml
import numpy as np
listVar = [{0.627: -47.57142857142857, 0.66: -35.76190476190476, 0.6930000000000001: -40.61904761904761, 0.726: -50.33333333333332, 0.759: -61.66666666666664, 0.792: -71.38095238095235, 0.8250000000000001: -76.23809523809521, 0.8580000000000001: -72.99999999999997, 0.891: -73.19047619047616, 0.924: -72.90476190476188, 0.9570000000000001: -72.33333333333331, 0.99: -71.66666666666664, 1.0230000000000001: -71.09523809523807, 1.056: -70.8095238095238, 1.089: -70.99999999999999, 1.122: -71.47619047619047, 1.155: -70.76190476190474, 1.1880000000000002: -69.33333333333331, 1.221: -67.66666666666664, 1.254: -66.23809523809523, 1.2870000000000001: -65.5238095238095, 1.32: -66.47619047619045, 1.353: -65.76190476190474, 1.3860000000000001: -64.33333333333331, 1.419: -62.66666666666665, 1.452: -61.23809523809522, 1.485: -60.52380952380951, 1.518: -60.99999999999998, 1.5510000000000002: -61.95238095238093, 1.584: -60.5238095238095, 1.617: -57.66666666666665, 1.6500000000000001: -54.33333333333332, 1.683: -51.47619047619046, 1.7160000000000002: -50.04761904761903, 1.749: -50.99999999999999, 1.782: -52.14285714285714, 1.8150000000000002: -50.42857142857143, 1.848: -47.0, 1.881: -43.0}, {5.577: -43.99999999999998, 5.61: -66.99999999999997, 5.643000000000001: -86.71428571428568, 5.676: -96.57142857142853, 5.7090000000000005: -89.99999999999997, 5.742: -89.99999999999997, 5.775: -89.99999999999997, 5.808: -89.99999999999997, 5.841: -89.99999999999997, 5.8740000000000006: -89.99999999999997, 5.907: -89.99999999999997, 5.94: -90.19047619047618, 5.973: -89.9047619047619, 6.006: -89.33333333333331, 6.039000000000001: -88.66666666666666, 6.072: -88.09523809523807, 6.105: -87.8095238095238, 6.138: -88.0, 6.171: -88.0, 6.204000000000001: -88.0, 6.237: -88.0, 6.2700000000000005: -88.0, 6.303: -88.0, 6.336: -88.0, 6.369000000000001: -88.0, 6.402: -88.0, 6.4350000000000005: -88.0, 6.468: -88.0, 6.501: -88.0, 6.534000000000001: -88.0, 6.567: -88.0, 6.6000000000000005: -88.19047619047618, 
6.633: -87.9047619047619, 6.666: -87.33333333333331, 6.699000000000001: -86.66666666666666, 6.732: -86.09523809523807, 6.765000000000001: -85.8095238095238, 6.798: -86.0, 6.831: -87.8095238095238, 6.864000000000001: -85.09523809523809, 6.897: -79.66666666666666, 6.930000000000001: -73.33333333333331, 6.963: -67.9047619047619, 6.996: -65.19047619047618, 7.029: -66.99999999999999, 7.062: -74.8095238095238, 7.095000000000001: -63.09523809523809}]

#Let us check the data type

print(f"\nThis is a {type(listVar)} data type")

dictItem = listVar[0]

#Let us check what is inside the list
print(f"\nThis is a {type(listVar)} data type that contains {type(dictItem)} data type")

numItem = list(dictItem.keys())[0]

valueItem = list(dictItem.values())[0]

print(f"""\nThis is a {type(listVar)} data type that contains {type(dictItem)}
data type and the keys are {type(numItem)} data type""")
print(f"""\nThis is a {type(listVar)} data type that contains {type(dictItem)}
data type and the values are {type(valueItem)} data type""")

listVarConverted = [{np.float64(i):np.float64(j) for i, j in item.items()} for item in listVar]

print("------------After conversion------------------------")

dictItem = listVarConverted[0]

numItem = list(dictItem.keys())[0]

valueItem = list(dictItem.values())[0]

print(f"""\nThis is a {type(listVar)} data type that contains {type(dictItem)}
data type and the keys are {type(numItem)} data type""")
print(f"""\nThis is a {type(listVar)} data type that contains {type(dictItem)}
data type and the values are {type(valueItem)} data type""")

The output is as follows.

This is a <class 'list'> data type
This is a <class 'list'> data type that contains <class 'dict'> data type
This is a <class 'list'> data type that contains <class 'dict'> data type and the keys are <class 'float'> data type
This is a <class 'list'> data type that contains <class 'dict'> data type and the values are <class 'float'> data type
------------After conversion------------------------
This is a <class 'list'> data type that contains <class 'dict'> data type and the keys are <class 'numpy.float64'> data type
This is a <class 'list'> data type that contains <class 'dict'> data type and the values are <class 'numpy.float64'> data type

Now, let’s try to write this complex data into a yaml file.


with open(r"D:\website\wixSite\articles\yaml\testYaml.yaml", "w") as wFile:
    yaml.dump(listVarConverted, wFile)

And, the output is as follows.

- ? !!python/object/apply:numpy.core.multiarray.scalar
  - &id001 !!python/object/apply:numpy.dtype
    args:
    - f8
    - false
    - true
    state: !!python/tuple
    - 3
    - <
    - null
    - null
    - null
    - -1
    - -1
    - 0
  - !!binary |
    qvHSTWIQ5D8=
  : !!python/object/apply:numpy.core.multiarray.scalar
  - *id001
  - !!binary |
    kiRJkiTJR8A=
  ? !!python/object/apply:numpy.core.multiarray.scalar
  - *id001
  - !!binary |
    H4XrUbge5T8=
  : !!python/object/apply:numpy.core.multiarray.scalar
  - *id001
  - !!binary |
    GIZhGIbhQcA=

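Here, the output is quite different. The leading “-” still shows that the top level is a list, but each key is now introduced by “?” (as in “? !!python/object/apply:numpy.core.multiarray.scalar”). The “?” is YAML's explicit marker for a mapping key that is too complex to be written as a plain scalar, and the matching “:” introduces its value. The tag !!python/object/apply:numpy.core.multiarray.scalar tells the loader to rebuild each key and value as a NumPy scalar, which is how we know the data are of type np.float64.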
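A quick way to confirm that np.float64 falls outside the standard YAML types is to try yaml.safe_dump, which refuses anything it cannot express with standard tags. The sketch below assumes NumPy is installed and uses a single made-up value.

```python
import numpy as np
import yaml

value = np.float64(0.627)

try:
    yaml.safe_dump([value])
    rejected = False
except yaml.representer.RepresenterError as err:
    # safe_dump only knows built-in types, so the NumPy scalar is rejected.
    rejected = True
    print(f"safe_dump refused the value: {err}")

# Converting to a built-in float first makes the value acceptable.
plain_text = yaml.safe_dump([float(value)])
print(plain_text)
```

The regular yaml.dump does not raise here; it falls back to the Python-specific !!python/object/apply tags shown above, which is why the output is so verbose.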


Here, the keys and values are stored as base64-encoded binary data (the !!binary entries) rather than as readable numbers. Below is an excerpt of one such mapping/dictionary.

? !!python/object/apply:numpy.core.multiarray.scalar
  - *id001
  - !!binary |
    lkOLbOf7G0A=
  : !!python/object/apply:numpy.core.multiarray.scalar
  - *id001
  - !!binary |
    wjAMwzBMUMA=
  ? !!python/object/apply:numpy.core.multiarray.scalar
  - *id001
  - !!binary |
    BFYOLbIdHEA=
  : !!python/object/apply:numpy.core.multiarray.scalar
  - *id001
  - !!binary |
    //////+/UMA=
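The &id001 and *id001 markers in this output are YAML's anchor and alias mechanism, which PyYAML uses whenever the same object appears more than once. A minimal sketch with a made-up structure (the names shared and doc are assumptions) reproduces it without NumPy:

```python
import yaml

# One list object referenced twice from the same structure.
shared = [1, 2]
doc = {"first": shared, "second": shared}

text = yaml.dump(doc, default_flow_style=False)
print(text)

# Loading resolves the alias back to a single shared object.
loaded = yaml.safe_load(text)
print(loaded["first"] is loaded["second"])
```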

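*id001 is a YAML alias: it refers back to a node defined earlier, so the same data does not have to be written out again at each subsequent occurrence. The node it refers to is marked with the anchor &id001, which was defined at the beginning of the output, as shown here: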

  - &id001 !!python/object/apply:numpy.dtype     
     args:     
     - f8     
     - false     
     - true

Here, f8 denotes an 8-byte (64-bit) floating-point number. The value is not a regular Python float but a numpy.float64.
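The relationship between f8, np.float64, and the built-in float can be checked directly in NumPy. This is a small sketch with one made-up value:

```python
import numpy as np

f8 = np.dtype("f8")

print(f8 == np.dtype(np.float64))  # "f8" and np.float64 name the same dtype
print(f8.itemsize)                 # size of one element, in bytes

value = np.float64(0.627)
print(type(value))         # NumPy scalar type
print(type(float(value)))  # built-in float after conversion
```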

What the serialized YAML output shows is a list of dictionaries whose keys and values are NumPy scalars rather than plain numbers. So, how can we properly serialize this complex data if it contains np.float64 items? Here, we need to customize the complex data before serializing it.
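Before turning to a custom representer, note that the most direct option is to convert the values yourself and then dump as usual. A minimal sketch, assuming a shortened stand-in for listVarConverted:

```python
import numpy as np
import yaml

# Shortened stand-in for listVarConverted.
numpy_list = [{np.float64(0.627): np.float64(-47.57142857142857),
               np.float64(0.66): np.float64(-35.76190476190476)}]

# Rebuild every dictionary with built-in floats for keys and values.
plain_list = [{float(k): float(v) for k, v in item.items()} for item in numpy_list]

plain_text = yaml.dump(plain_list, default_flow_style=False)
print(plain_text)
```

This works well when you control the call site. A registered representer, as described next, is more convenient when the conversion should happen automatically wherever yaml.dump is called.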


How to serialize the complex data that can not be normally serialized


The first step is to create a custom function, which programmers conventionally call a “custom dumper function” (PyYAML itself calls such a function a representer). It takes two parameters: the “dumper” object supplied by the yaml module and the data to be serialized. Within the function, we convert the data manually; in the case discussed above, the np.float64 keys and values need to be converted to regular floats. Once the items are converted, we rebuild a corrected copy of the complex data and return it through the appropriate represent method of the dumper object. For example, to return a corrected list, use the represent_list method of the dumper object. Then, using the add_representer function of the yaml module, we register the custom function for the data type it handles. From then on, whenever yaml encounters an object of that type while dumping, it invokes the custom function and serializes the data as the function specifies.
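The same three steps (write the function, return a node through a represent method, register the function) can be sketched on a simpler type first. The example below uses fractions.Fraction from the standard library purely as an illustration; it is not part of the data in this article, and the name fraction_representer is made up.

```python
from fractions import Fraction

import yaml

def fraction_representer(dumper, data):
    # Convert to a built-in float, then let the dumper build the YAML node.
    return dumper.represent_float(float(data))

# Register the function for the Fraction type.
yaml.add_representer(Fraction, fraction_representer)

fraction_text = yaml.dump([Fraction(1, 4), Fraction(1, 2)], default_flow_style=False)
print(fraction_text)
```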

Let’s look at the Python program to perform this.

First create the custom representer function, with the dumper object and data being the parameters.

def numpy_representer(dumper, data):

Next, we need to initialize an empty list to store the corrected version of the complex data.


def numpy_representer(dumper, data):
    """Initialize a new empty list that would store the modified dictionary."""
    new_data = []

Now, iterate through the list and check whether each item is an instance of dict. If it is, initialize an empty dictionary.

def numpy_representer(dumper, data):
    """Initialize a new empty list that would store the modified dictionary."""
    new_data = []
    
    #iterate through each dictionary
    for item in data:
        #If the item is a dictionary
        if isinstance(item, dict):
            new_item = {} 

Iterate through the key-value pairs of each dictionary and check whether the value is an instance of np.float64. If it is, convert both the key and the value to regular floats and add them to the new dictionary.

def numpy_representer(dumper, data):
    """Initialize a new empty list that would store the modified dictionary."""
    new_data = []
    
    #iterate through each dictionary
    for item in data:
        #If the item is a dictionary
        if isinstance(item, dict):
            new_item = {}
            #get the key and the value
            for key, value in item.items():
                #if the value is a np.float64, convert to float.
                if isinstance(value, np.float64):
                    key = float(key)
                    new_item[key] = float(value)

Once the keys and values in a dictionary are converted, then append the dictionary to the list.


def numpy_representer(dumper, data):
    """Initialize a new empty list that would store the modified dictionary."""
    new_data = []
    
    #iterate through each dictionary
    for item in data:
        #If the item is a dictionary
        if isinstance(item, dict):
            new_item = {}
            #get the key and the value
            for key, value in item.items():
                #if the value is a np.float64, convert to float.
                if isinstance(value, np.float64):
                    key = float(key)
                    new_item[key] = float(value)
                    #otherwise don't convert, but add to the dictionary
                else:
                    new_item[key] = value
            #When it is done, append the dictionary to the list
            new_data.append(new_item)

Finally, return the new converted list using the represent_list method of the dumper object.


def numpy_representer(dumper, data):
    """Initialize a new empty list that would store the modified dictionary."""
    new_data = []
    
    #iterate through each dictionary
    for item in data:
        #If the item is a dictionary
        if isinstance(item, dict):
            new_item = {}
            #get the key and the value
            for key, value in item.items():
                #if the value is a np.float64, convert to float.
                if isinstance(value, np.float64):
                    key = float(key)
                    new_item[key] = float(value)
                    #otherwise don't convert, but add to the dictionary
                else:
                    new_item[key] = value
            #When it is done, append the dictionary to the list
            new_data.append(new_item)
        else:
            new_data.append(item)
    #return the YAML node that represents the converted list
    return dumper.represent_list(new_data)

Now, register the custom representer using the add_representer function of the yaml module. add_representer takes two arguments: the data type that the representer should handle, and the custom representer function that produces the YAML representation for that type.


def numpy_representer(dumper, data):
    """Initialize a new empty list that would store the modified dictionary."""
    new_data = []
    
    #iterate through each dictionary
    for item in data:
        #If the item is a dictionary
        if isinstance(item, dict):
            new_item = {}
            #get the key and the value
            for key, value in item.items():
                #if the value is a np.float64, convert to float.
                if isinstance(value, np.float64):
                    key = float(key)
                    new_item[key] = float(value)
                    #otherwise don't convert, but add to the dictionary
                else:
                    new_item[key] = value
            #When it is done, append the dictionary to the list
            new_data.append(new_item)
        else:
            new_data.append(item)
    #return the YAML node that represents the converted list
    return dumper.represent_list(new_data)

# Add the custom representer to PyYAML for lists
# Whenever PyYAML dumps a list, it calls the numpy_representer custom function
yaml.add_representer(list, numpy_representer)
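Registering a representer for list affects every list that yaml.dump sees from then on. A narrower alternative, sketched here under the assumption that only np.float64 scalars need special handling, is to register a representer for np.float64 itself on a dedicated dumper subclass. The names NumpySafeDumper and float64_representer are hypothetical, and this is a different technique from the list-level function above, not a replacement prescribed by PyYAML.

```python
import numpy as np
import yaml

class NumpySafeDumper(yaml.SafeDumper):
    """SafeDumper subclass that also understands np.float64 scalars."""

def float64_representer(dumper, data):
    # Emit the NumPy scalar as an ordinary YAML float.
    return dumper.represent_float(float(data))

# Registered on the subclass only; yaml.Dumper and yaml.SafeDumper are untouched.
NumpySafeDumper.add_representer(np.float64, float64_representer)

# Shortened stand-in for the converted data.
sample = [{np.float64(0.627): np.float64(-47.5)}]
scoped_text = yaml.dump(sample, Dumper=NumpySafeDumper, default_flow_style=False)
print(scoped_text)
```

Because the representer works on individual scalars, it applies to np.float64 values at any nesting depth, not only to dictionaries that sit directly inside a list.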

Now, let us dump the complex data, using the dump method of the yaml module.


If the complex data is of the type registered with add_representer() (here, a list), it is passed to the custom representer function.

with open(f"{path}/Dictionary_Data/dictLowThreshCorrected/dictLowThreshCorrected.yaml", "w") as wFile:
    yaml.dump(data1, wFile, default_flow_style=False)
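To check the result, the written file can be loaded back with yaml.safe_load, which would fail on any leftover !!python/... tags. The sketch below repeats a compact version of the representer above, writes to a temporary file instead of the path used in this article, and uses a shortened stand-in for data1; all three are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import yaml

def numpy_representer(dumper, data):
    """Represent a list, converting np.float64 entries of inner dicts to float."""
    new_data = []
    for item in data:
        if isinstance(item, dict):
            new_item = {}
            for key, value in item.items():
                if isinstance(value, np.float64):
                    new_item[float(key)] = float(value)
                else:
                    new_item[key] = value
            new_data.append(new_item)
        else:
            new_data.append(item)
    return dumper.represent_list(new_data)

yaml.add_representer(list, numpy_representer)

# Shortened stand-in for data1.
sample = [{np.float64(0.627): np.float64(-47.5),
           np.float64(0.66): np.float64(-35.75)}]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "corrected.yaml")
    with open(out_path, "w") as wFile:
        yaml.dump(sample, wFile, default_flow_style=False)
    # safe_load accepts the file only if it contains standard YAML types.
    with open(out_path) as rFile:
        restored = yaml.safe_load(rFile)

print(restored)
```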
