Split

In this part of the tutorial we’ll learn how to run several analyses while reading input data only once and without storing it in memory.

Introduction

If we want to process the same data flow “simultaneously” with sequence1 and sequence2, we use the element Split:

from lena.core import Split

s = Sequence(
    ReadData(),
    Split([
        sequence1,
        sequence2,
        # ...
    ]),
    ToCSV(),
    # ...
)

The first argument of Split is a list of sequences, which are applied to the incoming flow “in parallel” (not in the sense of processes or threads).

However, not every sequence can be used in parallel with others. Recall the example of an element Sum from the first part of the tutorial:

class Sum1():
    def run(self, flow):
        s = 0
        for val in flow:
            s += val
        yield s

The problem is that if we pass it a flow, it will consume it completely. After we call Sum1().run(flow), there is no way to stop the iteration in the inner loop and resume it later. To iterate the flow in another sequence we would have to store it in memory or reread all the data.
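
The exhaustion problem can be seen with plain Python (a small sketch using the Sum1 class above): once a Run element has consumed an iterator, a second analysis of the same flow gets nothing.

```python
class Sum1():
    def run(self, flow):
        s = 0
        for val in flow:
            s += val
        yield s

flow = iter([1, 2, 3])
print(list(Sum1().run(flow)))  # [6]: the flow is consumed completely
print(list(Sum1().run(flow)))  # [0]: nothing is left for a second analysis
```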

To run analyses in parallel, we need another type of element. Here is Sum refactored:

class Sum():
    def __init__(self):
        self._sum = 0

    def fill(self, val):
        self._sum += val

    def compute(self):
        yield self._sum

This Sum has methods fill(value) and compute(). fill is called by some external code (for example, by Split). When there is nothing more to fill, the results can be generated by compute. The method name fill makes the class similar to a histogram. compute is trivial in this example, but it may involve larger computations. We call an element with methods fill and compute a FillCompute element. An element with a run method can be called a Run element.
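
Outside of a sequence, such an element can be driven by hand (a sketch repeating the Sum class above; in real analyses Split plays the role of the external code):

```python
class Sum():
    def __init__(self):
        self._sum = 0

    def fill(self, val):
        self._sum += val

    def compute(self):
        yield self._sum

s = Sum()
# the external code fills the element value by value...
for val in [1, 2, 3]:
    s.fill(val)
# ...and generates the results when there is nothing more to fill
print(list(s.compute()))  # [6]
```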

A FillCompute element can be generalized. We can place simple functions before it, which will transform values before they fill the element. We can also add other elements after the FillCompute element. Since compute is a generator, these elements can be either simple functions or Run elements. A sequence with a FillCompute element is called a FillComputeSeq.

Here is a working example:

tutorial/2_split/main1.py
data_file = os.path.join("..", "data", "normal_3d.csv")
s = Sequence(
    ReadData(),
    Split([
        (
            lambda vec: vec[0],
            Histogram(mesh((-10, 10), 10)),
            ToCSV(),
            Write("output", "x"),
        ),
        (
            lambda vec: vec[1],
            Histogram(mesh((-10, 10), 10)),
            ToCSV(),
            Write("output", "y"),
        ),
    ]),
    RenderLaTeX("histogram_1d.tex", "templates"),
    Write("output"),
    LaTeXToPDF(),
    PDFToPNG(),
)
results = s.run([data_file])
for res in results:
    print(res)

Lena’s Histogram is a FillCompute element. During the initialization of Split, the elements of its list (tuples in this example) are transformed into FillCompute sequences. The lambdas select the parts of the vectors that will fill the corresponding histograms. After a histogram is filled, it is given an appropriate name by Write (so that the histograms can be distinguished in the following flow).

Write has two initialization parameters: the default directory and the default file name. Write only writes strings (and unicode in Python 2). Its corresponding context is called output (like its module). If output is missing from the context, values pass unchanged. Otherwise, the file name and extension are searched for in context.output. If output.filename or output.fileext is missing, the default file name or “txt” is used. The default file name should be used only when you are sure that only one file is going to be written; otherwise the file will be overwritten every time. Write’s default parameters are an empty string (the current directory) and “output” (resulting in output.txt).

ToCSV yields a string and sets context.output.fileext to “csv”. In the example above Write objects write CSV data to output/x.csv and output/y.csv.

For each file written, Write yields a tuple (file path, context), where context.output.filepath is updated with the path to file.

After the histograms are filled and written, Split yields them into the following flow in turn. The containing sequence s doesn’t distinguish Split from other elements, because Split acts as any Run element.

Variables

One of the basic principles in programming is “don’t repeat yourself” (DRY).

In the example above, we wanted to give distinct names to histograms in different analysis branches, and used two writes to do that. However, we can move ToCSV and Write outside the Split (and make our code one line shorter):

tutorial/2_split/main2.py
from lena.output import MakeFilename
s = Sequence(
    ReadData(),
    Split([
        (
            lambda vec: vec[0],
            Histogram(mesh((-10, 10), 10)),
            MakeFilename("x"),
        ),
        (
            lambda vec: vec[1],
            Histogram(mesh((-10, 10), 10)),
            MakeFilename("y"),
        ),
    ]),
    ToCSV(),
    Write("output"),
    # ... as earlier ...
)

The element MakeFilename adds a file name to context.output. Write doesn’t need a default file name anymore. Now it writes two different files, because context.output.filename differs.

The code that we’ve written now is very explicit and flexible. We clearly see each step of the analysis and the analysis as a whole. We control the output names, and we can change the logic as we wish by adding another element or lambda. The structure of our analysis is very transparent, but the code is not beautiful enough.

Lambdas don’t improve readability, and the indices 0 and 1 look like magic constants. They are connected to the names x and y in the following flow, so let us unite them in one element (and improve the cohesion of our code):

tutorial/2_split/main3.py
from lena.variables import Variable

def main():
    data_file = os.path.join("..", "data", "normal_3d.csv")
    write = Write("output")
    s = Sequence(
        ReadData(),
        Split([
            (
                Variable("x", lambda vec: vec[0]),
                Histogram(mesh((-10, 10), 10)),
            ),
            (
                Variable("y", lambda vec: vec[1]),
                Histogram(mesh((-10, 10), 10)),
            ),
            (
                Variable("z", lambda vec: vec[2]),
                Histogram(mesh((-10, 10), 10)),
            ),
        ]),
        MakeFilename("{{variable.name}}"),
        ToCSV(),
        write,
        RenderLaTeX("histogram_1d.tex", "templates"),
        write,
        LaTeXToPDF(),
        PDFToPNG(),
    )
    results = s.run([data_file])
    for res in results:
        print(res)

A Variable is essentially a function with a name. It transforms data and adds its own name to context.variable.name.

In this example we initialize a variable with a name and a function. It can accept arbitrary keyword arguments, which will be added to its context. For example, if our data is a series of (positron, neutron) events, then we can make a variable to select the second event:

neutron = Variable(
   "neutron", lambda double_ev: double_ev[1],
   latex_name="n", type="particle"
)

In this case context.variable will be updated not only with name, but also with latex_name and type. In code their values can be accessed as the variable’s attributes (e.g. neutron.latex_name). The variable’s function can be set with the keyword argument getter and is available as the method getter.
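
The mechanics can be illustrated with a simplified stand-in (a sketch, not Lena’s actual implementation; the names SimpleVariable and get_data_context are introduced here for illustration):

```python
import copy

def get_data_context(value):
    """Split a value into (data, context); context is {} if absent."""
    if (isinstance(value, tuple) and len(value) == 2
            and isinstance(value[1], dict)):
        return value[0], value[1]
    return value, {}

class SimpleVariable():
    """A simplified sketch of a variable: a named function that
    transforms data and records itself in context.variable."""

    def __init__(self, name, getter, **kwargs):
        self.name = name
        self.getter = getter
        # arbitrary keywords become attributes of the variable
        for key, val in kwargs.items():
            setattr(self, key, val)
        self._context = dict(kwargs, name=name)

    def __call__(self, value):
        data, context = get_data_context(value)
        context = copy.deepcopy(context)
        context["variable"] = dict(self._context)
        return (self.getter(data), context)

neutron = SimpleVariable(
    "neutron", lambda double_ev: double_ev[1],
    latex_name="n", type="particle"
)
data, context = neutron(((1.05, 0.98, 0.8), (1.1, 1.1, 1.3)))
print(data)                               # (1.1, 1.1, 1.3)
print(context["variable"]["latex_name"])  # n
print(neutron.latex_name)                 # n
```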

MakeFilename accepts not only constant strings, but also format strings, which take their arguments from the context. In our example, MakeFilename("{{variable.name}}") creates a file name from context.variable.name.
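
The substitution itself could be sketched like this (make_filename is a hypothetical helper, not Lena’s actual MakeFilename, which also manages context.output):

```python
import re

def make_filename(template, context):
    """Substitute {{dotted.keys}} in template with values from
    a nested context dictionary (a simplified sketch)."""
    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

context = {"variable": {"name": "x"}}
print(make_filename("{{variable.name}}", context))  # x
```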

Note also that since two Writes do the same thing, we rewrote them as one object.

Combine

Variables can be joined into a multidimensional variable using Combine.

Combine(var1, var2, …) applied to a value is a tuple ((var1.getter(value), var2.getter(value), …), context). The first element of the tuple is the value transformed by each of the combined variables. Variable.getter is a function that returns only data, without context.

Combine is a subclass of Variable, and it accepts arbitrary keywords during initialization. All positional arguments must be Variables. The name of the combined variable can be passed as a keyword argument. If it is not provided, the names of the combined variables joined with ‘_’ are used.

The resulting context is that of a usual Variable updated with context.variable.combine, where combine is a tuple of each variable’s context.

Combine has an attribute dim, which is the number of its variables. A constituent variable can be accessed by its index. For example, if cv is Combine(var1, var2), then cv.dim is 2, cv.name is var1.name_var2.name, and cv[1] is var2.
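
These properties can be sketched with a minimal stand-in (SimpleCombine and Var are hypothetical names for illustration; the real Combine also produces context):

```python
class Var():
    """A named function (a stand-in for a Variable)."""
    def __init__(self, name, getter):
        self.name = name
        self.getter = getter

class SimpleCombine():
    """A sketch of Combine: several variables applied to the same
    value, producing a tuple of their results."""

    def __init__(self, *variables, **kwargs):
        self._variables = variables
        self.dim = len(variables)
        # name defaults to the variables' names joined with '_'
        self.name = kwargs.get(
            "name", "_".join(var.name for var in variables)
        )

    def __getitem__(self, index):
        return self._variables[index]

    def getter(self, value):
        return tuple(var.getter(value) for var in self._variables)

x = Var("x", lambda vec: vec[0])
y = Var("y", lambda vec: vec[1])
cv = SimpleCombine(x, y)
print(cv.dim, cv.name, cv.getter((1, 2)))  # 2 x_y (1, 2)
```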

Combine variables are used for multidimensional plots.

Compose

When we put several variables or functions into a sequence, we obtain their composition. In the Lena framework we want to preserve as much context as possible. If some previous element was a Variable, its context is moved into variable.compose subcontext.

Function composition can be also defined as variables.Compose.

In this example we first select the neutron part of the data, and then the x coordinate:

>>> from lena.variables import Variable, Compose
>>> # data is pairs of (positron, neutron) coordinates
>>> data = [((1.05, 0.98, 0.8), (1.1, 1.1, 1.3))]
>>> x = Variable(
...    "x", lambda coord: coord[0], type="coordinate"
... )
>>> neutron = Variable(
...    "neutron", latex_name="n",
...    getter=lambda double_ev: double_ev[1], type="particle"
... )
>>> x_n = Compose(neutron, x)
>>> x_n(data[0])[0] # data
1.1

The data part of the result, as expected, is the composition of the variables neutron and x. The same result could be obtained with a sequence of variables, Sequence(neutron, x).run(data), but the context of Compose is created differently.

The name of the composed variable is the names of its variables (from left to right) joined with an underscore. If there are two variables, a LaTeX name will also be created from their names (or LaTeX names, if present) as a subscript in reverse order. In our example the context will be:

>>> x_n(data[0])[1]
{
    'variable': {
        'name': 'neutron_x', 'particle': 'neutron',
        'latex_name': 'x_{n}', 'coordinate': 'x', 'type': 'coordinate',
        'compose': {
            'type': 'particle', 'latex_name': 'n',
            'name': 'neutron', 'particle': 'neutron'
        },
    }
}

Context of the composed variable is updated with a compose subcontext, which makes it similar to the context produced by variables in a sequence.

As for any variable, name or other parameters can be passed as keyword arguments during initialization.

The keyword type has a special meaning. If it is present, then during the initialization of a variable its context is updated with the pair {variable.type: variable.name}. During variable composition (in Compose or by subsequent application to the flow) context.variable is updated with the new variable’s context, but if the types differ, the old type will persist. This allows access to context.variable.particle even after the variable was composed with other variables.

Analysis example

Let us combine what we’ve learnt and use it in a real analysis. One important change is that since we now create 2-dimensional plots, we add another template for them. Below is a small example. All template commands were explained in the first part of the tutorial.

tutorial/2_split/templates/histogram_2d.tex
\documentclass{standalone}
\usepackage{tikz}
\usepackage{pgfplots}
\usepgfplotslibrary{colorbrewer}
\pgfplotsset{compat=1.15}

\BLOCK{ set varx = variable.combine[0] }
\BLOCK{ set vary = variable.combine[1] }

\begin{document}
\begin{tikzpicture}
    \begin{axis}[
        view={0}{90},
        grid=both, 
        \BLOCK{ set xcols = histogram.nbins[0]|int + 1 }
        \BLOCK{ set ycols = histogram.nbins[1]|int + 1 }
        mesh/cols=\VAR{xcols},
        mesh/rows=\VAR{ycols},
        colorbar horizontal,
        xlabel = {$\VAR{ varx.latex_name }$
            \BLOCK{ if varx.unit }[$\mathrm{\VAR{ varx.unit }}$]\BLOCK{ endif }},
        ylabel = {$\VAR{ vary.latex_name }$
            \BLOCK{ if vary.unit }[$\mathrm{\VAR{ vary.unit }}$]\BLOCK{ endif }},
    ]
    \addplot3 [
        surf,
        mesh/ordering=y varies,
    ] table [col sep=comma, header=false] {\VAR{ output.filepath }};
    \end{axis}
\end{tikzpicture}
\end{document}

If an axis has a unit, it will be added to its label (like x [cm]).

RenderLaTeX accepts a function as the first initialization argument or as a keyword select_template. That function must accept a value (presumably a (data, context) pair) from the flow, and return a template file name (to be found inside template_path).

tutorial/2_split/main4.py
import os

import lena.context
import lena.flow
from lena.core import Sequence, Split, Source
from lena.structures import Histogram
from lena.math import mesh
from lena.output import ToCSV, Write, LaTeXToPDF, PDFToPNG
from lena.output import MakeFilename, RenderLaTeX
from lena.variables import Variable, Compose, Combine

from read_data import ReadDoubleEvents


positron = Variable(
    "positron", latex_name="e^+",
    getter=lambda double_ev: double_ev[0], type="particle"
)
neutron = Variable(
    "neutron", latex_name="n",
    getter=lambda double_ev: double_ev[1], type="particle"
)
x = Variable("x", lambda vec: vec[0], latex_name="x", unit="cm", type="coordinate")
y = Variable("y", lambda vec: vec[1], latex_name="y", unit="cm", type="coordinate")
z = Variable("z", lambda vec: vec[2], latex_name="z", unit="cm", type="coordinate")

coordinates_1d = [
    (
        coordinate,
        Histogram(mesh((-10, 10), 10)),
    )
    for coordinate in [
        Compose(particle, coord)
            for coord in (x, y, z)
            for particle in (positron, neutron)
    ]
]


def select_template(val):
    data, context = lena.flow.get_data_context(val)
    if lena.context.get_recursively(context, "histogram.dim", None) == 2:
        return "histogram_2d.tex"
    else:
        return "histogram_1d.tex"


def main():
    data_file = os.path.join("..", "data", "double_ev.csv")
    write = Write("output")
    s = Sequence(
        ReadDoubleEvents(),
        Split(
            coordinates_1d
            + 
            [(
                particle,
                Combine(x, y, name="xy"),
                Histogram(mesh(((-10, 10), (-10, 10)), (10, 10))),
                MakeFilename("{{variable.particle}}/{{variable.name}}"),
             )
             for particle in (positron, neutron)
            ]
        ),
        MakeFilename("{{variable.particle}}/{{variable.coordinate}}"),
        ToCSV(),
        write,
        RenderLaTeX(select_template, template_dir="templates"),
        write,
        LaTeXToPDF(),
        PDFToPNG(),
    )
    results = s.run([data_file])
    for res in results:
        print(res)


if __name__ == "__main__":
    main()

We import ReadDoubleEvents from a separate file. That class is practically the same as earlier, but it yields pairs of events instead of single events.

We define coordinates_1d as a plain list of compositions of particles and coordinates. Note that we can build all the combinations directly in the language. We could also do that inside Split, but if we use these coordinates in different analyses or don’t want to clutter the algorithm code, we can define them separately.

In our new function select_template we use lena.context.get_recursively. This function is needed because we often have nested dictionaries, and Python’s dict.get method doesn’t recurse. We provide the default return value None, so that it doesn’t raise an exception in case of a missing key.
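
A minimal version of such a recursive lookup could look like this (a sketch; the actual lena.context.get_recursively is more general):

```python
_sentinel = object()

def get_recursively(context, path, default=_sentinel):
    """Look up a dotted path like "histogram.dim" in nested dicts.

    Return default if given and the path is missing,
    otherwise raise KeyError (a simplified sketch).
    """
    value = context
    for key in path.split("."):
        if isinstance(value, dict) and key in value:
            value = value[key]
        elif default is not _sentinel:
            return default
        else:
            raise KeyError(path)
    return value

context = {"histogram": {"dim": 2}}
print(get_recursively(context, "histogram.dim", None))  # 2
print(get_recursively(context, "variable.name", None))  # None
```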

In the Split element we fill histograms for 1- and 2-dimensional plots in one run. There are two MakeFilename elements, but MakeFilename doesn’t overwrite file names set previously.

We created our first 2-dimensional histogram using lena.math.mesh. It accepts parameters ranges and nbins. In a multidimensional case these parameters are tuples of ranges and number of bins in corresponding dimensions, as in mesh(((-10, 10), (-10, 10)), (10, 10)).

After we run this script, we obtain two subdirectories in output for positron and neutron, each containing 4 plots (both pdf and png); in total 8 plots with proper names, units, axes labels, etc. It is straightforward to add other plots if we want, or to disable some of them in Split by commenting them out. The variables that we defined at the top level could be reused in other modules or moved to a separate module.

Note the overall design of our algorithm. We prepare all the necessary data in ReadDoubleEvents. After that, Split uses different parts of these double events to create different plots. All important parameters should be contained in the data itself. This allows a separation of data from presentation.

The knowledge gained by the end of this chapter will be sufficient for most practical analyses. The following sections give more details about Lena elements and their usage.

Adapters, elements and sequences

Objects don’t need to inherit from Lena classes to be used in the framework. Instead, they have to implement methods with specified names (like run, fill, etc.). This is called structural subtyping in Python [1].

The specified method names can be changed using adapters. For example, if we have a legacy class

class MyEl():
    def my_run(self, flow):
        for val in flow:
            yield val

then we can create a Run element from a MyEl object with the adapter Run:

>>> from lena.core import Run
>>> my_run = Run(MyEl(), run="my_run")
>>> list(my_run.run([1, 2, 3]))
[1, 2, 3]

The adapter receives the method name as a keyword argument. After it is created, its method run can be called, or it can be inserted into a Lena sequence.

Similarly, a FillCompute adapter accepts names for methods fill and compute:

FillCompute(el, fill='fill', compute='compute')

If callable methods fill and compute are not found in el, a LenaTypeError is raised.

What other types of elements are possible in data analysis? A common algorithm in physics is event selection. We analyse a large set of data looking for specific events. These events may be missing from the data or contained in a large quantity. To deal with this, we have to be prepared not to consume the whole flow (as a Run element does) and not to store the whole flow in the element before it is yielded. We create an element with a fill method, and call the second method request. A FillRequest element is similar to FillCompute, but request can be called multiple times. As with FillComputeSeq, we can add Call elements (lambdas) before a FillRequest element and Call or Run elements after it to create a FillRequestSeq.
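
A FillRequest element for event selection could be sketched like this (hypothetical code, not from Lena; here request also clears the stored events, which is one way to keep memory bounded between buffers):

```python
class SelectAbove():
    """A sketch of a FillRequest element: stores only selected
    events and can yield results many times, once per buffer."""

    def __init__(self, threshold):
        self._threshold = threshold
        self._selected = []

    def fill(self, val):
        if val > self._threshold:
            self._selected.append(val)

    def request(self):
        # unlike compute, request may be called multiple times
        for val in self._selected:
            yield val
        # forget yielded events so that memory doesn't grow
        self._selected = []

sel = SelectAbove(10)
for val in [3, 15, 7, 42]:   # first buffer
    sel.fill(val)
print(list(sel.request()))   # [15, 42]
for val in [11, 2]:          # next buffer
    sel.fill(val)
print(list(sel.request()))   # [11]
```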

Elements can be transformed one into another. During initialization, a Sequence checks for each of its arguments whether it has a run method. If that is missing, it tries to convert the element to a Run element using the adapter.

Run can be initialized from a Call or a FillCompute element. A callable is run as a transformation function, which accepts single values from the flow and returns their transformations:

for val in flow:
    yield self._el(val)

A FillCompute element is run the following way: first, fill(value) is called for the whole flow. After the flow is exhausted, compute() is called.
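
This conversion can be sketched as a small wrapper (an illustration of the idea, not Lena’s actual adapter code):

```python
class RunFromFillCompute():
    """Sketch: adapt a FillCompute element to the Run interface."""

    def __init__(self, el):
        self._el = el

    def run(self, flow):
        # fill(value) is called for the whole flow first
        for val in flow:
            self._el.fill(val)
        # after the flow is exhausted, compute() is called
        for result in self._el.compute():
            yield result

class Sum():
    def __init__(self):
        self._sum = 0
    def fill(self, val):
        self._sum += val
    def compute(self):
        yield self._sum

print(list(RunFromFillCompute(Sum()).run([1, 2, 3])))  # [6]
```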

There are algorithms and structures that are inherently not memory safe. For example, lena.structures.Graph stores all filled data as its points, and it is a FillRequest element. Since a FillRequest element can’t be used directly in a Sequence, or if we want to yield only the final result once, we cast it with FillCompute(Graph()). We do that when we are sure that our data won’t overflow memory, and the cast will be explicit in our code.

To sum up, adapters in Lena can be used for several purposes:

  • provide a different name for a method (Run(my_obj, run="my_run")),

  • hide unused methods to prevent ambiguity (if an element has many methods, we can wrap that in an adapter to expose only the needed ones),

  • automatically convert objects of one type to another in sequences (FillCompute to Run),

  • explicitly cast object of one type to another (FillRequest to FillCompute).

Split

In the examples above, Split contained several FillComputeSeq sequences. However, it can be used with all other sequences we know.

Split has a keyword initialization argument bufsize, which is the size of the buffer for the input flow.

During Split.run(flow), the flow is divided into subslices of bufsize values each. Each subslice is processed by the sequences in the order of their initializer list (the first positional argument of Split.__init__).

If a sequence is a Source, it doesn’t accept the incoming flow, but produces its own complete flow and becomes inactive (is not called any more).

A FillRequestSeq is filled with the buffer contents. After the buffer is finished, it yields all values from request().

A FillComputeSeq is filled with values from each buffer, but yields values from compute only after the whole flow is finished.

A Sequence is called with run(buffer) instead of the whole flow. The results are yielded for each buffer. If the whole flow must be analysed at once, don’t use such a sequence in Split.
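
For plain Run sequences, the buffering logic can be sketched with itertools (much simplified: split_run is a hypothetical function, and the real Split also deep-copies values and supports the other sequence types):

```python
import itertools

def split_run(sequences, flow, bufsize=2):
    """Sketch: feed the flow to several Run sequences
    in bufsize-sized slices, yielding their results in turn."""
    flow = iter(flow)
    while True:
        buf = list(itertools.islice(flow, bufsize))
        if not buf:
            break
        for seq in sequences:
            # pass a fresh list so each sequence iterates independently
            for result in seq.run(list(buf)):
                yield result

class Double():
    def run(self, flow):
        for val in flow:
            yield 2 * val

class Negate():
    def run(self, flow):
        for val in flow:
            yield -val

print(list(split_run([Double(), Negate()], [1, 2, 3])))
# [2, 4, -1, -2, 6, -3]
```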

If the flow was empty, each __call__ (from Source), compute, request or run is called nevertheless.

A Source within Split can be used to add new data to the flow. For example, we can create Split([source, ()]): at this place of the sequence, first all data from source will be generated, then all data from the preceding elements will be passed (an empty Sequence passes values unchanged). This can be used to provide several flows to a further element (like data, Monte Carlo and an analytical approximation).

Split acts both as a sequence (because it contains sequences) and as an element. If all its elements (sequences, to be precise) have the same type, Split will have methods of this type. For example, if Split has only FillComputeSeq inside, it will create methods fill and compute. During fill all its sequences will be filled. During compute their results will be yielded in turn (all results from the first sequence, then from the second, etc). Split with Source sequences will act as a Source. Of course, Split can be used within a Split.

Context. Performance and safety

Dictionaries in Python are mutable, that is, their content can change. If an element stores the current context, that context may be changed by some other element. The simplest example: if your original data has context, it will be changed after being processed by a sequence.

This is how a typical Run element deals with context. To be most useful, it must be prepared to accept data with and without context:

class RunEl():
    def __init__(self):
        self._context = {"subcontext": "el"}

    def run(self, flow):
        for val in flow:
            data, context = lena.flow.get_data_context(val)
            # ... do something, creating new_data ...
            lena.flow.update_recursively(context, self._context)
            yield (new_data, context)

lena.flow.get_data_context(value) splits value into a pair of (data, context). If value contained only data without context, the context part will be an empty dictionary (therefore it is safe to use get_data_context with any value). If only one part is needed, lena.flow.get_data or lena.flow.get_context can be used.

If subcontext can contain other keys besides el, then to preserve them we call lena.flow.update_recursively instead of context.update. This function doesn’t overwrite subdictionaries entirely, but only the conflicting keys within them. In this case the key context.subcontext will always be set to el, but if self._context.subcontext were a dictionary {"el": "el1"}, then all keys of context.subcontext (if present) except el would remain.
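
A minimal recursive update could look like this (a sketch of the idea behind lena.flow.update_recursively):

```python
def update_recursively(d, other):
    """Update nested dict d with other, merging subdictionaries
    instead of overwriting them (a simplified sketch)."""
    for key, val in other.items():
        if isinstance(val, dict) and isinstance(d.get(key), dict):
            update_recursively(d[key], val)
        else:
            d[key] = val

context = {"subcontext": {"el": "el0", "other": "kept"}}
update_recursively(context, {"subcontext": {"el": "el1"}})
print(context)  # {'subcontext': {'el': 'el1', 'other': 'kept'}}
```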

Usually elements in a Sequence yield computed data and context, and never use or change them again. In Split, however, several sequences use the same data simultaneously. This is why Split makes a deep copy of the incoming flow in its buffer. A deep copy of a context is completely independent of the original and its other copies. However, deep copying an entire dictionary has a computational cost.

Split can be initialized with a keyword argument copy_buf. By default it is True, but can be set to False to disable deep copy of the flow. This may be a bit faster, but do it only if you are absolutely sure that your analysis will remain correct.

There are several things in Lena that help against context interference:

  • elements change their own context (Write changes context.output and not context.variable),

  • if Split has several sequences, it makes a deep copy of the flow before feeding that to them,

  • FillCompute and FillRequest elements make a deep copy of context before yielding [3].

This is how a FillCompute element is usually organised in Lena:

class MyFillComputeEl():
    def __init__(self):
        self._val = 0
        self._context = {"subcontext": "el"}
        self._cur_context = {}

    def fill(self, val):
        data, context = lena.flow.get_data_context(val)
        self._val += data
        self._cur_context = context

    def compute(self):
        context = copy.deepcopy(self._cur_context)
        # or copy.deepcopy(self._context):
        lena.flow.update_recursively(context, self._context)
        yield (self._val, context)

During fill the last context is saved. During compute a deep copy of it is made (since compute is called only once, this can be done without performance loss), and the copy is updated with self._context.

Performance is not the highest priority in Lena, but it is always nice to have. When possible, optimizations are made. Performance measurements show that deepcopy can take the most time in a Lena analysis [2]. A linear Sequence and Run elements don’t make deep copies of the data. If Split contains several sequences, it doesn’t make a deep copy of the flow for the last sequence. It is possible to circumvent all copying of data in Split to gain more performance, at the cost of more precautions and more carefully structured code.


Summary

Several analyses can be performed on one flow using an element Split. It accepts a list of sequences as its first initialization argument.

Since Split divides the flow into buffered slices, elements must be prepared for that. In this part of the tutorial we introduced the FillCompute and the FillRequest elements. The former yields the results when its compute method is called. It is supposed that FillCompute is run only once and that it is memory safe (that it reduces data). If an element can consume much memory, it must be a FillRequest element.

If we add Call elements before and Run and Call elements after our FillCompute or FillRequest elements, we can generalize them to sequences FillComputeSeq and FillRequestSeq. They are created implicitly during Split initialization.

Variables connect functions with context. They have names and can have LaTeX names, units and other parameters, which helps to create plots and write output files. Compose corresponds to function composition, while Combine creates multidimensional variables for multidimensional plots.

If an element has methods with unusual names, adapters can be used to relate them to the framework names. Adapters are also used to explicitly cast one type of element to another or to implicitly convert an element to an appropriate type during a sequence initialization.

To be most useful, elements should be prepared to accept values consisting of only data or data with context. To work safely with a mutable context, a deep copy of that must be made in compute or request. On the other hand, unnecessary deep copies (in run, fill or __call__) may slightly decrease the performance. Lena allows optimizations if they are needed.

Exercises

  1. Extend the Sum example in this chapter so that it can handle context. Check that it works.

  2. In the analysis example main4.py there are two MakeFilename elements. Is it possible to use only one of them? How?

  3. We developed the example main2.py and joined the lambda and the file name into a Variable. We could also add a name to the Histogram. Which design decision would be better?

  4. What are the consequences of calling compute even for an empty flow?

  5. Alexander writes a diploma thesis involving some data analysis and wants to choose a framework for that. He asks colleagues and professors, and stops at three possible options. One library is easy to use and straight to the point, and is sufficient for most diploma theses. Another library is very rich and used by seasoned professionals, and its full power surpasses even its documentation. The third framework doesn’t provide a plenty of mathematical functions, but promises structured and beautiful code. Which one would you advise?

Footnotes

[1] PEP 544 – Protocols: Structural subtyping (static duck typing): https://www.python.org/dev/peps/pep-0544

[2] One can use tutorial/2_split/performance.py to make a quick analysis. To create 3 histograms (as in the main4.py example above) for one million generated events took 82 seconds in Python 2 on a laptop. The largest share of the time was spent in copy.deepcopy (20 seconds). For Python 3, PyPy and PyPy 3 the total time was 71, 23 and 16 seconds respectively. These numbers are approximate (a second measurement for PyPy gave 19 seconds). If we change the Variables into lambdas, add MakeFilename after Histogram and set copy_buf=False in Split, the total time falls to 18 seconds for Python 2 and 4 seconds for PyPy 3.

This difference may be not important in practice: for example, the author usually deals with data sets of several tens of thousands events, and a large amount of time is spent to create 2-dimensional plots with pdflatex.

[3] For framework elements this is obligatory; for user code it is recommended.