Split
=====

In this part of the tutorial we'll learn how to perform several analyses
reading the input data only once, without storing it in memory.

.. contents:: Contents
    :local:

Introduction
------------

If we want to process the same data flow "simultaneously" by *sequence1*
and *sequence2*, we use the element *Split*:

.. code-block:: python

    from lena.core import Split

    s = Sequence(
        ReadData(),
        Split([
            sequence1,
            sequence2,
            # ...
        ]),
        ToCSV(),
        # ...
    )

The first argument of *Split* is a list of sequences, which are applied
to the incoming flow "in parallel" (not in the sense of processes or
threads).

However, not every sequence can be used in parallel with others.
Recall the example of an element *Sum* from the first part of the
tutorial:

.. code-block:: python

    class Sum1():
        def run(self, flow):
            s = 0
            for val in flow:
                s += val
            yield s

The problem is that if we pass it a *flow*, it will consume it
completely. After we call *Sum1().run(flow)*, there is no way to pause
the iteration in the inner loop and resume it later. To iterate the
*flow* again in another sequence, we would have to store it in memory
or reread all the data.

To run analyses in parallel, we need another type of element.
Here is *Sum* refactored:

.. _sum:

.. code-block:: python

    class Sum():
        def __init__(self):
            self._sum = 0

        def fill(self, val):
            self._sum += val

        def compute(self):
            yield self._sum

This *Sum* has the methods *fill(value)* and *compute()*.
*Fill* is called by some external code (for example, by *Split*).
When there is nothing more to fill, the results can be generated by
*compute*. The method name *fill* makes this class similar to a
histogram. *Compute* in this example is trivial, but it may include
larger computations.

We call an element with methods *fill* and *compute* a *FillCompute*
element. An element with a *run* method can be called a *Run* element.

A *FillCompute* element can be generalized.
We can place simple functions before it, which will transform values
before they fill the element. We can also add other elements after the
*FillCompute* element. Since *compute* is a generator, these elements
can be either simple functions or *Run* elements.
A sequence with a *FillCompute* element is called a *FillComputeSeq*.
Here is a working example:

.. code-block:: python
    :caption: tutorial/2_split/main1.py

    data_file = os.path.join("..", "data", "normal_3d.csv")
    s = Sequence(
        ReadData(),
        Split([
            (
                lambda vec: vec[0],
                Histogram(mesh((-10, 10), 10)),
                ToCSV(),
                Write("output", "x"),
            ),
            (
                lambda vec: vec[1],
                Histogram(mesh((-10, 10), 10)),
                ToCSV(),
                Write("output", "y"),
            ),
        ]),
        RenderLaTeX("histogram_1d.tex", "templates"),
        Write("output"),
        LaTeXToPDF(),
        PDFToPNG(),
    )
    results = s.run([data_file])
    for res in results:
        print(res)

Lena *Histogram* is a *FillCompute* element.
The elements of the list in *Split* (tuples in this example) are
transformed into *FillCompute* sequences during the initialization of
*Split*. The *lambdas* select the parts of vectors that will fill the
corresponding histograms. After a histogram is filled, it is given an
appropriate name by *Write*, so that the results can be distinguished
in the following flow.
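As a side note, nothing prevents us from driving a *FillCompute* element
by hand. This is, roughly, the protocol that *Split* follows for each of
its *FillCompute* sequences (a sketch for illustration, not the actual
framework code):

.. code-block:: python

    s = Sum()
    # fill the element with the whole flow
    for val in [1, 2, 3]:
        s.fill(val)
    # after the flow is exhausted, generate the results
    results = list(s.compute())
    # results == [6]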
*Write* has two initialization parameters: the default directory and the
default file name. *Write* only writes strings (and *unicode* in
Python 2). Its corresponding context is called *output* (like its
module). If *output* is missing in the context, values pass unchanged.
Otherwise, the file name and extension are searched in *context.output*.
If *output.filename* or *output.fileext* is missing, the default file
name or "txt" is used. The default file name should be used only when
you are sure that only one file is going to be written; otherwise the
file will be overwritten every time. *Write*'s default parameters are an
empty string (the current directory) and "output" (resulting in
*output.txt*).

*ToCSV* yields a string and sets *context.output.fileext* to *"csv"*.
In the example above, the *Write* objects write CSV data to
*output/x.csv* and *output/y.csv*. For each written file, *Write* yields
a tuple *(file path, context)*, where *context.output.filepath* is
updated with the path to the file.

After the histograms are filled and written, *Split* yields them into
the following flow in turn. The containing sequence *s* doesn't
distinguish *Split* from other elements, because *Split* acts like any
*Run* element.

Variables
---------

One of the basic principles of programming is "don't repeat yourself"
(*DRY*). In the example above, we wanted to give distinct names to
histograms in different analysis branches, and used two *Write* elements
to do that. However, we can move *ToCSV* and *Write* outside the *Split*
(and make our code one line shorter):

.. _main2_py:

.. code-block:: python
    :caption: tutorial/2_split/main2.py

    from lena.output import MakeFilename

    s = Sequence(
        ReadData(),
        Split([
            (
                lambda vec: vec[0],
                Histogram(mesh((-10, 10), 10)),
                MakeFilename("x"),
            ),
            (
                lambda vec: vec[1],
                Histogram(mesh((-10, 10), 10)),
                MakeFilename("y"),
            ),
        ]),
        ToCSV(),
        Write("output"),
        # ... as earlier ...
    )

The element *MakeFilename* adds a file name to *context.output*.
*Write* doesn't need a default file name anymore. Now it writes two
different files, because *context.output.filename* is different.

The code that we've written is very explicit and flexible.
We clearly see each step of the analysis and the analysis as a whole.
We control the output names, and we can change the logic as we wish by
adding another element or *lambda*.

The structure of our analysis is very transparent, but the code is not
beautiful enough. Lambdas don't improve readability. The indices *0* and
*1* look like magic constants. They are connected to the names *x* and
*y* in the following flow, so let us unite them in one element (and
improve the *cohesion* of our code):

.. code-block:: python
    :caption: tutorial/2_split/main3.py

    from lena.variables import Variable

    def main():
        data_file = os.path.join("..", "data", "normal_3d.csv")
        write = Write("output")
        s = Sequence(
            ReadData(),
            Split([
                (
                    Variable("x", lambda vec: vec[0]),
                    Histogram(mesh((-10, 10), 10)),
                ),
                (
                    Variable("y", lambda vec: vec[1]),
                    Histogram(mesh((-10, 10), 10)),
                ),
                (
                    Variable("z", lambda vec: vec[2]),
                    Histogram(mesh((-10, 10), 10)),
                ),
            ]),
            MakeFilename("{{variable.name}}"),
            ToCSV(),
            write,
            RenderLaTeX("histogram_1d.tex", "templates"),
            write,
            LaTeXToPDF(),
            PDFToPNG(),
        )
        results = s.run([data_file])
        for res in results:
            print(res)

A *Variable* is essentially a function with a name. It transforms data
and adds its own name to *context.variable.name*.
In this example we initialize a variable with a name and a function.
A variable can accept arbitrary keyword arguments, which will be added
to its context. For example, if our data is a series of
*(positron, neutron)* events, then we can make a variable to select the
second of them, the neutron:

.. code-block:: python

    neutron = Variable(
        "neutron", lambda double_ev: double_ev[1],
        latex_name="n", type="particle"
    )

In this case *context.variable* will be updated not only with *name*,
but also with *latex_name* and *type*.
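A minimal sketch of how this looks in the flow (the event values here
are made up for illustration):

.. code-block:: python

    >>> event = ((0.3, 0.2, 0.1), (1.1, 1.2, 1.3))
    >>> data, context = neutron(event)
    >>> data
    (1.1, 1.2, 1.3)
    >>> context["variable"]["latex_name"]
    'n'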
In code, their values can be accessed as the variable's attributes
(e.g. *neutron.latex_name*). The variable's function can be set with the
keyword argument *getter* and is available as its method *getter*.

*MakeFilename* accepts not only constant strings, but also format
strings, which take their arguments from the context. In our example,
*MakeFilename("{{variable.name}}")* creates a file name from
*context.variable.name*. Note also that since the two *Writes* do the
same thing, we rewrote them as one object.

Combine
^^^^^^^

Variables can be joined into a multidimensional variable using
*Combine*. *Combine(var1, var2, ...)* applied to a *value* is a tuple
*((var1.getter(value), var2.getter(value), ...), context)*.
The first element of the tuple is *value* transformed by each of the
combined variables. *Variable.getter* is a function that returns only
data without context.

*Combine* is a subclass of *Variable*, and it accepts arbitrary keywords
during initialization. All positional arguments must be *Variables*.
The name of the combined variable can be passed as a keyword argument;
if not provided, it is the names of its variables joined with an
underscore. The resulting context is that of a usual *Variable* updated
with *context.variable.combine*, where *combine* is a tuple of the
contexts of each variable.

*Combine* has an attribute *dim*, which is the number of its variables.
A constituting variable can be accessed by its index. For example, if
*cv* is *Combine(var1, var2)*, then *cv.dim* is 2, *cv.name* is the
names of *var1* and *var2* joined with an underscore, and *cv[1]* is
*var2*. *Combine* variables are used for multidimensional plots.

Compose
^^^^^^^

When we put several variables or functions into a sequence, we obtain
their composition. In the Lena framework we want to preserve as much
context as possible. If some previous element was a *Variable*, its
context is moved into the *variable.compose* subcontext.
Function composition can also be defined as *variables.Compose*.
In this example we first select the *neutron* part of the data, and then
its *x* coordinate:

.. code-block:: python

    >>> from lena.variables import Variable, Compose
    >>> # data is pairs of (positron, neutron) coordinates
    >>> data = [((1.05, 0.98, 0.8), (1.1, 1.1, 1.3))]
    >>> x = Variable(
    ...     "x", lambda coord: coord[0], type="coordinate"
    ... )
    >>> neutron = Variable(
    ...     "neutron", latex_name="n",
    ...     getter=lambda double_ev: double_ev[1], type="particle"
    ... )
    >>> x_n = Compose(neutron, x)
    >>> x_n(data[0])[0]  # data
    1.1

The data part of the result, as expected, is the composition of the
variables *neutron* and *x* applied to the data. The same result could
be obtained with a sequence of variables,
*Sequence(neutron, x).run(data)*, but the context of *Compose* is
created differently.

The name of the composed variable is the names of its variables
(from left to right) joined with an underscore. If there are two
variables, a LaTeX name will also be created from their names (or LaTeX
names, if present) as a subscript in reverse order. In our example the
context will be this:

.. code-block:: python

    >>> x_n(data[0])[1]
    {'variable': {
        'name': 'neutron_x',
        'particle': 'neutron',
        'latex_name': 'x_{n}',
        'coordinate': 'x',
        'type': 'coordinate',
        'compose': {
            'type': 'particle',
            'latex_name': 'n',
            'name': 'neutron',
            'particle': 'neutron'
        },
    }}

The context of the composed variable is updated with a *compose*
subcontext, which makes it similar to the context produced by variables
in a sequence.
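To make the earlier *Combine* subsection concrete as well, here is a
small sketch (the variables and values are illustrative):

.. code-block:: python

    >>> from lena.variables import Variable, Combine
    >>> x = Variable("x", lambda vec: vec[0])
    >>> y = Variable("y", lambda vec: vec[1])
    >>> xy = Combine(x, y)
    >>> xy.dim
    2
    >>> xy.name
    'x_y'
    >>> xy((0.5, 1.5, 2.5))[0]  # the data part of the result
    (0.5, 1.5)
    >>> xy[1].name
    'y'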
As for any variable, *name* or other parameters can be passed as keyword
arguments during initialization. The keyword *type* has a special
meaning: if present, then during the initialization of a variable its
context is updated with the pair *{variable.type: variable.name}*.
During variable composition (in *Compose* or during subsequent
application in the flow) *context.variable* is updated with the new
variable's context, but a key set by a variable of a different *type*
persists. This allows access to *context.variable.particle* even after
the variable was composed with other variables.

Analysis example
----------------

Let us combine what we've learnt so far and use it in a real analysis.
An important change is that since we now create 2-dimensional plots,
we add another template for them. Below is a small example.
All template commands were explained in the first part of the tutorial.

.. literalinclude:: ../../examples/tutorial/2_split/templates/histogram_2d.tex
    :language: LaTeX
    :caption: tutorial/2_split/templates/histogram_2d.tex

If an axis has a *unit*, it will be added to its label (like *x [cm]*).

*RenderLaTeX* accepts a function as its first initialization argument or
as the keyword *select_template*. That function must accept a value
(presumably a *(data, context)* pair) from the flow and return a
template file name (to be found inside *template_path*).

.. _main4_py:

.. literalinclude:: ../../examples/tutorial/2_split/main4.py
    :caption: tutorial/2_split/main4.py

We import *ReadDoubleEvents* from a separate file. That class is
practically the same as earlier, but it yields pairs of events instead
of single ones.

We define *coordinates_1d* as a simple list of coordinate compositions.
Note that we could create all the combinations directly in Python.
We could also do that in *Split*, but if we use all these coordinates
together in different analyses, or don't want to clutter the algorithm
code, we can keep them separate.

In our new function *select_template* we use
*lena.context.get_recursively*. This function is needed because we often
have nested dictionaries, and Python's *dict.get* method doesn't
recurse. We provide the default return value ``None``, so that it
doesn't raise an exception in case of a missing key.
Another useful function from *lena.context* is
*update_recursively(d, other)*, which updates *d* with *other*
preserving the nested structure (not overwriting nested subdictionaries
in *d*, but extending them with values from *other*).
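A quick sketch of both helpers (the dictionary contents are
illustrative; we assume here that *get_recursively* accepts keys as
dot-separated strings and the default value as its third argument):

.. code-block:: python

    >>> from lena.context import get_recursively, update_recursively
    >>> context = {"variable": {"name": "x"}}
    >>> get_recursively(context, "variable.name", None)
    'x'
    >>> get_recursively(context, "variable.unit", None) is None
    True
    >>> update_recursively(context, {"variable": {"unit": "cm"}})
    >>> context
    {'variable': {'name': 'x', 'unit': 'cm'}}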
In the *Split* element we fill histograms for 1- and 2-dimensional plots
in one run. There are two *MakeFilename* elements, but this creates no
conflict, because *MakeFilename* doesn't overwrite file names set
previously.

We created our first 2-dimensional histogram using *lena.math.mesh*.
It accepts the parameters *ranges* and *nbins*. In the multidimensional
case these parameters are tuples of ranges and numbers of bins in the
corresponding dimensions, as in *mesh(((-10, 10), (-10, 10)), (10, 10))*.

After we run this script, we obtain two subdirectories in *output* for
*positron* and *neutron*, each containing 4 plots (both *pdf* and
*png*); in total 8 plots with proper names, units, axis labels, etc.
It is straightforward to add other plots if we want, or to disable some
of them in *Split* by commenting them out. The variables that we defined
at the top level could be reused in other modules or moved to a separate
module.

Note the overall design of our algorithm. We prepare all the necessary
data in *ReadDoubleEvents*. After that, *Split* uses different parts of
these double events to create different plots. All important parameters
should be contained in the data itself. This allows a separation of data
from presentation.

The material covered by the end of this chapter will be sufficient for
most practical analyses. The following sections give more details about
Lena elements and their usage.

Adapters, elements and sequences
--------------------------------

Objects don't need to inherit from Lena classes to be used in the
framework. Instead, they have to implement methods with specified names
(like *run*, *fill*, etc). This is called structural subtyping in
Python [#f1]_.

The specified method names can be changed using *adapters*.
For example, if we have a legacy class

.. code-block:: python

    class MyEl():
        def my_run(self, flow):
            for val in flow:
                yield val

then we can create a *Run* element from a *MyEl* object with the adapter
*Run*:

.. code-block:: python

    >>> from lena.core import Run
    >>> my_run = Run(MyEl(), run="my_run")
    >>> list(my_run.run([1, 2, 3]))
    [1, 2, 3]

The adapter receives the method name as a keyword argument. After the
adapter is created, its method *run* can be called, or it can be
inserted into a Lena sequence.

Similarly, a *FillCompute* adapter accepts names for the methods *fill*
and *compute*:

.. code-block:: python

    FillCompute(el, fill='fill', compute='compute')

If callable methods *fill* and *compute* are not found in *el*,
a *LenaTypeError* is raised.

What other types of elements are possible in data analysis?
A common algorithm in physics is event selection: we analyse a large set
of data looking for specific events, which may be missing from the data
or be present in large quantities. To deal with this, we must be
prepared not to consume the whole flow (as a *Run* element does) and not
to store the whole flow in the element before it is yielded. We create
an element with a *fill* method, and call its second method *request*.
A *FillRequest* element is similar to *FillCompute*, but *request* can
be called multiple times. As with a *FillComputeSeq*, we can add *Call*
elements (lambdas) before a *FillRequest* element and *Call* or *Run*
elements after it to create a *FillRequestSeq* sequence.

Elements can be transformed one into another. During initialization,
a *Sequence* checks each of its arguments for a *run* method. If that is
missing, it tries to convert the element to a *Run* element using the
adapter. *Run* can be initialized from a *Call* or a *FillCompute*
element. A callable is run as a transformation function, which accepts
single values from the flow and returns their transformations:

.. code-block:: python

    for val in flow:
        yield self._el(val)

A *FillCompute* element is run in the following way: first, *fill(value)*
is called for the whole flow. After the flow is exhausted, *compute()*
is called.

There are algorithms and structures that are inherently not memory safe.
For example, *lena.structures.Graph* stores all filled data as its
points, and it is a *FillRequest* element. Since a *FillRequest* element
can't be used directly in a *Sequence*, or if we want to yield only the
final result once, we cast it with *FillCompute(Graph())*. We can do
that when we are sure that our data won't overflow memory, and the cast
will be explicit in our code.
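To give an idea of the interface, here is a minimal *FillRequest*
element (a sketch, not a framework class): a running sum whose current
value can be requested at any moment.

.. code-block:: python

    class RunningSum():
        def __init__(self):
            self._sum = 0

        def fill(self, val):
            self._sum += val

        def request(self):
            # unlike compute, request may be called many times
            # (for example, after each processed buffer)
            yield self._sum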
To sum up, adapters in Lena can be used for several purposes:

- provide a different name for a method (*Run(my_obj, run="my_run")*),
- hide unused methods to prevent ambiguity (if an element has many
  methods, we can wrap it in an adapter to expose only the needed ones),
- automatically convert objects of one type to another in sequences
  (*FillCompute* to *Run*),
- explicitly cast an object of one type to another
  (*FillRequest* to *FillCompute*).

Split
-----

In the examples above, *Split* contained several *FillComputeSeq*
sequences. However, it can be used with all the other sequences we know.

*Split* has a keyword initialization argument *bufsize*, which is the
size of the buffer for the input flow. During *Split.run(flow)*, the
*flow* is divided into subslices of *bufsize* values. Each subslice is
processed by the sequences in the order of the initialization list
(the first positional argument of *Split.__init__*).

If a sequence is a *Source*, it doesn't accept the incoming *flow*, but
produces its own complete flow and becomes inactive (it is not called
any more). A *FillRequestSeq* is filled with the contents of the buffer.
After the buffer is exhausted, it yields all values from *request()*.
A *FillComputeSeq* is filled with values from each buffer, but yields
values from *compute* only after the whole *flow* is finished.
A *Sequence* is called with *run(buffer)* instead of the whole flow; its
results are yielded for each buffer. If the whole flow must be analysed
at once, don't use such a sequence in *Split*.
Even if the *flow* is empty, each *__call__* (of a *Source*), *compute*,
*request* or *run* is called nevertheless.

A *Source* within *Split* can be used to add new data to the flow.
For example, we can create *Split([source, ()])*: in this place of a
sequence, first all data from *source* will be generated, then all data
from the preceding elements will be passed (an empty *Sequence* passes
values unchanged). This can be used to provide several flows to a
further element (like data, Monte Carlo and an analytical
approximation).

*Split* acts both as a sequence (because it contains sequences) and as
an element. If all its elements (sequences, to be precise) have the same
type, *Split* will have the methods of that type. For example, if
*Split* contains only *FillComputeSeq* sequences, it will have the
methods *fill* and *compute*. During *fill* all its sequences are
filled. During *compute* their results are yielded in turn (all results
from the first sequence, then from the second, etc). A *Split* with
*Source* sequences will act as a *Source*. Of course, a *Split* can be
used within a *Split*.

Context. Performance and safety
-------------------------------

Dictionaries in Python are *mutable*, that is, their content can change.
If an element stores the current context, it may be changed by some
other element. The simplest example: if your original data has context,
it will be changed after being processed by a sequence.

This is how a typical *Run* element deals with context. To be most
useful, it must be prepared to accept data with and without context:

.. code-block:: python

    import lena.context
    import lena.flow

    class RunEl():
        def __init__(self):
            self._context = {"subcontext": "el"}

        def run(self, flow):
            for val in flow:
                data, context = lena.flow.get_data_context(val)
                # ... do something: create new_data ...
                lena.context.update_recursively(context, self._context)
                yield (new_data, context)

*lena.flow.get_data_context(value)* splits *value* into a
*(data, context)* pair. If *value* contained only data without context,
the *context* part will be an empty dictionary (therefore it is safe to
use *get_data_context* with any *value*). If only one part is needed,
*lena.flow.get_data* or *lena.flow.get_context* can be used.

If *subcontext* may contain other keys apart from *el*, then to preserve
them we call not *context.update*, but
*lena.context.update_recursively*. This function doesn't overwrite whole
subdictionaries, but only the conflicting keys within them. In this case
the *context.subcontext* key will always be set to *el*, but if
*self._context.subcontext* were a dictionary *{"el": "el1"}*, then all
keys of *context.subcontext* (if present) except *el* would remain.

Usually elements in a *Sequence* yield computed data and context, and
never use or change them again. In *Split*, however, several sequences
use the same data simultaneously. This is why *Split* makes a deep copy
of the incoming flow in its buffer. A deep copy of a context is
completely independent of the original and of its other copies. However,
copying an entire dictionary has a computational cost. *Split* can be
initialized with the keyword argument *copy_buf*. By default it is
``True``, but it can be set to ``False`` to disable the deep copy of the
flow. This may be a bit faster, but do it only if you are absolutely
sure that your analysis will remain correct.

There are several things in Lena that help against context interference:

- elements change their own context (*Write* changes *context.output*
  and not *context.variable*),
- if *Split* has several sequences, it makes a deep copy of the flow
  before feeding it to them,
- *FillCompute* and *FillRequest* elements make a deep copy of context
  before yielding [#f3]_.

This is how a *FillCompute* element is usually organised in Lena:

.. code-block:: python

    import copy

    import lena.context
    import lena.flow

    class MyFillComputeEl():
        def __init__(self):
            self._val = 0
            self._context = {"subcontext": "el"}
            self._cur_context = {}

        def fill(self, val):
            data, context = lena.flow.get_data_context(val)
            self._val += data
            self._cur_context = context

        def compute(self):
            context = copy.deepcopy(self._cur_context)
            lena.context.update_recursively(context, self._context)
            yield (self._val, context)

During *fill* the last context is saved. During *compute* a deep copy of
it is made (since *compute* is called only once, this can be done
without performance loss), and the copy is updated with *self._context*.

Performance is not the highest priority in Lena, but it is always nice
to have, and optimizations are made where possible. Performance
measurements show that *deepcopy* can take most of the time in a Lena
analysis [#f2]_. A linear *Sequence* and *Run* elements don't deep copy
the data. If *Split* contains several sequences, it doesn't deep copy
the flow for the last sequence. It is possible to circumvent all copying
of data in *Split* to gain more performance, at the cost of more
precautions and more disciplined code.
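The mutability issue that makes these deep copies necessary can be seen
with plain dictionaries (a toy illustration):

.. code-block:: python

    >>> import copy
    >>> context = {"output": {"filename": "x"}}
    >>> alias = context                 # no copy at all
    >>> deep = copy.deepcopy(context)   # fully independent copy
    >>> context["output"]["filename"] = "y"
    >>> alias["output"]["filename"]
    'y'
    >>> deep["output"]["filename"]
    'x'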
-----------

.. rubric:: Summary

Several analyses can be performed on one flow using the element *Split*,
which accepts a list of sequences as its first initialization argument.
Since *Split* divides the flow into buffered slices, elements must be
prepared for that.

In this part of the tutorial we introduced the *FillCompute* and
*FillRequest* elements. The former yields its results when its *compute*
method is called. It is supposed that *FillCompute* is run only once and
that it is memory safe (that it reduces data). If an element can consume
much memory, it must be a *FillRequest* element.

If we add *Call* elements before, and *Run* and *Call* elements after,
our *FillCompute* or *FillRequest* elements, we can generalize them to
the sequences *FillComputeSeq* and *FillRequestSeq*. They are created
implicitly during *Split* initialization.

*Variables* connect functions with context. They have names and can have
LaTeX names, units and other parameters, which helps to create plots and
write output files. *Compose* corresponds to function composition, while
*Combine* creates multidimensional variables for multidimensional plots.

If an element has methods with unusual names, adapters can be used to
relate them to the framework names. Adapters are also used to explicitly
cast one type of element to another, or to implicitly convert an element
to an appropriate type during sequence initialization.

To be most useful, elements should be prepared to accept values
consisting of only data or of data with context. To work safely with a
mutable context, a deep copy of it must be made in *compute* or
*request*. On the other hand, unnecessary deep copies (in *run*, *fill*
or *__call__*) may slightly decrease performance. Lena allows
optimizations if they are needed.

.. rubric:: Exercises

#. Extend the :ref:`Sum <sum>` example in this chapter so that it can
   handle context. Check that it works.

#. In the analysis example :ref:`main4.py <main4_py>` there are two
   *MakeFilename* elements. Is it possible to use only one of them? How?

#. We developed the example :ref:`main2.py <main2_py>` and joined
   *lambda* and *filename* into a *Variable*. We could also add a name
   to the *Histogram*. Which design decision would be better?

#. What are the consequences of calling *compute* even for an empty
   flow?

#. Alexander writes a diploma thesis involving some data analysis and
   wants to choose a framework for that. He asks colleagues and
   professors, and settles on three possible options. One library is
   easy to use and straight to the point, and is sufficient for most
   diploma theses. Another library is very rich and used by seasoned
   professionals, and its full power surpasses even its documentation.
   The third framework doesn't provide plenty of mathematical functions,
   but promises structured and beautiful code. Which one would you
   advise?

.. rubric:: Footnotes

.. [#f1] PEP 544 -- Protocols: Structural subtyping (static duck
   typing): https://www.python.org/dev/peps/pep-0544

.. [#f2] One can use *tutorial/2_split/performance.py* to make a quick
   analysis. To create 3 histograms (as in the
   :ref:`main4.py <main4_py>` example above) for one million generated
   events, it took 82 seconds in Python 2 on a laptop. The largest share
   of the total time was spent in *copy.deepcopy* (20 seconds).
   For Python 3, PyPy and PyPy 3 the total time was 71, 23 and 16
   seconds respectively. These numbers are approximate (a second
   measurement for PyPy gave 19 seconds). If we change the *Variables*
   into *lambdas*, add *MakeFilename* after *Histogram* and set
   *copy_buf=False* in *Split*, the total time becomes 18 seconds for
   Python 2 and 4 seconds for PyPy 3.
   This difference may not be important in practice: for example, the
   author usually deals with data sets of several tens of thousands of
   events, and a large share of the time is spent creating 2-dimensional
   plots with *pdflatex*.

.. [#f3] For framework elements this is obligatory; for user code it is
   recommended.