| Maximising MicroPython Speed |
| ============================ |
| |
| .. contents:: |
| |
| This tutorial describes ways of improving the performance of MicroPython code. |
| Optimisations involving other languages are covered elsewhere, namely the use |
| of modules written in C and the MicroPython inline assembler. |
| |
| The process of developing high performance code comprises the following stages |
| which should be performed in the order listed. |
| |
| * Design for speed. |
| * Code and debug. |
| |
| Optimisation steps: |
| |
| * Identify the slowest section of code. |
| * Improve the efficiency of the Python code. |
| * Use the native code emitter. |
| * Use the viper code emitter. |
| * Use hardware-specific optimisations. |
| |
| Designing for speed |
| ------------------- |
| |
| Performance issues should be considered at the outset. This involves taking a view |
| on the sections of code which are most performance critical and devoting particular |
| attention to their design. The process of optimisation begins when the code has |
| been tested: if the design is correct at the outset optimisation will be |
| straightforward and may actually be unnecessary. |
| |
| Algorithms |
| ~~~~~~~~~~ |
| |
| The most important aspect of designing any routine for performance is ensuring that |
| the best algorithm is employed. This is a topic for textbooks rather than for a |
| MicroPython guide but spectacular performance gains can sometimes be achieved |
| by adopting algorithms known for their efficiency. |
| |
| RAM Allocation |
| ~~~~~~~~~~~~~~ |
| |
| To design efficient MicroPython code it is necessary to have an understanding of the |
| way the interpreter allocates RAM. When an object is created or grows in size |
| (for example where an item is appended to a list) the necessary RAM is allocated |
| from a block known as the heap. This takes a significant amount of time; |
| further it will on occasion trigger a process known as garbage collection which |
| can take several milliseconds. |
| |
| Consequently the performance of a function or method can be improved if an object is created |
| once only and not permitted to grow in size. This implies that the object persists |
| for the duration of its use: typically it will be instantiated in a class constructor |
| and used in various methods. |
| |
| This is covered in further detail :ref:`Controlling garbage collection <controlling_gc>` below. |
| |
| Buffers |
| ~~~~~~~ |
| |
| An example of the above is the common case where a buffer is required, such as one |
| used for communication with a device. A typical driver will create the buffer in the |
| constructor and use it in its I/O methods which will be called repeatedly. |
| |
| The MicroPython libraries typically provide support for pre-allocated buffers. For |
| example, objects which support stream interface (e.g., file or UART) provide `read()` |
| method which allocates new buffer for read data, but also a `readinto()` method |
| to read data into an existing buffer. |
| |
| Floating Point |
| ~~~~~~~~~~~~~~ |
| |
| Some MicroPython ports allocate floating point numbers on heap. Some other ports |
| may lack dedicated floating-point coprocessor, and perform arithmetic operations |
| on them in "software" at considerably lower speed than on integers. Where |
| performance is important, use integer operations and restrict the use of floating |
| point to sections of the code where performance is not paramount. For example, |
| capture ADC readings as integers values to an array in one quick go, and only then |
| convert them to floating-point numbers for signal processing. |
| |
| Arrays |
| ~~~~~~ |
| |
| Consider the use of the various types of array classes as an alternative to lists. |
| The `array` module supports various element types with 8-bit elements supported |
| by Python's built in `bytes` and `bytearray` classes. These data structures all store |
| elements in contiguous memory locations. Once again to avoid memory allocation in critical |
| code these should be pre-allocated and passed as arguments or as bound objects. |
| |
| When passing slices of objects such as `bytearray` instances, Python creates |
| a copy which involves allocation of the size proportional to the size of slice. |
| This can be alleviated using a `memoryview` object. `memoryview` itself |
| is allocated on heap, but is a small, fixed-size object, regardless of the size |
| of slice it points too. |
| |
| .. code:: python |
| |
| ba = bytearray(10000) # big array |
| func(ba[30:2000]) # a copy is passed, ~2K new allocation |
| mv = memoryview(ba) # small object is allocated |
| func(mv[30:2000]) # a pointer to memory is passed |
| |
| A `memoryview` can only be applied to objects supporting the buffer protocol - this |
| includes arrays but not lists. Small caveat is that while memoryview object is live, |
| it also keeps alive the original buffer object. So, a memoryview isn't a universal |
| panacea. For instance, in the example above, if you are done with 10K buffer and |
| just need those bytes 30:2000 from it, it may be better to make a slice, and let |
| the 10K buffer go (be ready for garbage collection), instead of making a |
| long-living memoryview and keeping 10K blocked for GC. |
| |
| Nonetheless, `memoryview` is indispensable for advanced preallocated buffer |
| management. `readinto()` method discussed above puts data at the beginning |
| of buffer and fills in entire buffer. What if you need to put data in the |
| middle of existing buffer? Just create a memoryview into the needed section |
| of buffer and pass it to `readinto()`. |
| |
| Identifying the slowest section of code |
| --------------------------------------- |
| |
| This is a process known as profiling and is covered in textbooks and |
| (for standard Python) supported by various software tools. For the type of |
| smaller embedded application likely to be running on MicroPython platforms |
| the slowest function or method can usually be established by judicious use |
| of the timing ``ticks`` group of functions documented in `utime`. |
| Code execution time can be measured in ms, us, or CPU cycles. |
| |
| The following enables any function or method to be timed by adding an |
| ``@timed_function`` decorator: |
| |
| .. code:: python |
| |
| def timed_function(f, *args, **kwargs): |
| myname = str(f).split(' ')[1] |
| def new_func(*args, **kwargs): |
| t = utime.ticks_us() |
| result = f(*args, **kwargs) |
| delta = utime.ticks_diff(utime.ticks_us(), t) |
| print('Function {} Time = {:6.3f}ms'.format(myname, delta/1000)) |
| return result |
| return new_func |
| |
| MicroPython code improvements |
| ----------------------------- |
| |
| The const() declaration |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| MicroPython provides a ``const()`` declaration. This works in a similar way |
| to ``#define`` in C in that when the code is compiled to bytecode the compiler |
| substitutes the numeric value for the identifier. This avoids a dictionary |
| lookup at runtime. The argument to ``const()`` may be anything which, at |
| compile time, evaluates to an integer e.g. ``0x100`` or ``1 << 8``. |
| |
| .. _Caching: |
| |
| Caching object references |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Where a function or method repeatedly accesses objects performance is improved |
| by caching the object in a local variable: |
| |
| .. code:: python |
| |
| class foo(object): |
| def __init__(self): |
| ba = bytearray(100) |
| def bar(self, obj_display): |
| ba_ref = self.ba |
| fb = obj_display.framebuffer |
| # iterative code using these two objects |
| |
| This avoids the need repeatedly to look up ``self.ba`` and ``obj_display.framebuffer`` |
| in the body of the method ``bar()``. |
| |
| .. _controlling_gc: |
| |
| Controlling garbage collection |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| When memory allocation is required, MicroPython attempts to locate an adequately |
| sized block on the heap. This may fail, usually because the heap is cluttered |
| with objects which are no longer referenced by code. If a failure occurs, the |
| process known as garbage collection reclaims the memory used by these redundant |
| objects and the allocation is then tried again - a process which can take several |
| milliseconds. |
| |
| There may be benefits in pre-empting this by periodically issuing `gc.collect()`. |
| Firstly doing a collection before it is actually required is quicker - typically on the |
| order of 1ms if done frequently. Secondly you can determine the point in code |
| where this time is used rather than have a longer delay occur at random points, |
| possibly in a speed critical section. Finally performing collections regularly |
| can reduce fragmentation in the heap. Severe fragmentation can lead to |
| non-recoverable allocation failures. |
| |
| The Native code emitter |
| ----------------------- |
| |
| This causes the MicroPython compiler to emit native CPU opcodes rather than |
| bytecode. It covers the bulk of the MicroPython functionality, so most functions will require |
| no adaptation (but see below). It is invoked by means of a function decorator: |
| |
| .. code:: python |
| |
| @micropython.native |
| def foo(self, arg): |
| buf = self.linebuf # Cached object |
| # code |
| |
| There are certain limitations in the current implementation of the native code emitter. |
| |
| * Context managers are not supported (the ``with`` statement). |
| * Generators are not supported. |
| * If ``raise`` is used an argument must be supplied. |
| |
| The trade-off for the improved performance (roughly twices as fast as bytecode) is an |
| increase in compiled code size. |
| |
| The Viper code emitter |
| ---------------------- |
| |
| The optimisations discussed above involve standards-compliant Python code. The |
| Viper code emitter is not fully compliant. It supports special Viper native data types |
| in pursuit of performance. Integer processing is non-compliant because it uses machine |
| words: arithmetic on 32 bit hardware is performed modulo 2**32. |
| |
| Like the Native emitter Viper produces machine instructions but further optimisations |
| are performed, substantially increasing performance especially for integer arithmetic and |
| bit manipulations. It is invoked using a decorator: |
| |
| .. code:: python |
| |
| @micropython.viper |
| def foo(self, arg: int) -> int: |
| # code |
| |
| As the above fragment illustrates it is beneficial to use Python type hints to assist the Viper optimiser. |
| Type hints provide information on the data types of arguments and of the return value; these |
| are a standard Python language feature formally defined here `PEP0484 <https://www.python.org/dev/peps/pep-0484/>`_. |
| Viper supports its own set of types namely ``int``, ``uint`` (unsigned integer), ``ptr``, ``ptr8``, |
| ``ptr16`` and ``ptr32``. The ``ptrX`` types are discussed below. Currently the ``uint`` type serves |
| a single purpose: as a type hint for a function return value. If such a function returns ``0xffffffff`` |
| Python will interpret the result as 2**32 -1 rather than as -1. |
| |
| In addition to the restrictions imposed by the native emitter the following constraints apply: |
| |
| * Functions may have up to four arguments. |
| * Default argument values are not permitted. |
| * Floating point may be used but is not optimised. |
| |
| Viper provides pointer types to assist the optimiser. These comprise |
| |
| * ``ptr`` Pointer to an object. |
| * ``ptr8`` Points to a byte. |
| * ``ptr16`` Points to a 16 bit half-word. |
| * ``ptr32`` Points to a 32 bit machine word. |
| |
| The concept of a pointer may be unfamiliar to Python programmers. It has similarities |
| to a Python `memoryview` object in that it provides direct access to data stored in memory. |
| Items are accessed using subscript notation, but slices are not supported: a pointer can return |
| a single item only. Its purpose is to provide fast random access to data stored in contiguous |
| memory locations - such as data stored in objects which support the buffer protocol, and |
| memory-mapped peripheral registers in a microcontroller. It should be noted that programming |
| using pointers is hazardous: bounds checking is not performed and the compiler does nothing to |
| prevent buffer overrun errors. |
| |
| Typical usage is to cache variables: |
| |
| .. code:: python |
| |
| @micropython.viper |
| def foo(self, arg: int) -> int: |
| buf = ptr8(self.linebuf) # self.linebuf is a bytearray or bytes object |
| for x in range(20, 30): |
| bar = buf[x] # Access a data item through the pointer |
| # code omitted |
| |
| In this instance the compiler "knows" that ``buf`` is the address of an array of bytes; |
| it can emit code to rapidly compute the address of ``buf[x]`` at runtime. Where casts are |
| used to convert objects to Viper native types these should be performed at the start of |
| the function rather than in critical timing loops as the cast operation can take several |
| microseconds. The rules for casting are as follows: |
| |
| * Casting operators are currently: ``int``, ``bool``, ``uint``, ``ptr``, ``ptr8``, ``ptr16`` and ``ptr32``. |
| * The result of a cast will be a native Viper variable. |
| * Arguments to a cast can be a Python object or a native Viper variable. |
| * If argument is a native Viper variable, then cast is a no-op (i.e. costs nothing at runtime) |
| that just changes the type (e.g. from ``uint`` to ``ptr8``) so that you can then store/load |
| using this pointer. |
| * If the argument is a Python object and the cast is ``int`` or ``uint``, then the Python object |
| must be of integral type and the value of that integral object is returned. |
| * The argument to a bool cast must be integral type (boolean or integer); when used as a return |
| type the viper function will return True or False objects. |
| * If the argument is a Python object and the cast is ``ptr``, ``ptr``, ``ptr16`` or ``ptr32``, |
| then the Python object must either have the buffer protocol with read-write capabilities |
| (in which case a pointer to the start of the buffer is returned) or it must be of integral |
| type (in which case the value of that integral object is returned). |
| |
| The following example illustrates the use of a ``ptr16`` cast to toggle pin X1 ``n`` times: |
| |
| .. code:: python |
| |
| BIT0 = const(1) |
| @micropython.viper |
| def toggle_n(n: int): |
| odr = ptr16(stm.GPIOA + stm.GPIO_ODR) |
| for _ in range(n): |
| odr[0] ^= BIT0 |
| |
| A detailed technical description of the three code emitters may be found |
| on Kickstarter here `Note 1 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/664832>`_ |
| and here `Note 2 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/665145>`_ |
| |
| Accessing hardware directly |
| --------------------------- |
| |
| .. note:: |
| |
| Code examples in this section are given for the Pyboard. The techniques |
| described however may be applied to other MicroPython ports too. |
| |
| This comes into the category of more advanced programming and involves some knowledge |
| of the target MCU. Consider the example of toggling an output pin on the Pyboard. The |
| standard approach would be to write |
| |
| .. code:: python |
| |
| mypin.value(mypin.value() ^ 1) # mypin was instantiated as an output pin |
| |
| This involves the overhead of two calls to the `Pin` instance's :meth:`~machine.Pin.value()` |
| method. This overhead can be eliminated by performing a read/write to the relevant bit |
| of the chip's GPIO port output data register (odr). To facilitate this the ``stm`` |
| module provides a set of constants providing the addresses of the relevant registers. |
| A fast toggle of pin ``P4`` (CPU pin ``A14``) - corresponding to the green LED - |
| can be performed as follows: |
| |
| .. code:: python |
| |
| import machine |
| import stm |
| |
| BIT14 = const(1 << 14) |
| machine.mem16[stm.GPIOA + stm.GPIO_ODR] ^= BIT14 |