- Dec 2023
-
tonybaloney.github.io
-
Once an interpreter is running (remembering, as I said, that it is preferable to leave them running) you can share data using a channel. The channels module is also part of PEP 554 and is available via a secret import:
```
import _xxsubinterpreters as interpreters
import _xxinterpchannels as channels

interp_id = interpreters.create()
channel_id = channels.create()

interpreters.run_string(
    interp_id,
    """
import _xxinterpchannels as channels
channels.send(channel_id, 'hello!')
""",
    shared={"channel_id": channel_id}
)

print(channels.recv(channel_id))
```
-
To share data, you can use the shared argument and provide a dictionary of shareable values (int, float, bool, bytes, str, None, tuple):
```
import _xxsubinterpreters as interpreters

interp_id = interpreters.create()

interpreters.run_string(
    interp_id,
    "print(message)",
    shared={"message": "hello world!"}
)

interpreters.run_string(
    interp_id,
    """
for message in messages:
    print(message)
""",
    shared={"messages": ("hello world!", "this", "is", "me")}
)

interpreters.destroy(interp_id)
```
-
To start an interpreter that sticks around, you can use interpreters.create(), which returns the interpreter ID. This ID can be used for subsequent .run_string() calls:
```
import _xxsubinterpreters as interpreters

interp_id = interpreters.create()

interpreters.run_string(interp_id, "print('hello world')")
interpreters.run_string(interp_id, "print('hello universe')")

interpreters.destroy(interp_id)
```
-
Starting a sub interpreter is a blocking operation, so most of the time you want to start one inside a thread.
```
from threading import Thread
import _xxsubinterpreters as interpreters

t = Thread(target=interpreters.run, args=("print('hello world')",))
t.start()
```
-
You can create, run and stop a sub interpreter with the .run() function, which takes a string or a simple function:
```
import _xxsubinterpreters as interpreters

interpreters.run('''
print("Hello World")
''')
```
-
Inter-Worker communication
Whether using sub interpreters or multiprocessing, you cannot simply send existing Python objects to workers.

Multiprocessing uses pickle by default. When you start a process or use a process pool, you can use pipes, queues and shared memory as mechanisms for sending data to/from the workers and the main process. These mechanisms revolve around pickling. Pickle is the built-in serialization library for Python that can convert most Python objects into a byte string and back into a Python object.

Pickle is very flexible. You can serialize a lot of different types of Python objects (but not all), and Python objects can even define a method for how they can be serialized. It also handles nested objects and properties. However, with that flexibility comes a performance hit. Pickle is slow. So if you have a worker model that relies upon continuous inter-worker communication of complex pickled data, you’ll likely see a bottleneck.
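As a minimal sketch, this is the round-trip that multiprocessing performs on your behalf whenever an object crosses a process boundary (the payload here is just an illustrative dictionary):

```
import pickle

# Any picklable object; this dictionary is a stand-in for real task data.
payload = {"task": "resize", "sizes": [(640, 480), (1280, 720)]}

data = pickle.dumps(payload)    # object -> byte string
restored = pickle.loads(data)   # byte string -> object
assert restored == payload
```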
Sub interpreters can accept pickled data. They also have a second mechanism called shared data. Shared data is a high-speed shared memory space that interpreters can write to in order to share data with other interpreters. It supports only immutable types; those are:
- Strings
- Byte Strings
- Integers and Floats
- Boolean and None
- Tuples (and tuples of tuples)
To share data with an interpreter, you can either set it as initialization data or you can send it through a channel.
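Condensing both paths into one sketch (this mirrors the fuller snippets above; the string values are just placeholders):

```
import _xxsubinterpreters as interpreters
import _xxinterpchannels as channels

interp_id = interpreters.create()
channel_id = channels.create()

# 1. Initialization data: injected through the shared mapping.
interpreters.run_string(interp_id, "print(greeting)", shared={"greeting": "hi"})

# 2. A channel: send from the main interpreter, receive inside the sub interpreter.
channels.send(channel_id, "payload")
interpreters.run_string(
    interp_id,
    """
import _xxinterpchannels as channels
print(channels.recv(channel_id))
""",
    shared={"channel_id": channel_id},
)

interpreters.destroy(interp_id)
```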
-
The next question when using a parallel execution model like multiprocessing or sub interpreters is how you share data.
Once you get over the hurdle of starting one, this quickly becomes the most important consideration. You have two questions to answer:
- How do we communicate between workers?
- How do we manage the state of workers?
-
Half of the time taken to start an interpreter is taken up running “site import”. This is a special module called site.py that lives within the Python installation. Interpreters have their own caches and their own builtins; they are effectively mini-Python processes. Starting a thread or a coroutine is so fast because it doesn’t have to do any of that work (it shares that state with the owning interpreter), but it’s bound by the lock and isn’t parallel.
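If you want to see this startup cost on your own build, here is a rough timing sketch (numbers will vary by machine and Python version):

```
import time
import _xxsubinterpreters as interpreters

start = time.perf_counter()
interp_id = interpreters.create()
elapsed = time.perf_counter() - start

print(f"interpreter creation took {elapsed * 1000:.2f} ms")
interpreters.destroy(interp_id)
```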
-
What is the difference between threading, multiprocessing, and sub interpreters?
The Python standard library has a few options for concurrent programming, depending on some factors:
- Is the task you’re completing IO-bound (e.g. reading from a network, writing to disk)?
- Does the task require CPU-heavy work, e.g. computation?
- Can the tasks be broken into small chunks or are they large pieces of work?
Here are the models (a short sketch contrasting them follows this list):
- Threads are fast to create, you can share any Python objects between them, and they have a small overhead. Their drawback is that Python threads are bound to the GIL of the process, so if the workload is CPU-intensive then you won’t see any performance gains. Threading is very useful for background, polling tasks like a function that waits and listens for a message on a queue.
- Coroutines are extremely fast to create, you can share any Python objects between them, and they have a minuscule overhead. Coroutines are ideal for IO-based activity that has an underlying API that supports async/await.
- Multiprocessing is a Python wrapper that creates Python processes and links them together. These processes are slow to start, so the workload that you give them needs to be large enough to see the benefit of parallelising it. However, they are truly parallel since each one has its own GIL.
- Sub interpreters have the parallelism of multiprocessing, but with a much faster startup time.
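As a minimal side-by-side sketch of starting one worker under threads, processes, and a sub interpreter (the work function is a hypothetical stand-in, and the sub interpreter call mirrors the snippets above):

```
from threading import Thread
from multiprocessing import Process
import _xxsubinterpreters as interpreters

def work():
    # Stand-in for a real workload.
    print("hello from a worker")

if __name__ == "__main__":
    t = Thread(target=work)    # cheap to start; bound by the GIL
    t.start()
    t.join()

    p = Process(target=work)   # slow to start; truly parallel
    p.start()
    p.join()

    # Parallel like a process, but much faster to start.
    interpreters.run("print('hello from a sub interpreter')")
```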
-