Hypothesis

27 Matching Annotations

Dec 2023
testdriven.io testdriven.io

Speeding Up Python with Concurrency, Parallelism, and asyncio

6
1. GadjiMurad 17 Dec 2023
  
  in Public
  
  When should you use multiprocessing vs asyncio or threading?
  
  Use multiprocessing when you need to do many heavy calculations and you can split them up.
  
  Use asyncio or threading when you're performing I/O operations -- communicating with external resources or reading/writing from/to files.
  
  Multiprocessing and asyncio can be used together, but a good rule of thumb is to fork a process before you thread/use asyncio instead of the other way around -- threads are relatively cheap compared to processes.
  
  multiprocessing asyncio threading comparing
2. GadjiMurad 17 Dec 2023
  
  in Public
  
  When should you use threading, and when should you use asyncio?
  
  When you're writing new code, use asyncio. If you need to interface with older libraries or those that don't support asyncio, you might be better off with threading.
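  One way to bridge the two approaches is sketched below: asyncio code that hands a blocking call to a worker thread with `asyncio.to_thread` (Python 3.9+). The function `legacy_blocking_call` is a hypothetical stand-in for a library without asyncio support, not something from the article.

```python
import asyncio
import time


def legacy_blocking_call(delay: float) -> str:
    """Hypothetical stand-in for a library function with no asyncio support."""
    time.sleep(delay)  # blocks the calling thread, like a synchronous client would
    return f"done after {delay}s"


async def main() -> list[str]:
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so the event loop stays free while the calls wait.
    return await asyncio.gather(
        asyncio.to_thread(legacy_blocking_call, 0.2),
        asyncio.to_thread(legacy_blocking_call, 0.1),
    )


results = asyncio.run(main())
print(results)
```

  `asyncio.gather` returns results in the order the awaitables were passed, regardless of which call finished first.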
  
  asyncio threading comparing
3. GadjiMurad 17 Dec 2023
  
  in Public
  
  Why is the asyncio method always a bit faster than the threading method?
  
  This is because when we use the "await" syntax, we essentially tell our program "hold on, I'll be right back," but our program keeps track of how long it takes us to finish what we're doing. Once we're done, our program will know, and will pick back up as soon as it's able. Threading in Python allows asynchronicity, but our program could theoretically skip around different threads that may not yet be ready, wasting time if there are threads ready to continue running.
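  A minimal sketch of that behaviour, assuming three made-up tasks that each wait 0.3 seconds: every `await` hands control back to the event loop, so the waits overlap and the total is close to one wait rather than three.

```python
import asyncio
import time


async def fetch(name: str, delay: float) -> str:
    # "await" hands control back to the event loop until the sleep finishes.
    await asyncio.sleep(delay)
    return name


async def main() -> list[str]:
    return await asyncio.gather(fetch("a", 0.3), fetch("b", 0.3), fetch("c", 0.3))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```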
  
  asyncio threading comparing
4. GadjiMurad 17 Dec 2023
  
  in Public
  
  What is a thread?
  
  A thread is a way of allowing your computer to break up a single process/program into many lightweight pieces that execute in parallel. Somewhat confusingly, Python's standard implementation of threading limits threads to only being able to execute one at a time due to something called the Global Interpreter Lock (GIL). The GIL is necessary because CPython's (Python's default implementation) memory management is not thread-safe. Because of this limitation, threading in Python is concurrent, but not parallel. To get around this, Python has a separate multiprocessing module not limited by the GIL that spins up separate processes, enabling parallel execution of your code. Using the multiprocessing module is nearly identical to using the threading module.
  
  Asynchronous nature of threading: as one function waits, another one begins, and so on.
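  The "one waits, another begins" behaviour can be illustrated with a small sketch. The worker function and the 0.3 second delay are illustrative assumptions; the point is that a sleeping thread releases the GIL, so three waits overlap even though only one thread executes Python code at a time.

```python
import threading
import time


def wait_then_record(name: str, log: list) -> None:
    time.sleep(0.3)    # sleeping releases the GIL, so other threads can run
    log.append(name)   # list.append is thread-safe in CPython


log: list = []
threads = [
    threading.Thread(target=wait_then_record, args=(f"t{i}", log))
    for i in range(3)
]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(sorted(log), f"{elapsed:.2f}s")
```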
  
  threading python GIL definition
5. GadjiMurad 16 Dec 2023
  
  in Public
  
  when we join threads with thread.join(), all we're doing is ensuring the thread has finished before continuing on with our code.
  
  python threading
6. GadjiMurad 16 Dec 2023
  
  in Public
  
  Creating a thread is not the same as starting a thread, however. To start your thread, use {the name of your thread}.start(). Starting a thread means "starting its execution."
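  The difference between creating, starting, and joining a thread can be sketched as follows. The `work` function is a hypothetical placeholder.

```python
import threading

results = []


def work() -> None:
    results.append("ran")


t = threading.Thread(target=work)   # created, but not running yet
created_alive = t.is_alive()        # False: nothing has executed
created_results = list(results)     # still empty

t.start()                           # begins executing work() in a new thread
t.join()                            # block until work() has returned

print(created_alive, created_results, results, t.is_alive())
```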
  
  threading python
Visit annotations in context

Tags

asyncio

multiprocessing

comparing

python

GIL

threading

definition

Annotators

GadjiMurad

URL

testdriven.io/blog/concurrency-parallelism-asyncio/
pythonspeed.com pythonspeed.com

Python’s multiprocessing performance problem

2
1. GadjiMurad 15 Dec 2023
  
  in Public
  
  Running the code in a subprocess is much slower than running a thread, not because the computation is slower, but because of the overhead of copying and (de)serializing the data. So how do you avoid this overhead?
  
  Reducing the performance hit of copying data between processes:
  
  Option #1: Just use threads
  
  Processes have overhead, threads do not. And while it’s true that generic Python code won’t parallelize well when using multiple threads, that’s not necessarily true for your Python code. For example, NumPy releases the GIL for many of its operations, which means you can use multiple CPU cores even with threads.
  
```
# numpy_gil.py
import numpy as np
from time import time
from multiprocessing.pool import ThreadPool

arr = np.ones((1024, 1024, 1024))

start = time()
for i in range(10):
    arr.sum()
print("Sequential:", time() - start)

expected = arr.sum()

start = time()
with ThreadPool(4) as pool:
    result = pool.map(np.sum, [arr] * 10)
    assert result == [expected] * 10
print("4 threads:", time() - start)
```

When run, we see that NumPy uses multiple cores just fine when using threads, at least for this operation:

$ python numpy_gil.py
Sequential: 4.253053188323975
4 threads: 1.3854241371154785

Pandas is built on NumPy, so many numeric operations will likely release the GIL as well. However, anything involving strings, or Python objects in general, will not. So another approach is to use a library like Polars, which is designed from the ground up for parallelism, to the point where you don’t have to think about it at all: it has an internal thread pool.

Option #2: Live with it

If you’re stuck with using processes, you might just decide to live with the overhead of pickling. In particular, if you minimize how much data gets passed back and forth between processes, and the computation in each process is significant enough, the cost of copying and serializing data might not significantly impact your program’s runtime. Spending a few seconds on pickling doesn’t really matter if your subsequent computation takes 10 minutes.

Option #3: Write the data to disk

Instead of passing data directly, you can write the data to disk, and then pass the path to this file:

* to the subprocess (as an argument)
* to the parent process (as the return value of the function running in the worker process).

The recipient process can then parse the file.

```
import pandas as pd
import multiprocessing as mp
from pathlib import Path
from tempfile import mkdtemp
from time import time

def noop(df: pd.DataFrame):
    # real code would process the dataframe here
    pass

def noop_from_path(path: Path):
    df = pd.read_parquet(path, engine="fastparquet")
    # real code would process the dataframe here
    pass

def main():
    df = pd.DataFrame({"column": list(range(10_000_000))})

    with mp.get_context("spawn").Pool(1) as pool:
        # Pass the DataFrame to the worker process
        # directly, via pickling:
        start = time()
        pool.apply(noop, (df,))
        print("Pickling-based:", time() - start)

        # Write the DataFrame to a file, pass the path to
        # the file to the worker process:
        start = time()
        path = Path(mkdtemp()) / "temp.parquet"
        df.to_parquet(
            path,
            engine="fastparquet",
            # Run faster by skipping compression:
            compression="uncompressed",
        )
        pool.apply(noop_from_path, (path,))
        print("Parquet-based:", time() - start)

if __name__ == "__main__":
    main()
```

  Option #4: multiprocessing.shared_memory
  
  Because processes sometimes do want to share memory, operating systems typically provide facilities for explicitly creating shared memory between processes. Python wraps these facilities in the multiprocessing.shared_memory module.
  
  However, unlike threads, where the same memory address space allows trivially sharing Python objects, in this case you’re mostly limited to sharing arrays. And as we’ve seen, NumPy releases the GIL for expensive operations, which means you can just use threads, which is much simpler. Still, in case you ever need it, it’s worth knowing this module exists.
  
  Note: The module also includes ShareableList, which is a bit like a Python list but limited to int, float, bool, small str and bytes, and None. But this doesn’t help you cheaply share an arbitrary Python object.
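  For orientation, here is a minimal sketch of the module's basic lifecycle. To stay self-contained it attaches the second handle from the same process, which is an assumption made for illustration; in real use the block's name would be passed to a worker process, which would attach there.

```python
from multiprocessing import shared_memory

# Create a named block of shared memory and write some bytes into it.
owner = shared_memory.SharedMemory(create=True, size=16)
owner.buf[:5] = b"hello"

# A second handle attaches by name; in real code this would happen in
# another process that was given `owner.name`.
reader = shared_memory.SharedMemory(name=owner.name)
seen = bytes(reader.buf[:5])   # copy the bytes out of the shared block
print(seen)

reader.close()   # each handle closes its own mapping
owner.close()
owner.unlink()   # the creator removes the block when everyone is done
```

  Only the raw buffer is shared, which is why this approach fits arrays of numbers far better than arbitrary Python objects.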
  
  A bad option for Linux: the "fork" context
  
  You may have noticed we did multiprocessing.get_context("spawn").Pool() to create a process pool. This is because Python has multiple implementations of multiprocessing on some OSes. "spawn" is the only option on Windows, the only non-broken option on macOS, and available on Linux. When using "spawn", a completely new process is created, so you always have to copy data across.
  
  On Linux, the default is "fork": the new child process has a complete copy of the memory of the parent process at the time of the child process’ creation. This means any objects in the parent (arrays, giant dicts, whatever) that were created before the child process was created, and were stored somewhere helpful like a module, are accessible to the child. Which means you don’t need to pickle/unpickle to access them.
  
  Sounds useful, right? There’s only one problem: the "fork" context is super-broken, which is why it will stop being the default in Python 3.14.
  
  Consider the following program:
  
```
import threading
import sys
from multiprocessing import Process

def thread1():
    for i in range(1000):
        print("hello", file=sys.stderr)

threading.Thread(target=thread1).start()

def foo():
    pass

Process(target=foo).start()
```

On my computer, this program consistently deadlocks: it freezes and never exits. Any time you have threads in the parent process, the "fork" context can cause deadlocks, or even corrupted memory, in the child process.

You might think that you’re fine because you don’t start any threads. But many Python libraries start a thread pool on import, for example NumPy. If you’re using NumPy, Pandas, or any other library that depends on NumPy, you are running a threaded program, and therefore at risk of deadlocks, segfaults, or data corruption when using the "fork" multiprocessing context. For more details see this article on why multiprocessing’s default is broken on Linux.

You’re just shooting yourself in the foot if you take this approach.

subprocess comparing threading thread pools multiprocessing python parallelism
2. GadjiMurad 15 Dec 2023
  
  in Public
  
  Threads vs. processes
  
  Multiple threads let you run code in parallel, potentially on multiple CPUs. On Python, however, the global interpreter lock makes this parallelism harder to achieve.
  
  Multiple processes also let you run code in parallel—so what’s the difference between threads and processes?
  
  All the threads inside a single process share the same memory address space. If thread 1 in a process stores some memory at address 0x7f0cd1a88810, thread 2 can access the same memory at the same address. That means passing objects between threads is cheap: you just need to get the pointer to the memory address from one thread to the other. A memory address is 8 bytes: this is not a lot of data to move around.
  
  In contrast, processes do not share the same memory space. There are some shared memory facilities provided by the operating system, typically, and we’ll get to that later. But by default, no memory is shared. That means you can’t just share the address of your data across processes: you have to copy the data.
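  The contrast can be sketched without starting a second process. In the illustrative snippet below, a thread receives the very same object as the main thread, while a pickle round trip (the serialization step multiprocessing uses, per the annotation above) yields an independent copy. The `data` dictionary is made up for the example.

```python
import pickle
import threading

data = {"values": [1, 2, 3]}
seen_ids = []


def worker(obj: dict) -> None:
    seen_ids.append(id(obj))     # same object identity as in the main thread
    obj["values"].append(4)      # the mutation is visible to every thread


t = threading.Thread(target=worker, args=(data,))
t.start()
t.join()

# Crossing a process boundary means serializing: the receiver gets a copy.
copy = pickle.loads(pickle.dumps(data))
copy["values"].append(5)

print(seen_ids[0] == id(data), data, copy)
```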
  
  python threading processes shared_data tips
Visit annotations in context

Tags

multiprocessing

processes

thread pools

comparing

python

shared_data

parallelism

threading

tips

subprocess

Annotators

GadjiMurad

URL

pythonspeed.com/articles/faster-multiprocessing-pickle/
www.bitecode.dev www.bitecode.dev

The easy way to concurrency and parallelism with Python stdlib

1
1. GadjiMurad 11 Dec 2023
  
  in Public
  
  You can distribute work to a bunch of process workers or thread workers with a few lines of code:
  
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=5) as executor:
    # do_something_blocking is a placeholder for your own blocking function
    executor.submit(do_something_blocking)
```
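
  A slightly fuller sketch, showing how the imported `as_completed` might be used to collect results. The `do_something_blocking` function and its delays are hypothetical stand-ins for real blocking work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def do_something_blocking(delay: float) -> float:
    """Hypothetical blocking task: sleep, then report how long it slept."""
    time.sleep(delay)
    return delay


with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(do_something_blocking, d) for d in (0.3, 0.1, 0.2)]
    # as_completed yields futures as they finish, not in submission order.
    finished = [future.result() for future in as_completed(futures)]

print(finished)
```

  Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` keeps the same interface but runs the tasks in worker processes, in which case the submitted function and its arguments must be picklable.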

python concurrency parallelism thread_pools threading
Visit annotations in context

Tags

parallelism

threading

concurrency

thread_pools

python

Annotators

GadjiMurad

URL

bitecode.dev/p/the-easy-way-to-concurrency-and-parallelism
tonybaloney.github.io tonybaloney.github.io

Running Python Parallel Applications with Sub Interpreters

3
1. GadjiMurad 07 Dec 2023
  
  in Public
  
  Half of the time taken to start an interpreter is taken up running “site import”. This is a special module called site.py that lives within the Python installation. Interpreters have their own caches, their own builtins, they are effectively mini-Python processes. Starting a thread or a coroutine is so fast because it doesn’t have to do any of that work (it shares that state with the owning interpreter), but it’s bound by the lock and isn’t parallel.
  
  python subinterpreter comparing threading couroutine
2. GadjiMurad 07 Dec 2023
  
  in Public
  
  Threads are only parallel with IO-bound tasks
  
  threading
3. GadjiMurad 07 Dec 2023
  
  in Public
  
  What is the difference between threading, multiprocessing, and sub interpreters?
  
  The Python standard library has a few options for concurrent programming, depending on some factors:
  
  Is the task you’re completing IO-bound (e.g. reading from a network, writing to disk)
  
  Does the task require CPU-heavy work, e.g. computation
  
  Can the tasks be broken into small chunks or are they large pieces of work?
  
  Here are the models:
  
  Threads are fast to create, you can share any Python objects between them, and they have a small overhead. Their drawback is that Python threads are bound to the GIL of the process, so if the workload is CPU-intensive then you won’t see any performance gains. Threading is very useful for background, polling tasks like a function that waits and listens for a message on a queue.
  
  Coroutines are extremely fast to create, you can share any Python objects between them, and they have a minuscule overhead. Coroutines are ideal for IO-based activity that has an underlying API that supports async/await.
  
  Multiprocessing is a Python wrapper that creates Python processes and links them together. These processes are slow to start, so the workload that you give them needs to be large enough to see the benefit of parallelising the workload. However, they are truly parallel since each one has its own GIL.
  
  Sub interpreters have the parallelism of multiprocessing, but with a much faster startup time.
  
  threading multiprocessing subinterpreter python comparing
Visit annotations in context

Tags

couroutine

multiprocessing

comparing

python

threading

subinterpreter

Annotators

GadjiMurad

URL

tonybaloney.github.io/posts/sub-interpreter-web-workers.html
Sep 2023
github.com github.com

Make returning replies separately optional · hypothesis/h@dc499a9

1
1. kael 13 Sep 2023
  
  in Public
  
  Add a new, undocumented separate_replies=True option to the search API. If the separate_replies=True option is _not_ given to the search API, then it reverts to its previous behaviour: _do_ include replies in the "rows" list returned. This is the same behaviour that the search API had before: it returns both top-level annotations and replies in the one "rows" list, but without any guarantee that if some annotations/replies from a given thread are in the list then all annotations/replies from that thread will be in it. If separate_replies=True _is_ given then the API follows the new behaviour: "rows" contains top-level annotations only, and a separate "replies" list containing all replies to the annotations in "rows" is also inserted into the result.
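  The two result shapes can be illustrated with a small sketch. This is not the code from the commit: it only splits a flat list that is already in hand, whereas the real API looks up all replies to the returned rows. It also assumes that a reply is any annotation carrying a non-empty "references" list, and the sample data is invented.

```python
def separate_replies(rows: list[dict]) -> dict:
    """Split a flat list of annotations into top-level rows and replies.

    Assumption: a reply is any annotation with a non-empty "references" list.
    """
    top_level = [a for a in rows if not a.get("references")]
    replies = [a for a in rows if a.get("references")]
    return {"rows": top_level, "replies": replies}


# Hypothetical data, not real API output.
flat = [
    {"id": "a1"},
    {"id": "r1", "references": ["a1"]},
    {"id": "a2"},
]
result = separate_replies(flat)
print(result)
```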
  
  hypothesis api annotations threading
Visit annotations in context

Tags

hypothesis

api

annotations

threading

Annotators

kael

URL

github.com/hypothesis/h/commit/dc499a9023a8243ca35515aa4298db7d04bdf12d
Dec 2022
support.google.com support.google.com

Learn when to use & organize a space - Computer - Gmail Help

3
1. TylerRick 07 Dec 2022
  
  in Public
  
  Google new feature in-line threading Gmail
2. TylerRick 07 Dec 2022
  
  in Public
  
  Easy to scan and understand what’s discussed in the space. Fewer distractions to help you focus on topics you care about. Easy to browse topics because they’re all in one place in the thread navigation panel. Thread replies don’t interrupt the main conversation. You can toggle history on and off.
  
  advantages/merits/pros in-line threading
3. TylerRick 07 Dec 2022
  
  in Public
  
  You can find some benefits and limitations of each kind of space organization below.
  
  trade-offs understand the trade-offs in-line threading see content below pros & cons
Visit annotations in context

Tags

pros & cons

see content below

understand the trade-offs

trade-offs

new feature

Google

Gmail

advantages/merits/pros

in-line threading

Annotators

TylerRick

URL

support.google.com/mail/answer/12176488
Aug 2022
www.slideshare.net www.slideshare.net

Salmon Protocol - OpenWebTO

1
1. kael 22 Aug 2022
  
  in Public
  
  handled5
  what the receiver does with the content is (wisely) out of scope
  suggestions for two patterns:
  * reply: specify atom thr:in-reply-to
  * mention: include rel="mentioned"
  
```xml
<entry xmlns='http://www.w3.org/2005/Atom'>
  <id>tag:example.com,2009:cmt-0.44775718</id>
  <author>
    <name>test@example.com</name>
    <uri>bob@example.com</uri>
  </author>
  <thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0'
      ref='tag:blogger.com,1999:blog-893591374313312737.post-3861663258538857954'>
    tag:blogger.com,1999:blog-893591374313312737.post-3861663258538857954</thr:in-reply-to>
  <content>Salmon swim upstream!</content>
  <title>Salmon swim upstream!</title>
  <updated>2009-12-18T20:04:03Z</updated>
</entry>
```
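
  A receiver could read the reply target out of such an entry roughly as follows. This is a sketch using Python's standard `xml.etree.ElementTree`; the trimmed-down entry is adapted from the slide's example, and what a real Salmon endpoint does with the result is out of scope here.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes in ElementTree's "{uri}tag" notation.
ATOM = "{http://www.w3.org/2005/Atom}"
THR = "{http://purl.org/syndication/thread/1.0}"

entry_xml = """
<entry xmlns='http://www.w3.org/2005/Atom'
       xmlns:thr='http://purl.org/syndication/thread/1.0'>
  <id>tag:example.com,2009:cmt-0.44775718</id>
  <thr:in-reply-to ref='tag:blogger.com,1999:blog-893591374313312737.post-3861663258538857954'/>
  <content>Salmon swim upstream!</content>
</entry>
"""

entry = ET.fromstring(entry_xml)
reply_to = entry.find(f"{THR}in-reply-to")
# The "ref" attribute identifies the entry being replied to.
parent_ref = reply_to.get("ref") if reply_to is not None else None
comment_id = entry.findtext(f"{ATOM}id")
print(comment_id, "replies to", parent_ref)
```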

salmon social comments webmention threading cito:cites=urn:ietf:rfc:4685
Visit annotations in context

Tags

comments

cito:cites=urn:ietf:rfc:4685

threading

social

salmon

webmention

Annotators

kael

URL

slideshare.net/walkah/salmon-protocol-openwebto
Jul 2022
developer.twitter.com developer.twitter.com

Conversation ID

1
1. kael 10 Jul 2022
  
  in Public
  
  twitter id threading
Visit annotations in context

Tags

threading

id

twitter

Annotators

kael

URL

developer.twitter.com/en/docs/twitter-api/conversation-id
May 2022
datatracker.ietf.org datatracker.ietf.org

RFC 4685 - Atom Threading Extensions

1
1. kael 06 May 2022
  
  in Public
  
  atom xml threading in-reply-to urn:ietf:rfc:4685
Visit annotations in context

Tags

in-reply-to

threading

atom

xml

urn:ietf:rfc:4685

Annotators

kael

URL

datatracker.ietf.org/doc/html/rfc4685
Apr 2022
developers.google.com developers.google.com

IMAP Extensions | Gmail IMAP | Google Developers

1
1. kael 23 Apr 2022
  
  in Public
  
  gmail imap threading tags urn:google:rfc:gmail-imap-x-gm-ext-1
Visit annotations in context

Tags

urn:google:rfc:gmail-imap-x-gm-ext-1

threading

gmail

imap

tags

Annotators

kael

URL

developers.google.com/gmail/imap/imap-extensions
klara.student.utwente.nl klara.student.utwente.nl

Untitled document

1
1. kael 23 Apr 2022
  
  in Public
  
  The XCONVERSATIONS extension to the Internet Message Access Protocol allows messages to be linked into conversations, automatically and persistently, and entirely in the server.
  
  imap fastmail threading conversations urn:ietf:id:draft-banks-imap-conversations
Visit annotations in context

Tags

conversations

threading

urn:ietf:id:draft-banks-imap-conversations

imap

fastmail

Annotators

kael

URL

klara.student.utwente.nl/~stephan/draft-banks-imap-conversations.txt
github.com github.com

draft-rfcs/draft-banks-imap-conversations.xml at master · gnb/draft-rfcs

1
1. kael 23 Apr 2022
  
  in Public
  
  XCONVERSATIONS - FastMail IMAP Extension for Conversations
  
  imap fastmail threading conversations urn:ietf:id:draft-banks-imap-conversations
Visit annotations in context

Tags

conversations

threading

urn:ietf:id:draft-banks-imap-conversations

imap

fastmail

Annotators

kael

URL

github.com/gnb/draft-rfcs/blob/master/draft-banks-imap-conversations.xml
Dec 2021
cr.yp.to cr.yp.to

Untitled document

1
1. kael 01 Dec 2021
  
  in Public
  
  Writers use References to indicate that a message has a parent. The last identifier in References identifies the parent. The first identifier in References identifies the first article in the same thread. There may be more identifiers in References, with grandparents preceding parents, etc. (The basic idea is that a writer should copy References from the parent and append the parent's Message-ID. However, if there are more than about ten identifiers listed, the writer should eliminate the second one.)
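  The rule in the parenthetical can be sketched as a small function. The name `build_references` and the exact cutoff `max_ids=10` are assumptions for illustration (the text only says "about ten"), and the Message-IDs are invented.

```python
def build_references(parent_references: list[str], parent_message_id: str,
                     max_ids: int = 10) -> list[str]:
    """Build the References list for a reply.

    max_ids=10 is an assumption standing in for "about ten".
    """
    # Copy the parent's References and append the parent's Message-ID.
    references = list(parent_references) + [parent_message_id]
    # Keep the first identifier (thread root) and trim from the second onward.
    while len(references) > max_ids:
        del references[1]
    return references


refs = build_references(["<root@example.com>", "<a@example.com>"], "<b@example.com>")
print(refs)
```

  Dropping the second identifier rather than the first keeps the thread root, and the parent always stays last, so both ends of the chain remain usable for threading.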
  
  threading email
Visit annotations in context

Tags

email

threading

Annotators

kael

URL

cr.yp.to/immhf/thread.html
www.jwz.org www.jwz.org

message threading

1
1. kael 01 Dec 2021
  
  in Public
  
  threading email
Visit annotations in context

Tags

email

threading

Annotators

kael

URL

jwz.org/doc/threading.html
Oct 2020
robsteranium.github.io robsteranium.github.io

One Billion Rows

1
1. SamRose 18 Oct 2020
  
  in Public
  
  clojure rdf grafter swirrl threading
Visit annotations in context

Tags

rdf

clojure

threading

grafter

swirrl

Annotators

SamRose

URL

robsteranium.github.io/presentations/grafter/
Nov 2019
realpython.com realpython.com

An Intro to Threading in Python – Real Python

1
1. kip2 19 Nov 2019
  
  in Public
  
  threading python
Visit annotations in context

Tags

python

threading

Annotators

kip2

URL

realpython.com/intro-to-python-threading/
Nov 2018
github.com github.com

Single Annotation card view of a reply should link to it's parent · Issue #1959 · hypothesis/h

1
1. judell 09 Nov 2018
  
  in Public
  
  h_github_repo threading
Visit annotations in context

Tags

h_github_repo

threading

Annotators

judell

URL

github.com/hypothesis/h/issues/1959
