As our choral practice groups have been growing in size, I’ve been keeping a closer eye on scalability. Today, I was able to complete a (synthetic) test that simulated a performance involving 500 live participants, all connected via a single JackTrip server! This article is for the larger groups out there, about how to scale.
Any peer-to-peer based solution (JamKazam, SoundJack, etc.) has inherent architectural limitations that will prevent it from scaling beyond several participants. I’ll save a more detailed explanation for another article, on another day, and focus instead on the two most promising client-server based solutions: Jamulus and JackTrip.
Jamulus uses a single “worker” thread to do most of its work. I’m not familiar with the reasons for this design, but generally it is much more difficult to build software that runs efficiently across multiple threads. The significant downside is that Jamulus fails to utilize the many cores that are available in any modern CPU.
Jamulus currently has a hard-coded maximum of 50 “channels,” where each channel is equivalent to one participant. This is for good reason, because after about 30 or 40 channels, you will max out the capacity of its worker thread. At best, this will lead to a very bad audio experience, and at worst a crash. It doesn’t matter how beefy your server is. In its current form, Jamulus is simply incapable of surpassing this limit.
JackTrip is multi-threaded, and very efficient at utilizing CPU resources. It scaled linearly for my test, using about 8 cores for every 100 clients. Even at 400 clients, memory usage was extremely low and there were no errors in the jacktrip or jackd process logs. However, the journey of scaling was not without pitfalls.
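To make the rule of thumb above concrete, here is a minimal sketch that turns "about 8 cores for every 100 clients" into a rough core estimate. The function name and the rounding choice are illustrative assumptions, and the ratio comes only from this one synthetic test, so treat the result as a starting point rather than a guarantee.

```python
import math

# Ratio observed in this article's synthetic test; an assumption for any other setup.
CORES_PER_100_CLIENTS = 8


def estimate_cores(clients, cores_per_100=CORES_PER_100_CLIENTS):
    """Estimate how many cores JackTrip might use, assuming linear scaling."""
    if clients < 0:
        raise ValueError("clients must be non-negative")
    # Round up: a fractional core still has to come from somewhere.
    return math.ceil(clients * cores_per_100 / 100)


if __name__ == "__main__":
    for n in (100, 254, 400, 500):
        print(n, "clients ->", estimate_cores(n), "cores (rough estimate)")
```

This only covers JackTrip's own threads. It says nothing about the single busy jackd thread, which turned out to be the real ceiling in my test.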
The first bottleneck you will likely encounter is hard-coded into JackTrip: it does not allow more client connections than vCPUs available. There is a patch currently available in the “chris” branch that removes this.
The next bottleneck I hit was a compile-time limit in the jackd server used by JackTrip. This sets the maximum number of clients, and the default is 62. You can read more about it here. Thankfully, it’s not hard to build a custom jackd from source to increase this limit to 512 (see below).
The next limit I hit was at 100. From the jackd logs:
client 220.127.116.11-99 has 99 extra instances already
Cannot read socket fd = 308 err = No such file or directory
Unknown request 0
My first attempts used a c5.24xlarge (96 vCPU) EC2 instance to run the JackTrip server, and a separate c5.24xlarge instance to run this jacktrip_load.py script. I learned that when Jack assigns names to clients, it uses a two-digit format, making “99” the max. This was a problem specific to my synthetic test. My subsequent attempts worked around it by using multiple c5.9xlarge (36 vCPU) instances to run jacktrip_load.py, each running up to 100 clients.
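The workaround described above amounts to splitting the simulated clients across several load-generator hosts so that no single host exceeds the per-host cap. A minimal sketch of that split follows; split_clients is a hypothetical helper for illustration and is not part of jacktrip_load.py.

```python
def split_clients(total_clients, max_per_host=100):
    """Split a client count into per-host batches, none larger than max_per_host.

    max_per_host defaults to 100, the per-host cap used in this test.
    """
    if total_clients < 0:
        raise ValueError("total_clients must be non-negative")
    if max_per_host <= 0:
        raise ValueError("max_per_host must be positive")
    full_hosts, remainder = divmod(total_clients, max_per_host)
    batches = [max_per_host] * full_hosts
    if remainder:
        batches.append(remainder)
    return batches


if __name__ == "__main__":
    # One list entry per load-generator host.
    print(split_clients(500))
    print(split_clients(254))
```

Each entry in the returned list is the number of clients one load-generator host would run.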
After this, I hit another limit at around 144 clients: jackd ran out of file handles and crashed. I’m using Linux for these tests, so it was easy enough to fix with a few extra limits.conf lines (also bumping the number of processes to be safe):
@audio soft nproc 200000
@audio hard nproc 200000
@audio soft nofile 200000
@audio hard nofile 200000
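Before starting jackd, it can help to confirm that the new limits actually apply to the session you are in, since limits.conf changes typically only take effect on a fresh login. The sketch below uses Python's standard resource module (Unix only) to read the current limits. The helper names are illustrative, and the 200000 threshold simply mirrors the values above.

```python
import resource


def current_limits():
    """Return {name: (soft, hard)} for the open-file and process limits."""
    limits = {}
    for name in ("RLIMIT_NOFILE", "RLIMIT_NPROC"):
        limits[name] = resource.getrlimit(getattr(resource, name))
    return limits


def soft_limit_at_least(name, required):
    """True if the soft limit for `name` is unlimited or >= required."""
    soft, _hard = resource.getrlimit(getattr(resource, name))
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
    if not soft_limit_at_least("RLIMIT_NOFILE", 200000):
        print("Open-file limit is below 200000; the limits.conf change may not be active yet.")
```

Run it as the same user, and in the same kind of session, that will launch jackd, because limits are inherited per process.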
At 254 clients, I hit another jackd limit that was causing JackTrip to crash with the following output:
Waiting for Peer…
JackTrip HUB SERVER: Total Running Threads: 254
===============================================================
Received Connection from Peer!
spawning jacktripWorker so change patch
Cannot open lsp client
Cannot read socket fd = 1788 err = Success
CheckRes error
JackSocketClientChannel read fail
JackShmReadWritePtr1::~JackShmReadWritePtr1 - Init not done for -1, skipping unlock
jack_client_open() failed, status = 0x%2.0x 33
jackd was logging numerous errors, but they started out with this:
shm registry full
Cannot create shared memory segment of size = 426
JackShmMem::new bad alloc
Cannot open client
Cannot create new client
CheckSize error size = 0 Size() = 12
CheckSize error size = 3 Size() = 12
Unknown request 0
Unknown request 4294967295
Some digging in the jackd source code led me to the MAX_SHM_ID constant in shm.h, which is set by default to 256. You can increase this and the number of clients using the following patch:
Here are the steps to build a custom jackd from source that includes these changes:
git clone https://github.com/jackaudio/jack2.git
cd jack2
patch -p1 < PATH_TO_FILE/jack_limits.patch
./waf configure
./waf
./waf install
All of these changes enabled me to scale my JackTrip server up to handle 500 clients:
------------------------------------------------------------
UDP Socket Receiving in Port: 61499
------------------------------------------------------------
Waiting for Peer…
Received Connection from Peer!
JackTrip HUB SERVER: Total Running Threads: 500
============================================================
spawning jacktripWorker so change patch
JackTrip HUB SERVER: Waiting for client connections…
JackTrip HUB SERVER: Hub auto audio patch setting = 0
============================================================
At this point, I was reaching the ceiling for another worker thread that appears inside of jackd:
  PID USER    PR  NI   VIRT    RES    SHR S %CPU %MEM    TIME+ COMMAND
11694 ubuntu -11   0 175268 102424  92532 R 95.1  0.1 18:11.67 jackd
11942 ubuntu  20   0  60.3g 118964  96040 R  7.1  0.1  1:30.34 UdpDataProtocol
12117 ubuntu  20   0  60.3g 118964  96040 S  7.1  0.1  1:27.72 UdpDataProtocol
12207 ubuntu  20   0  60.3g 118964  96040 S  7.1  0.1  1:26.54 UdpDataProtocol
I received no errors from JackTrip, but jackd started recording frequent Xruns and other errors. The server’s CPU was also pretty heavily taxed overall:
load average: 64.61, 69.02, 55.29
  PID USER    PR  NI   VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
11696 ubuntu  20   0  60.3g 118308  96040 S 4297  0.1 721:01.47 jacktrip
11692 ubuntu  20   0 175268 102424  92532 S 97.4  0.1  18:47.87 jackd
On some of my test runs, the 500th client even failed to connect. I would consider this to be the upper bound; trying to push it any further would be unrealistically time consuming.
For comparison, here is how the resources looked with 400 clients:
load average: 36.83, 34.12, 33.19
  PID USER    PR  NI   VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
11696 ubuntu  20   0  57.9g 108280  95636 S 2939  0.1 171:40.36 jacktrip
11692 ubuntu  20   0 174476 101600  91740 S 72.5  0.1   5:17.36 jackd
With 400 clients, the jackd worker thread was very busy, but still had some room to go:
  PID USER    PR  NI   VIRT    RES    SHR S %CPU %MEM    TIME+ COMMAND
11694 ubuntu -11   0 174476 101632  91740 R 75.9  0.1 19:15.28 jackd
13981 ubuntu  20   0  57.9g 125236  95644 S  7.2  0.1  1:13.20 UdpDataProtocol
14059 ubuntu  20   0  57.9g 125236  95644 S  7.2  0.1  1:12.51 UdpDataProtocol
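As a back-of-the-envelope check on the jackd thread figures (75.9% CPU at 400 clients, 95.1% at 500), the sketch below extrapolates linearly to the point where that one thread would hit 100%. Linear growth between just two samples is an assumption, and the function name is illustrative, so this is a rough sanity check and not a measurement.

```python
def clients_at_saturation(sample_a, sample_b, ceiling=100.0):
    """Linearly extrapolate the client count at which a thread reaches `ceiling` % CPU.

    Each sample is a (clients, cpu_percent) pair taken from top.
    """
    (n1, cpu1), (n2, cpu2) = sample_a, sample_b
    if n1 == n2 or cpu1 == cpu2:
        raise ValueError("samples must differ in both client count and CPU usage")
    slope = (cpu2 - cpu1) / (n2 - n1)  # % CPU added per extra client
    return n1 + (ceiling - cpu1) / slope


if __name__ == "__main__":
    # jackd worker thread readings quoted in this article.
    estimate = clients_at_saturation((400, 75.9), (500, 95.1))
    print(f"jackd thread would saturate at roughly {estimate:.0f} clients")
```

The estimate lands only a little above 500, which fits with the 500th client sometimes failing to connect and with Xruns appearing well before the thread is fully saturated.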
I expect that with appropriate tuning, a single 96 vCPU JackTrip server can successfully handle up to around 400 live participants. Beyond that, you are pushing the capabilities of both jackd and the system itself. This is an order of magnitude more than what Jamulus can handle, and more than enough even for large musical ensembles.