Structural changes for +48-89% throughput in a Rust web service

Optimizing algorithms can be demanding work: it requires a fine-grained understanding of what the processor is doing, many repeated measure-and-adjust iterations, a theoretical grounding in computer science and mathematics, and specialized tooling together with the knowledge of how to use it.

This article is about something completely different. We will take an example web app and – without ever touching the algorithm or tinkering with the internals of the data set – will restructure the way the application is put together to yield significant gains in performance.

The goal here is to show that we do not always have to dig deep into the bits and branches to improve performance. We may instead be able to achieve significant gains by adjusting how the “boxes and arrows” level design is put together. An application is a combination of components and how they work together matters.

Let’s dive in and take a journey through applying several optimization iterations to a sample app, leading us to significant throughput gains in the end.

Note: the code snippets in this article are extracts that showcase only the essentials. The complete repository together with benchmark scripts is available at https://github.com/sandersaares/forbidden-text-check.

Example scenario

Our example app is a typical Axum + Tokio stack Rust web service with a particularly data-heavy workload: it receives a POST request and attempts to determine whether the request body matches a large in-memory data set.

The algorithm is very simple: check if the request body starts with any of the strings in our data set. In our benchmark scenario this comparison will never match, so every request will search the entire data set, giving us consistent behavior under benchmarking.

pub fn is_forbidden_text_static(s: &str) -> bool {
    FORBIDDEN_TEXTS
        .iter()
        .any(|candidate| s.starts_with(candidate))
}

The data set is just a Vec<String> stored in a static variable, with each entry being a string of length 640 and the total data set size being around 2.5 GB.

pub static FORBIDDEN_TEXTS: LazyLock<Vec<String>> = LazyLock::new(generate_forbidden_texts);
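
For a sense of scale, here is a sketch of what a generator for such a data set could look like. The real generate_forbidden_texts() lives in the repository; the constants below are merely back-calculated from the numbers above (640-character entries, roughly 2.5 GB in total, which works out to about four million entries).

fn generate_forbidden_texts() -> Vec<String> {
    const ENTRY_LEN: usize = 640;
    const ENTRY_COUNT: usize = 4_000_000; // 4M entries * 640 bytes ≈ 2.5 GB of string payload

    (0..ENTRY_COUNT)
        .map(|i| {
            // Deterministic filler keeps the data set reproducible between runs.
            let mut entry = format!("{i:010}-");
            entry.push_str(&"x".repeat(ENTRY_LEN - entry.len()));
            entry
        })
        .collect()
}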

The example logic is completely pointless in terms of implementing something useful; it is merely a stand-in for “some algorithm that does a lot of looking up things in memory”. You could come up with a much better algorithm for the simple job it does but this article is not about algorithm optimization – we can assume that in a real world scenario the algorithm would already be “properly optimized” and all these comparisons and memory accesses are necessary for it to do its job.

As with any performance optimization, the results will vary for different workloads. The approach showcased in this article is targeting “tall” web services that have large in-memory data sets and host their instances on many-processor servers (100+ processors), though some of the individual techniques showcased may also have value in services that rely on many small instances and scale horizontally.

Measuring performance

Benchmark design is a vast topic of its own, so we will merely describe the essential facts in this article. The target will be benchmarked using the excellent k6 benchmarking tool, using a benchmark script that merely POSTs a constant string to the service:

import http from 'k6/http';

export default function () {
    let params = {
        timeout: "10000ms",
    };
    let payload = '.. elided for snippet brevity ..';
    http.post('http://10.0.0.8:1234/check', payload, params);
}

Each benchmark run will last for 250 seconds, which should be enough to ensure reproducible and stable results. We will observe the average throughput over this period, as well as the P95 request latency. Before each benchmark run, we will execute the same script for 10-30 seconds and throw away the result, to warm up the target service and initialize the in-memory data sets.

Our algorithm is pure compute with no I/O, so we will use three k6 load generators per processor (CPU core) in the target system. Assuming one worker thread per processor, this means there will generally be one request being processed by each worker, with up to two additional requests waiting in the queue.

Keeping requests queued up ensures that we never leave a worker without work to do while also allowing us to measure whether all workers are processing requests at a consistent rate. The P95 request latency metric will tell us how well balanced our queues are – large P95 latencies will indicate that some queues are slowing down and not keeping up as well as others.

For reference, the command used to start a benchmark run is:

k6 run --vus <3 * target_processors> --duration 250s script_name.js

The author’s test environment has 176 processors, so the --vus parameter was set to 528. Adjust to match your target system’s processor count when you run the benchmark.

Starting point

We create a minimal Axum based web service as our first implementation:

#[tokio::main]
async fn main() {
    let app = Router::new().route("/check", post(check));
    let addr = SocketAddr::from(([0, 0, 0, 0], 1234));
    println!("Server running on http://{}", addr);

    let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
    axum::serve(listener, app.into_make_service())
        .await
        .unwrap();
}

async fn check(body: String) -> impl IntoResponse {
    if is_forbidden_text_static(&body) {
        (StatusCode::OK, "true")
    } else {
        (StatusCode::OK, "false")
    }
}

Benchmarking the initial version on Windows Server 2022 gives us the following result:

                    Req/s   P95 ms
Initial version     544     2450

The exact number here does not really matter – we want to increase throughput and decrease latency no matter what the numbers are. Let’s get to it.

From multithreading to many-threading

A typical Rust web app is hosted on a Tokio async task runtime that manages one worker thread for each processor in the system. On top of these is a web application framework like Axum, which manages a router, holds shared application state and uses Tokio to dispatch request processing work to request handlers created by the service author.
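
For context, this is roughly what the default #[tokio::main] setup gives you – a sketch, not the literal macro expansion: one shared multi-threaded runtime whose worker threads all cooperate on the same task scheduler and I/O driver.

fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all() // timer + I/O drivers, shared by every worker thread
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async {
        // The single shared Axum app from the "Starting point" section is served here.
    });
}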

One Tokio. One Axum app, typically hosting some shared state. 100+ processors. There are two problems with this picture:

  1. The different components of Tokio on different threads must collaborate and coordinate their work. Unfortunately, logic that scales well to 10 processors is very different from logic that scales well to 100+ processors, so this coordination overhead becomes noticeable at scale.
  2. Even though Axum itself does not internally rely on multithreaded collaboration to a significant degree, the standard app construction patterns encouraged by Axum often lead to heavy use of shared global state by services and middleware components (see the sketch after this list). Sometimes the shared state is justified but often there are alternative approaches available.
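
To make the second point concrete, here is a sketch of the standard shared-state pattern – a single state object behind an Arc, handed to every request handler on every worker thread. Our example service does not use any of this; it is shown only to illustrate the problem.

use std::sync::{Arc, Mutex};

use axum::{extract::State, routing::post, Router};

#[derive(Default)]
struct AppState {
    // Any lock or atomic in here is shared by all worker threads and can become contended.
    request_count: Mutex<u64>,
}

async fn check(State(state): State<Arc<AppState>>, body: String) -> &'static str {
    *state.request_count.lock().unwrap() += 1;
    if body.is_empty() { "false" } else { "true" }
}

fn make_app() -> Router {
    Router::new()
        .route("/check", post(check))
        .with_state(Arc::new(AppState::default()))
}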

Both factors can introduce significant overhead, which we want to eliminate. We will ignore the second factor in this article because the logic in our example app is so trivial that it is already thread-isolated and does not rely on shared state. Refactoring the service logic itself to make use of thread-isolated design with no shared state is a topic of massive depth and breadth with potentially huge payoff; it may one day get its own series of articles – leave a comment if you have specific scenarios in mind that you would like to see covered.

What we can easily do, however, is get rid of the Tokio multithreading overhead by restructuring the service to consist of independent threads of execution. While the request handler logic in our example app (the green box) is already completely thread-isolated, the components leading up to it are not. There are two paths toward a thread-isolated future here.

If we were to be ambitious, there exist web frameworks like ntex that are designed specifically with thread-isolation in mind, and which achieve great results by making thread isolation a core element of their function and design from top to bottom. However, because of their highly optimized design such special-purpose frameworks are generally incompatible with the rest of the Rust ecosystem – you cannot use many existing crates in your app when using a web framework that enforces thread isolation.

We will take a more conservative and broadly applicable path in this article by sticking to Tokio and Axum but restructuring how they are put together, to yield some of the performance gains while still maintaining ecosystem compatibility. The key is to disable the multithreading capabilities in the framework components while still retaining the core web service functionality.

First, we create our own thread pool instead of allowing Tokio to manage the threads. Then we create a single-threaded Tokio runtime on each of these worker threads and give it control over the thread.

async fn main() {
    let num_workers = num_cpus::get();
    let mut work_txs = Vec::with_capacity(num_workers);

    for _ in 0..num_workers {
        const WORKER_QUEUE_SIZE: usize = 4;
        let (tx, rx) = channel(WORKER_QUEUE_SIZE);
        work_txs.push(tx);

        thread::spawn(move || worker_entrypoint(rx));
    }

    listener_entrypoint(addr, work_txs).await;
}

We leave the main() entrypoint thread to accept incoming connections and hand them off to workers. Meanwhile, each worker creates an entirely independent Axum app on thread startup and uses it to serve the connections assigned to that worker, creating a connection-specific Axum service for every accepted connection (the same thing Axum does internally in its default configuration).

fn worker_entrypoint(mut rx: Receiver<TcpStream>) {
    let runtime = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    runtime.block_on(async move {
        // We build a new Axum app on every worker, ensuring that workers are independent.
        let app = Router::new().route("/check", post(check));
        let service_factory = app.into_make_service_with_connect_info::<SocketAddr>();

        while let Some(stream) = rx.recv().await {
            let peer_addr = stream.peer_addr().unwrap();

            // For each connection, we spawn a new task to handle it.
            tokio::spawn({
                let mut service_factory = service_factory.clone();

                async move {
                    let service = service_factory.call(peer_addr).await.unwrap();
                    let hyper_service = TowerToHyperService::new(service);
                    let http = hyper::server::conn::http1::Builder::new();

                    // We do not care if the request handling succeeds or fails, so ignore result.
                    _ = http
                        .serve_connection(TokioIo::new(stream), hyper_service)
                        .await;
                }
            });
        }
    });
}

Each thread now uses its own isolated instance of Tokio and Axum. By doing so we have eliminated a substantial part of the multithreading coordination overhead from the frameworks. Just a part, not all – to truly eliminate it all would require deep changes such as completely removing the use of shared primitives like Arc and Mutex, which is impractical without rewriting the frameworks (and is how competing frameworks like ntex achieve their high efficiency).

All the parts marked green on the diagram are now thread-isolated – even if they use synchronization primitives internally, there is never contention with other threads, which is already a big win. If cross-thread coordination is explicitly required, this can be accomplished by setting it up via collaboration of main() and the worker entry points. In our example service, there is no need for this.
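
As an illustration only (none of this exists in the example service), such coordination could look like main() handing each worker a clone of a channel sender at spawn time, with main() – or a dedicated collector thread – aggregating whatever the workers report.

use std::sync::mpsc;
use std::thread;

fn main() {
    let num_workers: u64 = 4; // in the real service this comes from the processor count
    let (stats_tx, stats_rx) = mpsc::channel::<u64>();

    for worker_index in 0..num_workers {
        let stats_tx = stats_tx.clone();
        thread::spawn(move || {
            // A real worker_entrypoint() would take the sender as a parameter and report
            // periodically; here each worker just sends one value and exits.
            stats_tx.send(worker_index).unwrap();
        });
    }
    drop(stats_tx); // drop the original sender so the channel closes once workers finish

    let total: u64 = stats_rx.iter().sum();
    println!("total reported by workers: {total}");
}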

Full code of restructured app: forbidden-text-check/examples/03_per_processor.rs at main · sandersaares/forbidden-text-check · GitHub

What do we gain for all our effort?

                    Req/s        P95 ms
Initial version     544          2450
Thread-isolated     614 (+13%)   1000 (-59%)

We gained a fair bit of throughput and lost a lot of latency, indicating that requests now get processed more consistently and spend less time queued. Both throughput and latency can be performance-critical so this is a great start.

It is important to note that these numbers are extremely workload- and environment-dependent. Our example service is a pure compute workload, which means we never spawn or await other tasks – every request is one task and that is it. Therefore, the number of active tasks is very low, which simplifies work for Tokio and reduces its overheads. Different kinds of services with heavy task-management workloads and a larger number of smaller tasks (e.g. HTTP client tasks for talking to external services) may experience even higher Tokio multithreading overheads and see greater gains from a move toward thread isolation.

The benchmarking so far has been performed on Windows Server 2022. There is quite a lot of platform-specific logic in the depth of the Tokio I/O stack, so it is worth also running the benchmarks against a Linux target. Let’s try Ubuntu 24 on the same hardware.

                    Req/s (Win)    P95 ms (Win)   Req/s (Linux)  P95 ms (Linux)
Initial version     544            2450           1842           637
Thread-isolated     614 (+13%)     1000 (-59%)    1839 (-0%)     305 (-52%)

The relative gain in throughput disappears on Linux. The likely reason for this is that our example service performs minimal task management and most of the work that Tokio performs is related to I/O functionality. Tokio is known to have a fairly inefficient I/O stack on Windows, which explains why the throughput gains for our simple example app are only visible on Windows. Other workloads may also see gains on Linux here. The latency improvement remains present on both platforms, which is a good sign that we are heading in the right direction.

Hey, that is a different number of digits!

Yes, you may have noticed that the absolute numbers for Windows do not compare favorably to Linux. Yet the exact same hardware is used! This is something we need to look into deeper before we go further with our optimizations.

Our initial version was very simple and based on trivial “how to make an Axum web app” samples easily found online. There can be no fundamental service design flaw here. How could it possibly underperform so much? Our workload is almost entirely pure string comparison code – such simple logic cannot just be magically 3x slower on Windows!

There are different techniques one might use here to investigate the problem. For example, applying profilers such as VTune, trying different compiler options, checking for presence of Windows antivirus and/or enterprise monitoring tools that may slow things down, and more.

However, we will start with a much simpler technique: let’s open Task Manager while the benchmark script is running.

Oh. On Windows, the service is only using 25% of our processors, leaving the others idle! A quick check shows 100% utilization on Linux, so this must be the problem.

For some background knowledge, understand that the support for many-processor machines in the Windows API has gone through at least three generations over the past 25 years:

  1. At first, there was just a 64-bit field for processor affinity and that’s all you had to work with. Have more than 64 processors on your system? Tough luck – each process could only access 64 at most. At least it could be a different set of up to 64 for each – Windows internally divided processors into processor groups, one of which was assigned to each process.
  2. A couple decades ago, Windows APIs started to support having the caller specify the processor group as well as allowing a single process to span multiple processor groups (to various degrees in different Windows versions). There was a separate 64-bit affinity mask for each processor group, so if there were more than 64 processors, a sufficiently smart app could also specify the group number when talking to Windows, though you could still only specify a single processor group when configuring affinity.
  3. The modern Windows API is based on “CPU sets” which are a fully flexible mechanism. Among other things, every process or thread can now be bound to any set of processors, giving complete freedom of choice to apps that know to use this API.

The root cause of our under-utilization problem is that Tokio uses the num_cpus crate to identify how many processors are present on the system and this crate uses the 1st generation of Windows APIs. This means it can only ever see a single processor group of up to 64 processors.

Our target system has 176 processors split into 4 groups of 44, so accordingly, Tokio (via num_cpus) believes that there are only 44 processors on this system and only starts 44 worker threads!

let num_workers = num_cpus::get();

When we implemented our own worker pool creation logic, we copied the Tokio logic of using num_cpus to identify the correct processor count. Let’s switch to the many_cpus crate, which can give us the correct number of processors.

let processors = ProcessorSet::default();
let num_workers = processors.len();
let mut work_txs = Vec::with_capacity(num_workers);

for _ in 0..num_workers {
    const WORKER_QUEUE_SIZE: usize = 4;
    let (tx, rx) = channel(WORKER_QUEUE_SIZE);
    work_txs.push(tx);

    // In each loop iteration, we spawn a new worker thread that the OS is allowed to assign
    // to any of the processors in the set to balance load among them. This is almost entirely
    // equivalent to `thread::spawn()`, except by using `ProcessorSet::default()` we ensure that
    // all processors are made available for these threads on Windows, even on many-processor
    // systems with multiple processor groups where threads can otherwise be limited to only
    // one processor group.
    processors.spawn_thread(move |_| worker_entrypoint(rx));
}

A quick look at Task Manager again – did anything change?

Yes, now we are getting our money’s worth! The results look much better:

                    Req/s (Win)    P95 ms (Win)   Req/s (Linux)  P95 ms (Linux)
Initial version     544            2450           1842           637
Thread-isolated     614 (+13%)     1000 (-59%)    1839 (-0%)     305 (-52%)
many_cpus           1192 (+119%)   792 (-68%)     1864 (+1%)     313 (-51%)

That’s more like it!

Or, well, is it really?

Leaving aside the still-present difference between Windows and Linux, why did throughput not rise 4x when we used 4x as many processors? Only doubling the throughput after essentially spending 4x the money on processors is not a good return on investment.

On the surface of it, this smells like a typical previous-generation “only scales up to 20 processors” algorithm is at fault but how could it be? After all, this is the code where we already enforced a high degree of thread isolation, so there should not be any synchronized work going on between the worker threads to slow them down!

A quick check for comparison on Linux, limiting the app to only 25% of processors, indicates the same problem there – the throughput does not scale linearly with the number of processors, even though it should! Time to put our thinking hat on.

Where is the data set?

Let’s consider what might be behind the mediocre scaling relative to hardware resources. One aspect that characterizes our sample scenario is that the app works with a fairly large in-memory data set – that multi-gigabyte Vec<String> we are searching through for every request. Where is it, though?

pub static FORBIDDEN_TEXTS: LazyLock<Vec<String>> = LazyLock::new(generate_forbidden_texts);

Yes, okay – it is in that static variable. Fair enough. We can also say that this static variable is just a reference to an array of String, each of which is a reference to an array of bytes, and that each of these arrays is an allocation somewhere on the heap.

While memory allocators are a fascinating topic, they are not relevant to our scenario because our data set is static – even if we cared about memory allocation performance, there is no significant allocation activity happening during our benchmark runs (relatively speaking – one could always do better but this is not where the bone is buried).

On a more fundamental level, where is this data? It is in one (or more) of these:

The data is in physical memory modules. A piece of silicon, copper and plastic. Some of the data may also be in processor caches but the majority is in main memory because the data set is simply too large to fit in processor caches.

This memory module is physically connected to a group of processors. Depending on the hardware architecture, this group is either an entire processor package (the thing that plugs into a processor socket) or a core complex (CCX, a subdivision of a processor package). The target system we are using here has four CCXs, each connected to 25% of main memory. Another way to state this is that there are four memory regions.

A processor only has direct access to main memory connected to its package/CCX (i.e. within the same memory region). If it wants to read from or write to memory modules that are not directly connected to it, it must do this by passing this request to some other processor that is directly connected. This message-passing is relatively expensive and requires synchronization between processors. It is both slower (yielding higher access latency) and offers less total throughput, compared to accessing directly connected memory.

Typical cost differences when accessing main memory within and across memory regions (ballpark figures, reality is completely hardware-dependent):

             In-region   Cross-region
Latency      100 ns      200 ns
Throughput   200 GBps    100 GBps

So, where is our Vec<String> within these physical memory modules? It does not matter! Whatever the answer may be, our data set cannot be directly accessible by all the processors at the same time.

This is why we did not see a 4x increase in throughput – regardless of where the data is, 75% of the memory accesses must cross a memory region boundary and therefore suffer reduced performance. Only the processors that are in the same memory region as the data can process requests at full speed.
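
A back-of-the-envelope check using the ballpark figures from the table above makes the same point numerically. This is a simplification – it ignores contention on the cross-region interconnect itself, so reality is somewhat worse.

fn main() {
    let regions = 4.0_f64;
    let local_gbps = 200.0;
    let remote_gbps = 100.0;

    // With the data set resident in a single region, 1 in 4 processors reads it locally
    // and 3 in 4 read it across a region boundary.
    let blended_gbps = (1.0 / regions) * local_gbps + ((regions - 1.0) / regions) * remote_gbps;
    println!("blended read throughput ~{blended_gbps} GBps vs {local_gbps} GBps if every read were local");
}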

The solution is simple: we can copy the data into every memory region and put that expensive hardware to good use – unused memory is wasted memory!

That is easier said than done, though. There is no memory_region parameter in Clone::clone(&self) – how are we meant to do this, exactly?

We will rely on a 3rd party crate here and use the region_cached! macro from the region_cached crate. This macro enhances a static variable with region-locality powers – when writing, it acts as a global variable but when reading, the data always comes from the same memory region.

region_cached! {
    pub static FORBIDDEN_TEXTS_REGION_CACHED: Vec<String> = generate_forbidden_texts();
}

pub fn is_forbidden_text_region_cached(s: &str) -> bool {
    FORBIDDEN_TEXTS_REGION_CACHED
        .with_cached(|texts| texts.iter().any(|candidate| s.starts_with(candidate)))
}

There is also an equivalent region_local! macro in the region_local crate for cases where you do not want updates to be distributed to all memory regions – it is functionally very similar to the thread_local! macro.

The results are mixed, though:

                    Req/s (Win)    P95 ms (Win)   Req/s (Linux)  P95 ms (Linux)
Initial version     544            2450           1842           637
Thread-isolated     614 (+13%)     1000 (-59%)    1839 (-0%)     305 (-52%)
many_cpus           1192 (+119%)   792 (-68%)     1864 (+1%)     313 (-51%)
region_cached       784 (+44%)     1680 (-31%)    2700 (+47%)    213 (-67%)

Linux sees an immediate improvement in both metrics but Windows suffers a regression. Our change was supposed to make things better but something clearly went wrong there.

The Windows memory manager can be picky

To understand what is happening, we will first verify some assumptions. Namely, is the region_cached! macro really distributing our data to different memory regions?

On Linux, you can verify this as follows:

$ numactl --hardware | grep "free"

node 0 free: 191083 MB
node 1 free: 191971 MB
node 2 free: 191909 MB
node 3 free: 191850 MB

Comparing these numbers before and during a benchmark run shows us that the Linux app is correctly copying the data into all the memory regions – available memory goes down on all rows.

On Windows, the most foolproof approach is to open Performance Monitor and add counters from the “NUMA Node Memory” object onto the graph (a NUMA node is the same as a memory region). You may need to fiddle with the scale of the graph to get it to show up in a useful way.

Each of those lines is the available memory in a memory region. We expect the lines to go down for all four memory regions. However, the graph shows that the Windows app is instead copying its data multiple times into the same memory region! That explains why Windows performance dropped – it is now scanning through 4x as much memory, reducing processor cache efficiency.

The operating system is responsible for assigning physical memory pages to allocated virtual memory. There are two ways to guide its decisions on which physical memory modules are used:

  1. A sufficiently smart memory allocator can use memory-region-aware operating system APIs to allocate memory in a specific memory region. Our app does not do this and neither does the region_cached crate because this requires a custom memory allocator, which is not practical as a general purpose approach in Rust – we need everything to work right even with the default Rust memory allocator.
  2. The operating system has some built-in logic to make the decision based on various signals from the process and operating environment. One of the signals it uses is “which memory region is the thread currently running in”.

We rely on the default behavior of the operating system, so there are two things that may be going wrong here:

  1. The data set for all regions may be getting initialized from threads running in a single memory region.
  2. There are some hidden extra variables at play which guide Windows memory manager decisions.

The first option can be quickly eliminated – the region_cached crate specifically checks what memory region we are running in and only ever initializes the current memory region to minimize the probability of any cross-region allocation. A second data point against “we are in the wrong region” being the cause is that the same logic works fine on Linux.

This leaves us with there being extra variables at play. To cut a long story short, experimental evidence suggests that Windows requires a thread to be pinned to a specific processor in a specific memory region before it allocates physical memory in that memory region.

Recall that while our example service does create its own threads, it assigns every thread to run on ProcessorSet::default(), which means the operating system is given complete freedom to choose which processor is used for which thread (and it may even change its mind at any point and reshuffle the threads). For whatever reason, this does not give the Windows memory manager the confidence that current-region allocation of physical pages is the right thing to do.

We need to change our worker thread setup so that each thread is pinned to a specific processor. This is quite simple to do because it is built-in functionality of the many_cpus crate: instead of looping through the processor count, we use ProcessorSet::spawn_threads() to create a separate thread pinned to each processor.

let processors = ProcessorSet::default();
let num_workers = processors.len();

const WORKER_QUEUE_SIZE: usize = 4;
let (txs, rxs) = (0..num_workers)
    .map(|_| channel(WORKER_QUEUE_SIZE))
    .unzip::<_, _, Vec<_>, Vec<_>>();

// Each worker thread will pop one Receiver out of this vector, so it needs to be shared.
let rxs = Arc::new(Mutex::new(rxs));

// This method will create one thread per processor and execute the callback on each of them.
// Every thread will be pinned to a specific processor, so the OS will not move them around.
processors.spawn_threads(move |_| {
    let rx = {
        let mut rxs = rxs.lock().unwrap();
        rxs.pop().unwrap()
    };

    worker_entrypoint(rx);
});

Pinning threads to specific processors is a valuable optimization technique even outside many-processor systems. Experiment and leave a comment with your specific scenario if you find that it helps you.
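
Outside of many-processor Windows machines with multiple processor groups, a plain affinity crate is often enough for pinning. Here is a minimal sketch using the widely used core_affinity crate (an assumption on my part – the code in this article uses ProcessorSet::spawn_threads() instead, which also handles processor groups):

use std::thread;

fn spawn_pinned_workers() -> Vec<thread::JoinHandle<()>> {
    core_affinity::get_core_ids()
        .unwrap_or_default()
        .into_iter()
        .map(|core_id| {
            thread::spawn(move || {
                // Pin the current thread to one specific processor; the OS will not migrate it.
                core_affinity::set_for_current(core_id);
                // ... the worker loop would run here ...
            })
        })
        .collect()
}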

Does the data now show proper use of all memory regions?

Indeed it does, and the benchmark results confirm great success:

                    Req/s (Win)    P95 ms (Win)   Req/s (Linux)  P95 ms (Linux)
Initial version     544 (-54%)     2450 (+209%)   1842           637
Thread-isolated     614 (-48%)     1000 (+26%)    1839 (-0%)     305 (-52%)
many_cpus           1192           792            1864 (+1%)     313 (-51%)
region_cached       784 (-34%)     1680 (+112%)   2700 (+47%)    213 (-67%)
Pinned threads      2251 (+89%)    245 (-69%)     2717 (+48%)    213 (-67%)

The Windows percentages in this table were adjusted to use many_cpus as the baseline, since counting the gain that comes simply from being able to use more processors would show an unfairly large improvement compared to Linux (+319%).

The final code is available here: forbidden-text-check/examples/06_region_cached_pinned.rs at main · sandersaares/forbidden-text-check · GitHub

It is true that the Windows results are still lagging behind Linux but that is a story for another day.

Applicability

Aside from fixing the Windows processor count oddity, this article covered three structural optimization techniques:

  1. Ensuring data locality in the same memory region.
  2. Thread isolation.
  3. Pinning threads to specific processors.

In this article, the major payoff for Linux came from ensuring data locality, while the rest mainly just benefitted Windows.

Nevertheless, all these techniques are valid in their own right, depending on the type of workload you are working with:

  • When there is a high degree of task management and I/O, thread-isolated design shows off its strengths by eliminating many synchronization overheads from the frameworks.
  • The greatest impact from thread-isolated design comes from ensuring that service logic itself is thread-isolated, though this can be difficult as it requires specific design patterns to be ergonomic. In our example scenario, we already started off with thread-isolated service logic, so this application of the technique was not showcased in this article.
  • Pinning threads to specific processors is not only relevant for data locality on Windows but can also help avoid cache thrashing when many workloads compete over the same processors.
  • Combining high I/O workloads and thread pinning can yield great results because I/O activity for a single file/socket is contained to a single thread, avoiding the cache thrashing that occurs if async I/O is completed on random processors in a multithreaded I/O driver.

The difficulty with many structural changes is that there is no single “bottleneck” – the impact is smeared across the app like peanut butter, gaining 0.5% here and 0.5% there. You can rarely tell from a profiler that you would benefit from such techniques – you just have to try them out to find out.

But not always so. The workload in our example scenario was deliberately chosen to be “obviously optimizable” in a measurable way. If we wanted to see some evidence “in advance”, what we could have done is the following:

  1. We extract the core logic (the "process request" part) into a separate binary that runs it in a loop for at least 15 seconds (a minimal sketch of such a harness follows after this list).
  2. On a physical Intel system, we execute this binary under the VTune profiler and apply its Microarchitecture Exploration analysis.
  3. We examine the report summary. The parts highlighted in red are typically the key.
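
Such a harness could look like the sketch below, assuming the library crate exposes the is_forbidden_text_static() function shown earlier (the crate path is assumed from the repository name):

use std::time::{Duration, Instant};

fn main() {
    // A payload that never matches anything in the data set, same as in the benchmark.
    let payload = "x".repeat(1024);

    let start = Instant::now();
    let mut iterations = 0_u64;

    while start.elapsed() < Duration::from_secs(20) {
        // black_box keeps the compiler from optimizing the check away.
        // The forbidden_text_check:: path is an assumption based on the repository name.
        std::hint::black_box(forbidden_text_check::is_forbidden_text_static(&payload));
        iterations += 1;
    }

    println!("completed {iterations} iterations in {:?}", start.elapsed());
}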

In our sample service, the core logic can be found in stand-alone form in examples/check_static.rs. Executing this under the VTune profiler gives us the following result:

Without even going into the details on the left, the big drawing on the right immediately tells us what we need to know. We observe that when executing the core logic of our service:

  • 70.0% of the time, the processor was doing nothing at all because it was waiting for data to be loaded from memory.
  • 12.9% of the time, the processor did some useful work.
  • 14.7% of the time, the processor could not do useful work due to an internal microarchitectural bottleneck (i.e. the code was of a kind the processor could not execute at full speed).

The 70% is our clue here – if an application is spending 70% of its time waiting for data to arrive from memory, it is going to see a huge performance impact from how fast that memory access is. This is why we saw a large performance increase when we switched to region-local caching of our data set.

References

GitHub – sandersaares/forbidden-text-check: Rust web service performance exploration

Optimizing applications for NUMA, Intel

Fix Performance Bottlenecks with Intel® VTune™ Profiler

Microarchitecture Exploration Analysis for Hardware Issues

Load testing for engineering teams | Grafana k6

region_cached – Rust

many_cpus – Rust

axum – Rust

Tokio – An asynchronous Rust runtime

Ntex

2 thoughts on "Structural changes for +48-89% throughput in a Rust web service"

  1. Fantastic analysis! Have you given any thought to doing a similar analysis with a thread per core model using io-uring/IoRing?


    1. Indeed, taking greater advantage of thread-isolated execution combined with high performance I/O APIs is a fascinating topic and there are significant gains to be made compared to the Rust status quo, especially on Windows where Tokio I/O efficiency is rather poor.

      I try to target my public writing toward generally applicable multiplatform approaches, which complicates things somewhat given that some of the more known existing implementations like monoio are Linux-only. That said, some more portable competition has appeared in recent years (e.g. Compio). Still, there are problems applying high performance I/O in practice – for example, the Hyper HTTP stack is very much designed for poll-based I/O APIs. Unfortunately the modern I/O APIs are not poll based, so there are some conceptual incompatibilities that need to be solved there to support high performance I/O without eating away most of the gains higher in the stack due to copying that results from this architectural mismatch.

      Essentially, I would much rather write about solutions than problems, so sharing more on this first requires more of these problems to be solved. The topic is definitely on my radar and I have some hope that the Rust ecosystem will evolve toward making these techniques more approachable in the future.

