Direct IO for predictable performance

#direct-io #rust

Almost everything I know about direct IO, I learned from the foyer project.

Direct IO is widely used in database and storage systems: it bypasses the page cache so the system can manage memory by itself for better, more predictable performance.

Nevertheless, for the majority of systems that are not CPU or I/O bound, buffered IO is good enough.

So direct IO is not suitable for most systems, but today I want to introduce it for a big data shuffle system. Let me tell you why and how to do it!

Motivation

The discussion in https://github.com/zuston/R1/pull/19 describes the problem of buffered IO for Apache Uniffle.

Without direct IO, performance is unstable when the page cache is flushed back to disk under tight system memory, which makes the latency of the RPCs that fetch data from memory/local disk spike.

Actually, for shuffle-based storage the blocks flushed to disk are large (around 128M), and Uniffle already uses a large amount of memory to cache data for speed. From this point of view, there is no need for the system page cache.

Requirements

  1. Enough memory to maintain an aligned memory block pool for your system
  2. Some Rust knowledge
  3. Large enough data (MB+) being written to disk

Just do it

The direct flag in the Rust standard library

This article is scoped to Linux only. We can activate direct IO directly from the standard library. The code is as follows.

Write

let path = "/tmp/";
let mut opts = OpenOptions::new();  
opts.create(true).write(true);  
#[cfg(target_os = "linux")]  
{  
    use std::os::unix::fs::OpenOptionsExt;  
    opts.custom_flagsO_DIRECT;
}
let file = opts.open(path)?;
file.write_at(data, offset)?;
file.sync_all()?;

Read

use std::fs::File;

// For a true direct IO read, open the file with the same O_DIRECT
// OpenOptions as in the write example (with .read(true) instead of .write(true)).
let file = File::open(path)?;

#[cfg(target_family = "unix")]
use std::os::unix::fs::FileExt;

// read_buf is the (4K-aligned) buffer that receives the data
// read_offset is the starting position of the read (a multiple of 4K)
// read_size is the number of bytes actually read
let read_size = file.read_at(&mut read_buf[..], read_offset)?;

That's all, so easy, right?

4K alignment for writes and reads

There is a key point not shown in the code examples above: every read/write buffer must be 4K aligned. That means a 16K buffer is legal, but a 15K buffer is not, because the disk hardware layer only accepts aligned buffers (and aligned offsets).

So if you want to write 1K of data to disk, you should do the following (see the sketch after this list):

  1. Create a 4K buffer placed in a contiguous, aligned memory region
  2. Fill the 1K of data into the 4K buffer starting at position 0
  3. Use the write code above to write the whole 4K buffer to disk
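To make these steps concrete, here is a minimal sketch of writing 1K through a 4K-aligned buffer, using a small wrapper around std::alloc. The names AlignedBuf and write_small are hypothetical helpers for illustration, not APIs from foyer or riffle; the later sketches in this post reuse AlignedBuf and ALIGN.

use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::fs::File;
use std::os::unix::fs::FileExt;

const ALIGN: usize = 4096;

// Hypothetical helper type: a zeroed buffer whose address and length are
// both multiples of 4K, freed automatically on drop.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(size: usize) -> Self {
        assert!(size % ALIGN == 0);
        let layout = Layout::from_size_align(size, ALIGN).unwrap();
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "aligned allocation failed");
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

// Writing 1K of data: pad it into one aligned 4K block (steps 1-3 above).
fn write_small(file: &File, payload: &[u8]) -> std::io::Result<()> {
    assert!(payload.len() <= ALIGN);
    let mut buf = AlignedBuf::new(ALIGN);
    // Fill the payload from position 0; the rest stays zero padding.
    buf.as_mut_slice()[..payload.len()].copy_from_slice(payload);
    // Write the whole aligned 4K block at an aligned offset (0 here).
    file.write_at(buf.as_mut_slice(), 0)?;
    Ok(())
}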

Handling writes and reads of different sizes

Here is how to handle 4K alignment for the different operations; a generic read sketch follows these cases.

Read with 5K (>4K)

  1. Create an 8K buffer
  2. Read into it
  3. Drop the tail 3K before returning the data to the caller

Read with 3K

  1. Create a 4K buffer
  2. Read into it
  3. Drop the tail 1K

Write with 5K

  1. Create an 8K buffer and fill the 5K of data from position 0
  2. Write the whole buffer to disk

Attention: when reading, you must filter out the extra padding data; that is the cost of alignment.
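To cover both read cases generically, here is a sketch that rounds the requested length up to the next 4K multiple, reads into an aligned buffer, and drops the padding before returning. It reuses the hypothetical AlignedBuf helper and ALIGN constant from the previous sketch; align_up and read_unaligned_len are also hypothetical names.

// Round a length up to the next multiple of 4K (5K -> 8K, 3K -> 4K).
fn align_up(len: usize) -> usize {
    (len + ALIGN - 1) / ALIGN * ALIGN
}

// Read `len` bytes starting at `offset` (which must itself be a multiple
// of 4K), hiding the alignment details from the caller.
fn read_unaligned_len(file: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = AlignedBuf::new(align_up(len));
    let n = file.read_at(buf.as_mut_slice(), offset)?;
    // Drop the tail padding before returning the data to the caller.
    Ok(buf.as_mut_slice()[..n.min(len)].to_vec())
}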

Appending 5K multiple times

For the first round, when the file does not have any data yet:

  1. Create an 8K buffer and fill the 5K of data from position 0
  2. Write the data to disk

Pasted image 20250214161149.png

For the second round of 5K, when the file already has data (see the sketch after these steps):

Pasted image 20250214160559.png

  1. Read the tail 1K (5K - 4K) of data from disk (written by the previous round)
  2. Step 1's 1K plus the new 5K gives 6K; aligning 6K up to 4K requires a 4K * 2 = 8K buffer, so create it
  3. Fill the buffer with the old 1K tail followed by the new 5K and write it back starting at the last aligned offset
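Here is a sketch of this second-round append, reusing the hypothetical AlignedBuf and align_up helpers from above. It assumes the file currently holds 5K of logical data (one full 4K block plus a 1K tail) and that the previous round physically wrote a padded 8K block, so reading the last 4K block back is safe. append_aligned is a hypothetical name.

// Append `payload` to a file whose current logical data length is `file_len`
// (5K after the first round). Rewrites the last partial block plus the new data.
fn append_aligned(file: &File, file_len: u64, payload: &[u8]) -> std::io::Result<()> {
    let aligned_offset = file_len / ALIGN as u64 * ALIGN as u64; // 5K -> 4K
    let tail_len = (file_len - aligned_offset) as usize;         // 1K of old data
    let total = tail_len + payload.len();                        // 1K + 5K = 6K
    let mut buf = AlignedBuf::new(align_up(total));              // rounds up to 8K

    // Step 1: read back the last partial block written by the previous round.
    if tail_len > 0 {
        file.read_at(&mut buf.as_mut_slice()[..ALIGN], aligned_offset)?;
    }
    // Step 2: place the new payload right after the old 1K tail.
    buf.as_mut_slice()[tail_len..total].copy_from_slice(payload);
    // Step 3: rewrite the whole aligned region starting at the last aligned offset.
    file.write_at(buf.as_mut_slice(), aligned_offset)?;
    Ok(())
}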

Aligned memory buffer pool

From the examples above, we know we have to request contiguous, aligned memory regions. If the machine does not have enough free memory, page faults will occur.

For direct IO we have to manage the aligned buffers and disk blocks ourselves. If you do not use an aligned buffer pool, page faults will occur frequently, which also drives the system load up (this shows up in the load metrics screenshot below).

For the riffle project, I created 4 * 1024 / 16 = 256 buffers in total, each of 16M, for 4G of pooled memory.
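Here is a minimal sketch of such a pool, reusing the hypothetical AlignedBuf type: a fixed set of aligned buffers is allocated once up front and recycled, so the hot flush path never allocates. AlignedBufferPool is a hypothetical name and this only illustrates the idea, not riffle's actual implementation.

use std::sync::Mutex;

// The raw pointer inside AlignedBuf is an owned allocation, so it is safe
// to move buffers between threads.
unsafe impl Send for AlignedBuf {}

struct AlignedBufferPool {
    free: Mutex<Vec<AlignedBuf>>,
}

impl AlignedBufferPool {
    // e.g. AlignedBufferPool::new(256, 16 * 1024 * 1024) for 4G of 16M buffers
    fn new(count: usize, buf_size: usize) -> Self {
        let free = (0..count).map(|_| AlignedBuf::new(buf_size)).collect();
        Self { free: Mutex::new(free) }
    }

    // Hand out a pre-allocated aligned buffer; no allocation on the hot path.
    fn acquire(&self) -> Option<AlignedBuf> {
        self.free.lock().unwrap().pop()
    }

    // Give the buffer back for reuse once the flush has completed.
    fn release(&self, buf: AlignedBuf) {
        self.free.lock().unwrap().push(buf);
    }
}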

Let's look at the machine metrics with and without the buffer pool.

Pasted image 20250214161709.png
The CPU system and user load is lower than the previous load without the buffer pool, because the previous case spent too much time requesting memory.

Small but effective optimization trick

When flushing 10G of data into a file, you can split the operation into multiple append operations to avoid requesting a huge contiguous memory region, which would burden the system and slow the service down (a sketch follows).
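Here is a sketch of that splitting, reusing the hypothetical append_aligned helper from the append section, with a 16M chunk size matching the pool's buffer size (an assumption for illustration, not a measured choice): the flush walks the payload in chunks instead of allocating one huge aligned region.

const CHUNK: usize = 16 * 1024 * 1024; // one pool buffer worth of data

// Flush a large payload as a series of smaller appends so that no single
// huge contiguous aligned allocation is needed. Returns the new logical length.
fn flush_in_chunks(file: &File, mut file_len: u64, payload: &[u8]) -> std::io::Result<u64> {
    for chunk in payload.chunks(CHUNK) {
        append_aligned(file, file_len, chunk)?;
        file_len += chunk.len() as u64;
    }
    Ok(file_len)
}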


Performance

The RPC latency of getting data from local disk

Pasted image 20250214151340.png
Pasted image 20250214151346.png

The RPC latency of sending data to memory

Thanks to the lower system load, the latency of sending data with direct IO is even lower than with buffered IO.

Pasted image 20250214151530.png
Pasted image 20250214151536.png

Now, it looks great!