
Performance Comparison of mmap() versus read() versus fread()

Original post is here eklausmeier.goip.de/blog/2016/02-03-performance-comparison-mmap-versus-read-versus-fread.


I recently read in Computers are fast! by Julia Evans about a comparison between fread() and mmap() suggesting that both calls deliver roughly the same performance. Unfortunately, the code referenced there, bytesum.c for fread() and bytesum_mmap.c for mmap(), does not really compare the same thing: the first adds up size_t values, the second adds up uint8_t values. On my computer these programs behave differently and therefore show different performance.

I reprogrammed the comparison, adding read() alongside fread() and mmap(). The code is on GitHub. Compile with

cc -Wall -O3 tbytesum1.c -o tbytesum1

For this program the results are as follows:

/home/klm/c: time ./tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.187s
user    0m0.077s
sys     0m0.110s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.193s
user    0m0.100s
sys     0m0.090s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.186s
user    0m0.080s
sys     0m0.103s
/home/klm/c: time tbytesum1 -r ~ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.196s
user    0m0.110s
sys     0m0.083s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.152s
user    0m0.110s
sys     0m0.040s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.159s
user    0m0.113s
sys     0m0.043s

The file in question is the Ubuntu ISO image for the server edition, roughly 564 MB, stored on a classical hard drive (Seagate 2TB drive). fread() and read() make no difference. The mmap()'ed version, however, needs roughly half the system time (83ms down to 40ms), which reduces the total running time by roughly 20% (186ms down to 152ms).

A similar test with a short video from an SSD:

/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.097s
user    0m0.033s
sys     0m0.063s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.097s
user    0m0.050s
sys     0m0.043s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.093s
user    0m0.050s
sys     0m0.040s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.098s
user    0m0.043s
sys     0m0.053s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.079s
user    0m0.050s
sys     0m0.027s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.084s
user    0m0.050s
sys     0m0.030s

The AVI-file is roughly 259 MB. Again, fread() and read() don't differ, but mmap() is roughly 30% faster system-time-wise.

These tests were conducted on a 4.3.3-3-ARCH x86_64 system utilizing an AMD FX-8120 Eight-Core processor running up to 3.1 GHz. gcc used for compiling was 5.3.0.

Linus Torvalds gave the following remarks on read() versus mmap() in mmap/mlock performance versus read:

People love `mmap()` and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap():

  1. quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
  2. page faulting is expensive. That's how the mapping gets populated, and it's quite slow.

Upsides of mmap():

  1. if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
    This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
  2. if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.

And the automatic sharing is obviously a case of this..

It is interesting to note that the performance is kernel-dependent. The same tests conducted on Ubuntu 14.04.3 LTS, kernel version 3.13.0-74-generic #118-Ubuntu SMP, x86_64, on the exact same hardware, give a more blurred view:

/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.191s
user    0m0.072s
sys     0m0.119s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.177s
user    0m0.073s
sys     0m0.104s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.187s
user    0m0.092s
sys     0m0.095s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.181s
user    0m0.077s
sys     0m0.104s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.174s
user    0m0.104s
sys     0m0.072s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.175s
user    0m0.102s
sys     0m0.071s

Again, read() and fread() make no real difference. The difference in system-time between read() and mmap() is 25% (95ms down to 71ms). Compiler was gcc 4.8.4.

Testing a file on SSD:

root@chieftec:~# time ~klm/c/tbytesum1 -r CLIP0627.AVI 
The answer is: -122687217

real    0m0.098s
user    0m0.039s
sys     0m0.059s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI 
The answer is: -122687217

real    0m0.093s
user    0m0.047s
sys     0m0.047s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI 
The answer is: -122687217

real    0m0.092s
user    0m0.036s
sys     0m0.056s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI 
The answer is: -122687217

real    0m0.087s
user    0m0.040s
sys     0m0.047s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI 
The answer is: -122687217

real    0m0.091s
user    0m0.047s
sys     0m0.043s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI 
The answer is: -122687217

real    0m0.086s
user    0m0.047s
sys     0m0.039s

With the file on SSD the difference fades even further than for the file stored on hard disk: running times for read() and mmap() are almost identical, contrary to the result on kernel 4.3.3.

A kernel dependency was also hinted at in CPU Usage Time Is Dependant on Load.

Both binaries, whether compiled by gcc 5.3.0 on Arch or by gcc 4.8.4 on Ubuntu, use loop unrolling for all three functions mmaptst(), freadtst(), and readtst(). This can be seen with:

objdump -d tbytesum1

Here is the assembler code:

00000000004008e0 <mmaptst>:
  4008e0:	41 55                	push   %r13
  4008e2:	41 54                	push   %r12
  4008e4:	31 c0                	xor    %eax,%eax
  4008e6:	55                   	push   %rbp
  4008e7:	53                   	push   %rbx
  4008e8:	48 89 f5             	mov    %rsi,%rbp
. . .
  400a5a:	66 0f fe c1          	paddd  %xmm1,%xmm0
  400a5e:	66 0f 7e c0          	movd   %xmm0,%eax
  400a62:	01 c3                	add    %eax,%ebx
  400a64:	89 f8                	mov    %edi,%eax
  400a66:	48 01 c2             	add    %rax,%rdx
  400a69:	41 39 fa             	cmp    %edi,%r10d
  400a6c:	0f 84 a7 00 00 00    	je     400b19 <mmaptst+0x239>
  400a72:	0f be 02             	movsbl (%rdx),%eax
  400a75:	01 c3                	add    %eax,%ebx
  400a77:	83 fe 01             	cmp    $0x1,%esi
  400a7a:	0f 84 99 00 00 00    	je     400b19 <mmaptst+0x239>
  400a80:	0f be 42 01          	movsbl 0x1(%rdx),%eax
  400a84:	01 c3                	add    %eax,%ebx
  400a86:	83 fe 02             	cmp    $0x2,%esi
  400a89:	0f 84 8a 00 00 00    	je     400b19 <mmaptst+0x239>
  400a8f:	0f be 42 02          	movsbl 0x2(%rdx),%eax
. . .
  400b06:	74 11                	je     400b19 <mmaptst+0x239>
  400b08:	0f be 42 0d          	movsbl 0xd(%rdx),%eax
  400b0c:	01 c3                	add    %eax,%ebx
  400b0e:	83 fe 0e             	cmp    $0xe,%esi
  400b11:	74 06                	je     400b19 <mmaptst+0x239>
  400b13:	0f be 42 0e          	movsbl 0xe(%rdx),%eax
  400b17:	01 c3                	add    %eax,%ebx
  400b19:	44 89 ef             	mov    %r13d,%edi
. . .

Compiling with gcc 5.3.0 and option -march=native, i.e.,

cc -Wall -O3 -march=native tbytesum1.c -o tbytesum1N

and reading the Ubuntu ISO file from HD reduces real time by roughly 10% (152ms down to 138ms), and reduces user time by roughly 15% (110ms down to 93ms). The generated code uses AVX vector instructions from the vpadd family together with vpsrldq and vpmovsxwd.

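Added 05-Nov-2017: In the blog article "ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}", Andrew Gallant compares the performance of ag (Silver Searcher) and rg (ripgrep). In the quote below, (1) refers to searching a file through a memory map and (2) to reading it through an intermediate buffer. He says: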

Naively, it seems like (1) would be obviously faster. Surely, all of the bookkeeping and copying in (2) would make it much slower! In fact, this is not at all true. (1) may not require much bookkeeping from the perspective of the programmer, but there is a lot of bookkeeping going on inside the Linux kernel to maintain the memory map. (That link goes to a mailing list post that is quite old, but it still appears relevant today.)

When I first started writing ripgrep, I used the memory map approach. It took me a long time to be convinced enough to start down the second path with an intermediate buffer (because neither a CPU profile nor the output of strace ever showed any convincing evidence that memory maps were to blame), but as soon as I had a prototype of (2) working, it was clear that it was much faster than the memory map approach.

With all that said, memory maps aren’t all bad. They just happen to be bad for the particular use case of “rapidly open, scan and close memory maps for thousands of small files.” For a different use case, like, say, “open this large file and search it once,” memory maps turn out to be a boon. We’ll see that in action in our single-file benchmarks later.

Added 30-Oct-2023: This post is referenced in Prequel to liburing_b3sum. That article goes into much detail about O_DIRECT and io_uring, and shows many performance benchmarks. Highly recommended reading.