Performance Comparison of mmap() versus read() versus fread()
Original post is here eklausmeier.goip.de/blog/2016/02-03-performance-comparison-mmap-versus-read-versus-fread.
I recently read Computers are fast! by Julia Evans, which compares fread() and mmap() and suggests that both calls deliver roughly the same performance. Unfortunately, the programs referenced there, bytesum.c for fread() and bytesum_mmap.c for mmap(), do not really compare the same thing: the first adds up size_t values, the second adds up uint8_t values. On my computer these programs behave differently and therefore show different performance.
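To illustrate why summing size_t values and summing uint8_t values are not the same computation, here is a minimal sketch. The function names sum_as_bytes() and sum_as_words() are made up for this illustration and are not taken from bytesum.c or bytesum_mmap.c; the sketch only assumes that the two programs differ in the element type they add up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum a buffer one byte at a time: n additions. */
static uint64_t sum_as_bytes(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint8_t)buf[i];
    return sum;
}

/* Sum the same buffer in size_t-sized words: only n / sizeof(size_t)
 * additions. Trailing bytes that do not fill a whole word are ignored. */
static uint64_t sum_as_words(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i + sizeof(size_t) <= n; i += sizeof(size_t)) {
        size_t word;
        memcpy(&word, buf + i, sizeof word);  /* avoids unaligned access */
        sum += word;
    }
    return sum;
}
```

Both the result and the number of additions per file differ between the two functions, so timing one against the other mixes the cost of the I/O method with the cost of the summation.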
I reprogrammed the comparison, adding read() to fread() and mmap(). The code is on GitHub. Compile with
cc -Wall -O3 tbytesum1.c -o tbytesum1
For this program the results are as follows:
/home/klm/c: time ./tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.187s
user 0m0.077s
sys 0m0.110s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.193s
user 0m0.100s
sys 0m0.090s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.186s
user 0m0.080s
sys 0m0.103s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.196s
user 0m0.110s
sys 0m0.083s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.152s
user 0m0.110s
sys 0m0.040s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.159s
user 0m0.113s
sys 0m0.043s
The file in question is the Ubuntu ISO image for the server edition, roughly 564 MB, stored on a classical hard drive (Seagate 2TB drive). fread() and read() make no difference. The mmap()'ed version needs roughly half the system time (83ms down to 40ms), which reduces the total running time by roughly 20% (186ms down to 152ms).
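The three access methods compared above can be sketched as follows. This is not the code from the GitHub repository: the names sum_read(), sum_fread(), sum_mmap(), add_bytes() and the buffer size BUFSZ are assumptions made for this illustration, and the sum is kept in a long long to avoid signed overflow.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BUFSZ 65536   /* assumed buffer size, not taken from the original */

/* Add n bytes, each sign-extended, to the running sum. */
static long long add_bytes(const signed char *p, size_t n, long long sum)
{
    for (size_t i = 0; i < n; ++i)
        sum += p[i];
    return sum;
}

/* Variant 1: read(2) copies file data into a user-space buffer. */
static int sum_read(const char *path, long long *out)
{
    signed char buf[BUFSZ];
    long long sum = 0;
    ssize_t n;
    assert(out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        sum = add_bytes(buf, (size_t)n, sum);
    close(fd);
    if (n < 0) return -1;
    *out = sum;
    return 0;
}

/* Variant 2: fread(3) adds stdio buffering on top of read(2). */
static int sum_fread(const char *path, long long *out)
{
    signed char buf[BUFSZ];
    long long sum = 0;
    size_t n;
    assert(out != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        sum = add_bytes(buf, n, sum);
    int err = ferror(fp);
    fclose(fp);
    if (err) return -1;
    *out = sum;
    return 0;
}

/* Variant 3: mmap(2) maps the file and lets page faults fetch the data. */
static int sum_mmap(const char *path, long long *out)
{
    struct stat st;
    assert(out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); *out = 0; return 0; } /* mmap rejects length 0 */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED) return -1;
    *out = add_bytes(p, (size_t)st.st_size, 0);
    munmap(p, (size_t)st.st_size);
    return 0;
}
```

All three variants run the same summation loop, so any timing difference comes from how the bytes reach user space.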
A similar test with a short video from an SSD:
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real 0m0.097s
user 0m0.033s
sys 0m0.063s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real 0m0.097s
user 0m0.050s
sys 0m0.043s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real 0m0.093s
user 0m0.050s
sys 0m0.040s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real 0m0.098s
user 0m0.043s
sys 0m0.053s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real 0m0.079s
user 0m0.050s
sys 0m0.027s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real 0m0.084s
user 0m0.050s
sys 0m0.030s
The AVI file is roughly 259 MB. Again, fread() and read() don't differ, but mmap() needs roughly 30% less system time.
These tests were conducted on a 4.3.3-3-ARCH x86_64 system with an AMD FX-8120 eight-core processor running at up to 3.1 GHz. The compiler was gcc 5.3.0.
Linus Torvalds made the following remarks on read() versus mmap() in mmap/mlock performance versus read:
People love `mmap()` and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to `mmap()`:
- quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
- page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of `mmap()`:
- if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, `mmap()` is just the greatest thing since sliced bread. This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that `mmap()` just wins. You may have random access patterns, and use `mmap()` as a way of keeping track of what data you actually needed.
- if the data is large, `mmap()` is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again. And the automatic sharing is obviously a case of this.
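The remark that page faults are how a mapping gets populated can be observed with a small sketch. It counts the minor page faults caused by touching every page of a fresh mapping, using getrusage(). As a simplification it uses an anonymous mapping instead of a file, so it needs no input file; the function names minor_faults() and faults_for_touching() are made up for this illustration, and the ru_minflt counter is assumed to be maintained by the system (it is on Linux).

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Number of minor page faults this process has caused so far. */
static long minor_faults(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return ru.ru_minflt;
}

/* Map npages of anonymous memory, write one byte per page, unmap again.
 * Returns the number of minor faults observed while touching the pages,
 * or -1 if the mapping could not be created. */
static long faults_for_touching(size_t npages)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)pagesz;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    long before = minor_faults();
    for (size_t i = 0; i < len; i += (size_t)pagesz)
        p[i] = 1;                 /* first touch of each page faults it in */
    long after = minor_faults();
    munmap(p, len);
    return after - before;
}
```

Calling faults_for_touching(64) typically reports a fault count in the order of the number of pages touched, although the kernel may batch faults, so the exact number is system-dependent.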
It is interesting to note that the performance is kernel-dependent. The same tests conducted on Ubuntu 14.04.3 LTS, kernel version 3.13.0-74-generic #118-Ubuntu SMP, x86_64, on the exact same hardware, give a less clear-cut picture:
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.191s
user 0m0.072s
sys 0m0.119s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.177s
user 0m0.073s
sys 0m0.104s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.187s
user 0m0.092s
sys 0m0.095s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.181s
user 0m0.077s
sys 0m0.104s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.174s
user 0m0.104s
sys 0m0.072s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real 0m0.175s
user 0m0.102s
sys 0m0.071s
Again, read() and fread() make no real difference. The difference in system time between read() or fread() and mmap() is about 25% (95ms down to 71ms). The compiler was gcc 4.8.4.
Testing a file on SSD:
root@chieftec:~# time ~klm/c/tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real 0m0.098s
user 0m0.039s
sys 0m0.059s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real 0m0.093s
user 0m0.047s
sys 0m0.047s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real 0m0.092s
user 0m0.036s
sys 0m0.056s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real 0m0.087s
user 0m0.040s
sys 0m0.047s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real 0m0.091s
user 0m0.047s
sys 0m0.043s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real 0m0.086s
user 0m0.047s
sys 0m0.039s
With the file on SSD the difference fades even further than for the file stored on hard disk: the running times of read() and mmap() are almost identical, contrary to the result on kernel 4.3.3.
A kernel dependency was also hinted at in CPU Usage Time Is Dependant on Load.
Both binaries, whether compiled by gcc 5.3.0 on Arch or by gcc 4.8.4 on Ubuntu, use loop unrolling for all three functions mmaptst(), freadtst(), and readtst(), which can be seen with:
objdump -d tbytesum1
Here is the assembler code:
00000000004008e0 <mmaptst>:
4008e0: 41 55 push %r13
4008e2: 41 54 push %r12
4008e4: 31 c0 xor %eax,%eax
4008e6: 55 push %rbp
4008e7: 53 push %rbx
4008e8: 48 89 f5 mov %rsi,%rbp
. . .
400a5a: 66 0f fe c1 paddd %xmm1,%xmm0
400a5e: 66 0f 7e c0 movd %xmm0,%eax
400a62: 01 c3 add %eax,%ebx
400a64: 89 f8 mov %edi,%eax
400a66: 48 01 c2 add %rax,%rdx
400a69: 41 39 fa cmp %edi,%r10d
400a6c: 0f 84 a7 00 00 00 je 400b19 <mmaptst+0x239>
400a72: 0f be 02 movsbl (%rdx),%eax
400a75: 01 c3 add %eax,%ebx
400a77: 83 fe 01 cmp $0x1,%esi
400a7a: 0f 84 99 00 00 00 je 400b19 <mmaptst+0x239>
400a80: 0f be 42 01 movsbl 0x1(%rdx),%eax
400a84: 01 c3 add %eax,%ebx
400a86: 83 fe 02 cmp $0x2,%esi
400a89: 0f 84 8a 00 00 00 je 400b19 <mmaptst+0x239>
400a8f: 0f be 42 02 movsbl 0x2(%rdx),%eax
. . .
400b06: 74 11 je 400b19 <mmaptst+0x239>
400b08: 0f be 42 0d movsbl 0xd(%rdx),%eax
400b0c: 01 c3 add %eax,%ebx
400b0e: 83 fe 0e cmp $0xe,%esi
400b11: 74 06 je 400b19 <mmaptst+0x239>
400b13: 0f be 42 0e movsbl 0xe(%rdx),%eax
400b17: 01 c3 add %eax,%ebx
400b19: 44 89 ef mov %r13d,%edi
. . .
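Read back into C, the listing appears to correspond to a loop that sign-extends each byte (movsbl) and adds it to a 32-bit sum (add %eax,%ebx), with a vectorized main part (paddd) and an unrolled remainder of at most 15 bytes (offsets 0x0 to 0xe). A hand-written analogue might look like the sketch below; the function name bytesum_blocked() and the block size of 16 are assumptions read off the listing, not taken from tbytesum1.c.

```c
#include <assert.h>
#include <stddef.h>

/* Sum signed bytes into a 32-bit int, in the shape the compiler chose:
 * whole 16-byte blocks first, then the remaining 0..15 bytes one by one. */
static int bytesum_blocked(const signed char *p, size_t n)
{
    assert(p != NULL || n == 0);
    int sum = 0;
    size_t i = 0;

    /* Main part: one 16-byte block per iteration (SSE in the listing). */
    for (; i + 16 <= n; i += 16) {
        int part = 0;
        for (int k = 0; k < 16; ++k)
            part += p[i + k];     /* each byte is sign-extended to int */
        sum += part;
    }

    /* Remainder: at most 15 bytes, the unrolled movsbl/add pairs. */
    for (; i < n; ++i)
        sum += p[i];

    return sum;
}
```

With optimization enabled the compiler produces this structure on its own from a plain byte loop, so writing it out by hand is only useful for reading the disassembly.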
Compiling with gcc 5.3.0 and the option -march=native, i.e.,
cc -Wall -O3 -march=native tbytesum1.c -o tbytesum1N
and reading the Ubuntu ISO file from HD reduces real time by roughly 10% (152ms down to 138ms), and reduces user time by roughly 15% (110ms down to 93ms). The generated code uses the processor's AVX instructions vpadd, vpsrldq, and vpmovsxwd.
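Added 05-Nov-2017: In the blog article "ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}", Andrew Gallant compares the performance of ag (Silver Searcher) and rg (ripgrep). In the quote below, (1) refers to searching a memory-mapped file and (2) to searching through an intermediate buffer. He says: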
Naively, it seems like (1) would be obviously faster. Surely, all of the bookkeeping and copying in (2) would make it much slower! In fact, this is not at all true. (1) may not require much bookkeeping from the perspective of the programmer, but there is a lot of bookkeeping going on inside the Linux kernel to maintain the memory map. (That link goes to a mailing list post that is quite old, but it still appears relevant today.)
When I first started writing ripgrep, I used the memory map approach. It took me a long time to be convinced enough to start down the second path with an intermediate buffer (because neither a CPU profile nor the output of strace ever showed any convincing evidence that memory maps were to blame), but as soon as I had a prototype of (2) working, it was clear that it was much faster than the memory map approach.
With all that said, memory maps aren’t all bad. They just happen to be bad for the particular use case of “rapidly open, scan and close memory maps for thousands of small files.” For a different use case, like, say, “open this large file and search it once,” memory maps turn out to be a boon. We’ll see that in action in our single-file benchmarks later.
Added 30-Oct-2023: This post is referenced in Prequel to liburing_b3sum. That article goes into much detail about O_DIRECT and io_uring, and shows many performance benchmarks. Highly recommended reading.