Writing files 5 to 9 times faster than fprintf

I recently added a new unsynchronized file output API to the {fmt} library. Together with format string compilation, zero memory allocations, and locale-independent formatting by default, this gives you high-performance file output from a single thread.

Here’s a small example demonstrating the new API:

#include <fmt/os.h>

int main() {
  auto f = fmt::output_file("guide");
  f.print("The answer is {}.", 42);
}

Initially I picked BUFSIZ as the default buffer size because according to the glibc manual:

The value of BUFSIZ is chosen on each system so as to make stream I/O efficient.

Later I added the ability to pass a buffer size:

auto f = fmt::output_file("guide", fmt::buffer_size=4096);

and decided to use the new powers to check if BUFSIZ is actually a good default.

To this end I created a little benchmark that writes a roughly 10 MB file (one million 10-byte lines) with different values of the buffer size, starting from BUFSIZ:

#include <benchmark/benchmark.h>
#include <fmt/compile.h>
#include <fmt/os.h>
#include <stdio.h>

auto test_data = "test data";
auto num_iters = 1'000'000;

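// Deletes the file left over from the previous iteration with the timer
// paused, so the deletion is not measured, then returns the path unchanged.
const char* removed(benchmark::State& state, const char* path) {
  state.PauseTiming();
  std::remove(path);
  state.ResumeTiming();
  return path;
}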

void fprintf(benchmark::State& state) {
  for (auto s : state) {
    auto f = fopen(removed(state, "/tmp/fprintf-test"), "wb");
    for (int i = 0; i < num_iters; ++i)
      fprintf(f, "%s\n", test_data);
    fclose(f);
  }
}
BENCHMARK(fprintf);

void fmt_print_compile(benchmark::State& state) {
  for (auto s : state) {
    auto f = fmt::output_file(removed(state, "/tmp/fmt-compile-test"),
                              fmt::buffer_size=state.range(0));
    for (int i = 0; i < num_iters; ++i)
      f.print(FMT_COMPILE("{}\n"), test_data);
  }
}
BENCHMARK(fmt_print_compile)->RangeMultiplier(2)->Range(BUFSIZ, 1 << 20);

BENCHMARK_MAIN();

The full benchmark is available here.

Running it on macOS with an AP1024M SSD shows that BUFSIZ, which is equal to 1024 on this system, is suboptimal, to put it mildly. By switching to a larger buffer we can make {fmt}’s print more than 9 times faster than fprintf:

Run on (8 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 2.21, 1.96, 2.02
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
fprintf                    101143652 ns    100765429 ns            7
fmt_print_compile/1024      46979210 ns     46620467 ns           15
fmt_print_compile/2048      34731403 ns     34565400 ns           20
fmt_print_compile/4096      29536168 ns     29223478 ns           23
fmt_print_compile/8192      19930252 ns     19783206 ns           34
fmt_print_compile/16384     17627005 ns     15691674 ns           46
fmt_print_compile/32768     12819629 ns     12684212 ns           52
fmt_print_compile/65536     17585901 ns     12323373 ns           59
fmt_print_compile/131072    17223877 ns     11012742 ns           62
fmt_print_compile/262144    10815320 ns     10711619 ns           63
fmt_print_compile/524288    16649142 ns     10812969 ns           64
fmt_print_compile/1048576   16610093 ns     10747453 ns           64

(In the accompanying chart, the green baseline shows the fprintf time with the default, fixed buffer size.)

There are some fluctuations in the wall clock time, but you can see that increasing the buffer size even by a small factor gives a big performance boost. There is a trade-off between memory and speed, with diminishing returns as the buffer size grows. A good default seems to be somewhere in the range of 8 KiB to 128 KiB.

Running the same benchmark on GNU/Linux with a Samsung SSD 970 PRO gives completely different results. BUFSIZ is 8 KiB there, and making the buffer bigger gives only a moderate ~9% improvement. However, this time {fmt} is more than 5 times faster than fprintf even with the default buffer size.

Run on (16 X 5000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 256K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.44, 0.27, 0.15
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
fprintf                     40097543 ns     40098696 ns           18
fmt_print_compile/8192       7661507 ns      7661109 ns           90
fmt_print_compile/16384      7326631 ns      7326616 ns           95
fmt_print_compile/32768      7153550 ns      7152801 ns           97
fmt_print_compile/65536      7196112 ns      7057630 ns           97
fmt_print_compile/131072     7023975 ns      7024247 ns           98
fmt_print_compile/262144     7052366 ns      7051588 ns           98
fmt_print_compile/524288     7033374 ns      7033412 ns           98
fmt_print_compile/1048576    7028619 ns      7028467 ns           98

Note that in this case wall clock and CPU time are similar so only the former is shown.

Based on these findings, the default file buffer size in {fmt} has been increased to max(BUFSIZ, 32768), which gives a 3.5x improvement on macOS and a ~7% improvement on Linux on the above benchmark. As mentioned earlier, it’s possible to pass a different size when opening a file, which, unlike a similar stdio API, avoids reallocation.

Summary

With an increased default buffer size, {fmt} now provides a simple and efficient file output API that is 5 to 9 times faster than fprintf on this benchmark (possibly more on numeric formatting).
