1. Introduction

There are already many articles about the benefits of SPDK. This article will not delve deeply into specific technical details; for the concrete technical implementations, see the reference links at the end.

2. What is SPDK

First, it must be clear that SPDK is a framework, not a distributed system. The foundation of SPDK (the official website uses the word ‘bedrock’) is a user-space, polled-mode, asynchronous, lockless NVMe driver that provides zero-copy, highly concurrent, direct access to SSDs from user space. Its initial purpose was to optimize block storage write operations. However, as SPDK has continued to evolve, people have discovered that it can optimize many parts of the storage software stack.

Many distributed storage systems are considering how to incorporate the SPDK framework, or adopt the high-performance storage technology represented by SPDK to optimize the entire IO path.

3. SPDK Design Philosophy

SPDK mainly achieves its high-performance solution by introducing the following technologies:

  1. Moving storage-related drivers to user space to avoid performance loss caused by system calls, and incidentally enabling zero-copy by directly using user space memory for write operations.
  2. Using polling mode
    1. Polling hardware queues, instead of the traditional interrupt mode, which brings unstable performance and added latency.
    2. Any business logic can register a polling function as a poller on an SPDK thread. Once registered, the function is executed periodically by SPDK, avoiding the overhead of event notification mechanisms such as epoll (see the poller sketch after this list).
  3. Avoiding the use of locks in the IO path. Using lock-free queues to pass messages/IO.
    • One of the main design goals of SPDK is to achieve linear performance scaling as hardware resources (e.g. SSDs, NICs, CPU cores) increase. To achieve this goal, SPDK's designers had to eliminate the overhead that comes with using more system resources, such as more threads, inter-process communication, and access to more storage hardware and network cards.
    • To reduce this performance overhead, SPDK introduced lock-free queues, using lock-free programming to avoid the performance loss caused by locks.
    • SPDK’s lock-free queues mainly rely on DPDK’s implementation, which essentially uses CAS (Compare and Swap) to implement a multi-producer multi-consumer FIFO queue. For the implementation of lock-free queues, you can refer to this article.
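
As referenced in the polling item above, here is a minimal sketch of registering a poller. spdk_poller_register and the SPDK_POLLER_BUSY/SPDK_POLLER_IDLE return values are real SPDK thread APIs; the context struct and the queue-draining logic are hypothetical placeholders.

```cpp
#include "spdk/thread.h"

struct my_ctx {
    struct spdk_poller *poller;
    /* ... a lock-free queue, connections, etc. ... */
};

/* Called repeatedly by the owning SPDK thread; it must never block or sleep. */
static int
my_queue_poll(void *arg)
{
    struct my_ctx *ctx = static_cast<struct my_ctx *>(arg);
    int processed = 0;
    /* drain a lock-free queue / check a hardware completion queue here */
    (void)ctx;
    return processed > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
}

static void
my_start(struct my_ctx *ctx)
{
    /* period 0 means: run on every iteration of the reactor loop */
    ctx->poller = spdk_poller_register(my_queue_poll, ctx, 0);
}
```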

In simple terms, the SPDK runtime fully occupies the specified CPU cores; at its core it is one big infinite while loop that monopolizes a CPU core, continuously running user-registered pollers and polling queues, network interfaces, etc. Therefore, the most basic principle of SPDK programming is to avoid process context switches on SPDK cores: a context switch breaks SPDK's high-performance model, causing performance degradation or even complete failure to work.

Process context switches can occur for many reasons, roughly listed below, and we must avoid all of them when programming with SPDK. I once encountered a case where an inconspicuous mmap system call on an SPDK thread entered the kernel, leaving the entire SPDK process unable to serve requests until it crashed.

  • CPU time slice exhaustion.
  • When a process has insufficient system resources (such as insufficient memory), it must wait until resources are met before it can run. During this time, the process will also be suspended, and the system will schedule other processes to run.
  • When a higher priority process needs to run, to ensure the operation of the high-priority process, the current process will be suspended, and the high-priority process will run.
  • Hardware interrupts will cause the process on the CPU to be suspended, switching to execute the kernel’s interrupt service routine.

4. Using SPDK to Accelerate NVMe Storage

SPDK aims to directly access NVMe SSD from user space, bypassing the kernel NVMe driver.

SPDK unbinds the NVMe SSD from the kernel driver and binds it to the VFIO or UIO driver. Although these two drivers do not themselves perform any initialization of the NVMe device, they give SPDK the ability to access the device directly; all subsequent initialization and command submission are handled by SPDK. Consequently, SPDK's NVMe SSD access functions correspond almost one-to-one to NVMe commands, e.g. the admin commands spdk_nvme_ctrlr_cmd_set_feature and spdk_nvme_ctrlr_cmd_get_log_page, and the IO path calls spdk_nvme_ctrlr_alloc_io_qpair, spdk_nvme_ns_cmd_read, etc. Of course, io_uring is already somewhat similar to the IO commands mentioned here :)
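
To make the correspondence concrete, here is a hedged sketch of a polled read of LBA 0 through the SPDK NVMe driver. The calls shown (spdk_nvme_ctrlr_alloc_io_qpair, spdk_zmalloc, spdk_nvme_ns_cmd_read, spdk_nvme_qpair_process_completions) are real SPDK APIs, but error handling is trimmed and the ctrlr/ns pointers are assumed to come from spdk_nvme_probe() at startup.

```cpp
#include "spdk/nvme.h"
#include "spdk/env.h"

static bool g_done = false;

static void
read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = true;
}

static void
read_lba0(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    struct spdk_nvme_qpair *qpair =
        spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);

    /* DMA-able buffer from SPDK's hugepage allocator (this is the zero-copy part) */
    void *buf = spdk_zmalloc(4096, 4096, NULL,
                             SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

    /* roughly a 1:1 wrapper around the NVMe Read command */
    spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* lba */, 1 /* lba count */,
                          read_done, NULL, 0 /* io_flags */);

    while (!g_done) {
        /* no interrupts: we poll the completion queue ourselves */
        spdk_nvme_qpair_process_completions(qpair, 0);
    }

    spdk_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
}
```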

5. SPDK Bdev

Based on the above accelerated access to NVMe storage, SPDK provides a block device (bdev) software stack. This block device is not the block device in the Linux system; the block device in SPDK is just an interface layer abstracted by software.

SPDK already provides various bdevs to cover different backend storage methods and testing needs, such as NVMe (the NVMe bdev covers both NVMe physical disks and NVMe-oF), memory (malloc bdev), and a bdev that performs no actual IO and completes immediately (null bdev). Users can also implement their own bdev; a very common way to use SPDK is for users to define a custom bdev that accesses their distributed storage cluster.

Through the bdev interface layer, SPDK unifies the calling methods of block devices. Users can use various bdevs by adding different block devices to the SPDK process through different RPCs without modifying the code. Moreover, it is very simple for users to add their own bdev, which greatly expands the applicable scenarios of SPDK.
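
From the consumer side, the unified interface looks roughly like the following hedged sketch: the same open/get-channel/read calls work regardless of which bdev backs the name. spdk_bdev_open_ext, spdk_bdev_get_io_channel, and spdk_bdev_read are real bdev APIs; the bdev name "Malloc0" is an assumption (create it first via RPC), and error handling is trimmed.

```cpp
#include "spdk/bdev.h"

static void
bdev_event_cb(enum spdk_bdev_event_type type, struct spdk_bdev *bdev, void *ctx)
{
    /* react to hot-remove / resize events */
}

static void
read_cb(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    spdk_bdev_free_io(bdev_io);
}

static void
read_from_bdev(char *buf)
{
    struct spdk_bdev_desc *desc = NULL;
    spdk_bdev_open_ext("Malloc0", false /* read-only */, bdev_event_cb,
                       NULL, &desc);

    /* one channel per thread: this is how the bdev layer stays lockless */
    struct spdk_io_channel *ch = spdk_bdev_get_io_channel(desc);

    spdk_bdev_read(desc, ch, buf, 0 /* offset */, 4096 /* nbytes */,
                   read_cb, NULL);
}
```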

At this point, readers should understand that SPDK's current application scenarios mainly target block storage. It can be said that block storage is the foundation of the entire storage system: on top of it we build file storage, object storage, table storage, databases, and so on. Like various cloud-native databases, we can build upper-layer distributed systems directly on distributed block storage and object storage. We can also push the metadata and indexes that upper-layer storage must manage down to the block layer, using SPDK directly to optimize the upper layers. For example, current block storage uses the LBA as the index granularity; we could change the index to files/objects and build file/object storage systems on top.

6. SPDK Applications

The most direct way to use SPDK is as a storage engine to accelerate access to NVMe SSD. On top of this, SPDK has abstracted the bdev layer. Various businesses can expose block storage devices to users in some way by binding to bdev. This is what we will discuss next regarding various application scenarios.

The purpose of enterprises using block storage is to expose backend storage (local disks or a distributed storage cluster) to users. We will not discuss delivery forms such as public cloud or private cloud for now. In terms of presentation alone, there are many ways to expose a block device to users; I have mainly encountered the following:

  1. Using network storage protocols such as iSCSI/NBD/NVMe-oF, establishing a client (called initiator in iscsi/nvmeof protocols) on the user’s host machine to access the block device.
  2. Through virtio virtualization, providing some type of block device on the host OS to virtual machines. Among these, different device types correspond to different drivers, and their IO paths also vary, such as vhost-user-blk, vfio-pci, virtio-blk, etc.
  3. On bare metal, a smart NIC/DPU creates PFs/VFs that emulate NVMe/virtio block devices.

For the first two methods, SPDK provides the corresponding backend drivers, such as the iSCSI target, NVMe-oF target, and vhost target. The third method differs in implementation details between vendors and may not be open source. We use SPDK as the backend driver for these methods to receive and process IO from clients. The advantage is that we can leverage SPDK's high-performance storage framework, i.e. the user-space, polled-mode, asynchronous, lockless design mentioned earlier. The SPDK website has many test reports comparing SPDK with other open-source implementations, and the gains are considerable.

7. Summary

I have been using SPDK for two years and am generally quite satisfied. The SPDK community provides various channels for communication, and the SPDK China team frequently publishes technical articles, videos, etc. Moreover, SPDK keeps evolving, absorbing and supporting new software and hardware features. By learning SPDK, we are exposed to many layers of the storage technology stack. Therefore, I believe that as a storage professional, whether or not you use SPDK at work, you should understand the high-performance storage techniques behind it.

Network Stack

SPDK covers the entire path from the front end, through network transport, to back-end writes.

  1. SPDK official documentation (Storage Performance Development Kit)
  2. Virtualization Technology - Overview [Part 1]
  3. 20.01 SPDK NVMe-oF RDMA Performance Report

QoS Rate Limiting Algorithms Introduction

Rate limiting strategies mainly include token bucket and leaky bucket, which are introduced as follows.

Token Bucket

Wiki’s algorithm description of token bucket is as follows:

  • A token is added to the bucket every 1/r seconds.
  • The bucket can hold at the most b tokens. If a token arrives when the bucket is full, it is discarded.
  • When a packet (network layer PDU) of n bytes arrives,
    • if at least n tokens are in the bucket, n tokens are removed from the bucket, and the packet is sent to the network.
    • if fewer than n tokens are available, no tokens are removed from the bucket, and the packet is considered to be non-conformant.

A bucket of fixed capacity holds tokens, the capacity being the upper limit of the token count. Tokens are replenished at a fixed rate until the bucket is full. An IO request consumes one token: if a token is available, the IO is let through after consuming it; otherwise it is not (the algorithm may choose whether to discard the IO request). If rate limiting is by byte count, each IO consumes tokens equal to its iosize.
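
As a concrete illustration, here is a minimal single-threaded token bucket sketch of the algorithm described above (continuous refill rather than a timer tick; illustrative only, not production code):

```cpp
#include <algorithm>
#include <chrono>

class TokenBucket {
    using Clock = std::chrono::steady_clock;

public:
    TokenBucket(double rate, double capacity)
        : rate_(rate), capacity_(capacity), tokens_(capacity), last_(Clock::now()) {}

    // tokens = 1 per IO for IOPS limiting, or tokens = iosize for BPS limiting
    bool try_pass(double tokens) {
        refill();
        if (tokens_ >= tokens) {
            tokens_ -= tokens;  // conformant: let the IO through
            return true;
        }
        return false;           // non-conformant: queue or reject the IO
    }

private:
    void refill() {
        auto now = Clock::now();
        std::chrono::duration<double> elapsed = now - last_;
        last_ = now;
        tokens_ = std::min(capacity_, tokens_ + elapsed.count() * rate_);
    }

    double rate_;      // tokens replenished per second (the average rate)
    double capacity_;  // bucket size: bounds how large a burst can be
    double tokens_;
    Clock::time_point last_;
};
```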

Based on the above description, we can understand that the token bucket algorithm can achieve the following effects:

  1. The token bucket algorithm can control the rate of processing IO requests by controlling the token replenishment rate;
  2. The token bucket algorithm allows a certain degree of burst: as long as the tokens in the bucket are not exhausted, IO requests can immediately consume tokens and pass through. During a burst the IO processing rate exceeds the token replenishment rate; the replenishment rate thus represents the average processing rate;
  3. The token bucket algorithm cannot cap the burst rate or the burst duration. The burst duration is determined by the actual IO request rate: if the actual IO request rate exceeds the token replenishment rate and stays constant, then burst duration = token bucket capacity / (actual IO request rate - token replenishment rate).

Leaky Bucket

Leaky bucket as a meter

Wiki defines Leaky bucket as a meter as follows:

  • A fixed capacity bucket, associated with each virtual connection or user, leaks at a fixed rate.
  • If the bucket is empty, it stops leaking.
  • For a packet to conform, it has to be possible to add a specific amount of water to the bucket: The specific amount added by a conforming packet can be the same for all packets, or can be proportional to the length of the packet.
  • If this amount of water would cause the bucket to exceed its capacity then the packet does not conform and the water in the bucket is left unchanged.

We can understand it as follows:

A bucket leaks water at a fixed rate, and each passing IO request adds water to the bucket. The amount of water added depends on the dimension being limited, which could be bytes or IO count. If adding the water would cause an overflow, the IO cannot pass; otherwise it is allowed through.

It can be seen that this algorithm description is basically similar to the token bucket. We can consider Leaky bucket as a meter and Token Bucket to be equivalent.
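
Accordingly, the meter version can be written as a mirror image of the token bucket sketch above: water leaks out at a fixed rate and a conforming IO pours water in (again, an illustrative sketch only):

```cpp
#include <algorithm>
#include <chrono>

class LeakyBucketMeter {
    using Clock = std::chrono::steady_clock;

public:
    LeakyBucketMeter(double leak_rate, double capacity)
        : leak_rate_(leak_rate), capacity_(capacity), level_(0.0), last_(Clock::now()) {}

    // amount = 1 per IO, or the iosize when limiting bytes
    bool try_pass(double amount) {
        leak();
        if (level_ + amount <= capacity_) {
            level_ += amount;   // conformant: the water fits in the bucket
            return true;
        }
        return false;           // would overflow: non-conformant
    }

private:
    void leak() {
        auto now = Clock::now();
        std::chrono::duration<double> elapsed = now - last_;
        last_ = now;
        level_ = std::max(0.0, level_ - elapsed.count() * leak_rate_);
    }

    double leak_rate_;  // units leaked per second (the average rate)
    double capacity_;   // bucket size: bounds the burst
    double level_;
    Clock::time_point last_;
};
```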

Leaky bucket as a queue

Wiki's description of this rate limiting strategy is: "The leaky bucket consists of a finite queue. When a packet arrives, if there is room on the queue it is appended to the queue; otherwise it is discarded. At every clock tick one packet is transmitted (unless the queue is empty)."

It can be considered that Leaky bucket as a queue is the scenario where the token bucket size equals 1.
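
A minimal sketch of the queue flavor (illustrative only): arrivals join a bounded queue or are dropped, and each clock tick releases one packet at the fixed output rate:

```cpp
#include <cstddef>
#include <queue>

template <typename Packet>
class LeakyBucketQueue {
public:
    explicit LeakyBucketQueue(std::size_t capacity) : capacity_(capacity) {}

    // Called on arrival: append if there is room, otherwise drop.
    bool enqueue(const Packet &p) {
        if (q_.size() >= capacity_) {
            return false;  // bucket (queue) is full: packet is discarded
        }
        q_.push(p);
        return true;
    }

    // Called once per clock tick: transmit one packet at the fixed rate.
    bool dequeue(Packet &out) {
        if (q_.empty()) {
            return false;
        }
        out = q_.front();
        q_.pop();
        return true;
    }

private:
    std::size_t capacity_;
    std::queue<Packet> q_;
};
```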

Mainstream Block Device Flow Control Solutions

The mainstream flow control strategies widely applied in engineering come mainly from three implementations: Qemu, librbd, and SPDK, introduced below.

Qemu

Qemu supported block device IO rate limiting as early as version 1.1, providing 6 configuration items to set rate limits for IOPS and bandwidth in 6 different scenarios. Version 1.7 added support for bursts to block device IO rate limiting. Version 2.6 improved the burst support functionality, allowing control over burst rate and duration. The parameters are as follows:

| Scenario | Basic Rate Limit Config | Burst Rate Config | Burst Duration Config |
| --- | --- | --- | --- |
| Total IOPS | iops-total | iops-total-max | iops-total-max-length |
| Read IOPS | iops-read | iops-read-max | iops-read-max-length |
| Write IOPS | iops-write | iops-write-max | iops-write-max-length |
| Total BPS | bps-total | bps-total-max | bps-total-max-length |
| Read BPS | bps-read | bps-read-max | bps-read-max-length |
| Write BPS | bps-write | bps-write-max | bps-write-max-length |

The core data structure of its implementation is described as follows:

```c
typedef struct LeakyBucket {
    uint64_t avg;          /* IO limit target rate */
    uint64_t max;          /* IO burst limit rate */
    double level;          /* bucket level in units */
    double burst_level;    /* bucket level in units (for computing bursts) */
    uint64_t burst_length; /* burst duration, in seconds by default */
} LeakyBucket;
```

Qemu’s flow control algorithm uses a leaky bucket implementation. The goal of the algorithm is that the user can have a burst rate of bkt.max for bkt.burst_length seconds, after which the rate drops to bkt.avg.

To achieve this goal, qemu implements two buckets:

  1. Main bucket: Size bucket_size is bkt.max * bkt.burst_length, leaking at rate bkt.avg. Normal IOs are processed by the main bucket first.
  2. Burst bucket: Size burst_bucket_size is set to one-tenth of the main bucket, leaking at rate bkt.max.

If the main bucket is full, the IO must wait for it to leak. If the main bucket is not full and a burst bucket is configured, the burst bucket is checked as well. In this way the burst bucket enforces the IO burst rate, while the size of the main bucket guarantees the burst duration.

The key function controlling whether IO can be allowed is as follows:

```c
/* This function computes the wait time in ns that a leaky bucket should trigger
 *
 * @bkt: the leaky bucket we operate on
 * @ret: the resulting wait time in ns or 0 if the operation can go through
 */
int64_t throttle_compute_wait(LeakyBucket *bkt)
{
    double extra;             /* the number of extra units blocking the io */
    double bucket_size;       /* I/O before throttling to bkt->avg */
    double burst_bucket_size; /* Before throttling to bkt->max */

    if (!bkt->avg) {
        return 0;
    }

    if (!bkt->max) {
        /* If bkt->max is 0 we still want to allow short bursts of I/O
         * from the guest, otherwise every other request will be throttled
         * and performance will suffer considerably. */
        bucket_size = (double) bkt->avg / 10;
        burst_bucket_size = 0;
    } else {
        /* If we have a burst limit then we have to wait until all I/O
         * at burst rate has finished before throttling to bkt->avg */
        bucket_size = bkt->max * bkt->burst_length;
        burst_bucket_size = (double) bkt->max / 10;
    }

    /* If the main bucket is full then we have to wait */
    extra = bkt->level - bucket_size;
    if (extra > 0) {
        return throttle_do_compute_wait(bkt->avg, extra);
    }

    /* If the main bucket is not full yet we still have to check the
     * burst bucket in order to enforce the burst limit */
    if (bkt->burst_length > 1) {
        assert(bkt->max > 0); /* see throttle_is_valid() */
        extra = bkt->burst_level - burst_bucket_size;
        if (extra > 0) {
            return throttle_do_compute_wait(bkt->max, extra);
        }
    }

    return 0;
}
```

librbd

Ceph has supported IO rate limiting for RBD images since version 13.2.0 (Mimic). That version only supported the total IOPS scenario; it supported bursts, allowing a burst rate to be configured but not the burst duration (effectively the burst duration is fixed at 1 second). Version 14.2.0 (Nautilus) added rate limiting for 5 more scenarios: read IOPS, write IOPS, total BPS, read BPS, and write BPS, with the same burst semantics.

Librbd’s rate limiting mechanism supports bursts and allows configuration of burst rates, but does not support controlling burst duration. It is implemented using a token bucket. The token bucket refill rate can be adjusted using the rbd_qos_schedule_tick_min parameter, defaulting to 50ms. Users can configure the base rate and burst rate through the following parameters:

| Scenario | Basic Rate Limit Config | Burst Rate Config |
| --- | --- | --- |
| Total IOPS | rbd_qos_iops_limit | rbd_qos_iops_burst |
| Read IOPS | rbd_qos_iops_read_limit | rbd_qos_iops_read_burst |
| Write IOPS | rbd_qos_iops_write_limit | rbd_qos_iops_write_burst |
| Total BPS | rbd_qos_bps_limit | rbd_qos_bps_burst |
| Read BPS | rbd_qos_bps_read_limit | rbd_qos_bps_read_burst |
| Write BPS | rbd_qos_bps_write_limit | rbd_qos_bps_write_burst |

spdk

SPDK's QoS rate limiting is implemented at the bdev layer using a token bucket. It supports separate configuration of IOPS and bandwidth but does not support burst rates. Configuration is done via the RPC bdev_set_qos_limit, with the following parameters:

| Parameter | Explanation |
| --- | --- |
| rw_ios_per_sec | IOPS limit |
| rw_mbytes_per_sec | Read/write bandwidth limit |
| r_mbytes_per_sec | Read bandwidth limit |
| w_mbytes_per_sec | Write bandwidth limit |

SPDK refills the token bucket from a registered poller, bdev_channel_poll_qos, whose period is hardcoded as SPDK_BDEV_QOS_TIMESLICE_IN_USEC, defaulting to 1 ms. The number of tokens added per refill is the configured rate divided by the number of timeslices per second; for example, an IOPS limit of 10000 with the 1 ms timeslice adds 10 tokens per refill.

An IO must pass all configured token buckets before it is allowed through. A single operation may drive a token bucket negative; once negative, no IO is allowed until bdev_channel_poll_qos refills the bucket back to a positive value.

```c
static int
bdev_channel_poll_qos(void *arg)
{
    struct spdk_bdev_qos *qos = arg;
    uint64_t now = spdk_get_ticks();
    int i;

    if (now < (qos->last_timeslice + qos->timeslice_size)) {
        /* We received our callback earlier than expected - return
         * immediately and wait to do accounting until at least one
         * timeslice has actually expired. This should never happen
         * with a well-behaved timer implementation.
         */
        return SPDK_POLLER_IDLE;
    }

    /* Reset for next round of rate limiting */
    for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
        /* We may have allowed the IOs or bytes to slightly overrun in the last
         * timeslice. remaining_this_timeslice is signed, so if it's negative
         * here, we'll account for the overrun so that the next timeslice will
         * be appropriately reduced.
         */
        if (qos->rate_limits[i].remaining_this_timeslice > 0) {
            qos->rate_limits[i].remaining_this_timeslice = 0;
        }
    }

    while (now >= (qos->last_timeslice + qos->timeslice_size)) {
        qos->last_timeslice += qos->timeslice_size;
        for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
            qos->rate_limits[i].remaining_this_timeslice +=
                qos->rate_limits[i].max_per_timeslice;
        }
    }

    return bdev_qos_io_submit(qos->ch, qos);
}
```

Impact of Rate Limiting Strategies on Block Devices

Different QoS strategies result in different experiences for the upper-layer block device, mainly reflected in IO latency and %util.

Latency

Latency is determined by the burst performance and refill frequency of the QoS strategy.

  • Burst Performance: Configuring a larger leaky/token bucket, or two buckets as Qemu does, improves the block device's burst performance, giving lower latency under bursty traffic.
    • This can be tested with fio: fio --group_reporting --rw=randwrite --bs=1M --numjobs=1 --iodepth=64 --ioengine=libaio --direct=1 --name test --size=2000G --filename=/dev/vdb -iodepth_low=0 -iodepth_batch_submit=64 -thinktime=950ms -thinktime_blocks=64. This issues 1M IOs at a queue depth of 64, waits 950 ms after each batch completes, and repeats. If the block device's burst performance is poor, you will observe high IO wait latency and an inability to reach high bandwidth; and because we wait a long time between batches, %util is also low.
  • Refill Frequency: A low refill frequency can cause severe tail latency. For example, if the token bucket refills every 1 second and the IOs issued within that second exceed the limit, some IOs will inevitably see latency above 1 second, i.e. high tail latency.

IO util

The disk %util value is defined as the proportion of time the disk spends processing IOs, i.e. the ratio of time the disk queue is non-empty to total time. If the rate limiting algorithm spreads IO processing very evenly (as in Leaky bucket as a queue, where IOs are processed one at a time at intervals) and the disk queue always holds IOs, then %util is naturally high.

For block devices configured with high burst performance, even very deep queues are processed quickly, so %util is naturally low.

Impact on Database Applications

Here we take the database MySQL built on distributed block devices as an example to discuss the impact of rate limiting strategies on SQL performance.

In MySQL, there are mainly two parts of IO that significantly affect performance:

  1. Flushing Dirty Pages
    1. To reduce the number of IOs and improve read/write performance, MySQL introduces the buffer pool. Modifications to data in MySQL are first made in the buffer pool and flushed at an appropriate time. When the memory data page and the disk data page content are inconsistent, we call this memory page a “dirty page”. After the memory data is written to disk, the content of the memory and disk data pages becomes consistent, called a “clean page”.
    2. From this usage scenario we can infer that each dirty-page flush involves large IOs at high queue depth. If the block device's burst performance is poor, dirty pages flush slowly, and as noted above this pattern very likely shows low %util. In other words, this IO model fails to use the full quota the rate limiter allows.
    3. The solutions are simple:
      1. Reduce the queue depth (the size of each dirty-page flush) to offset the lack of burst performance.
      2. Increase the flush frequency, thereby improving %util and making better use of the rate limit quota.
  2. Flushing Redo-log. Bin-log is not discussed here because the official stance is that the performance loss from enabling bin-log is less than 1%.
    1. Redo-log, to ensure correctness, is written sequentially by a single thread. If the block device’s burst performance is poor, it will cause high latency in issuing redo-log, dragging down the entire system’s TPS.
    2. If redo-log shares the same disk with other IOs, its own priority cannot be reflected, potentially increasing redo-log latency due to dirty page flushing triggering rate limiting.
    3. The solutions include the following:
      1. Increase the block device’s burst capability.
      2. Elevate the priority of redo-log so that it is issued first. If the block device system does not support IO priorities, you can apply for another disk to be used exclusively for redo-log.
      3. Have the upper-layer MySQL application support concurrently, non-sequentially written redo-log (PolarDB has reportedly implemented this).

References

  1. Wikipedia: Token bucket
  2. Wikipedia: Leaky bucket
  3. Token Bucket Rate Limiting: Qemu and Librbd QoS Rate Limiting Mechanism Comparison and Algorithm Analysis
  4. Written on the 10th Anniversary of Work
  5. qemu leaky bucket
  6. Huawei Cloud Disk EVS Burst Capability Introduction
  7. How to Test Cloud Disk Performance
  8. Stress Testing ESSD Cloud Disk IOPS Performance
  9. Rate Limiting Algorithm: Sliding Window Method

Background

Due to business requirements, our company uses a closed-source C++ program from Mellanox. Mellanox’s recommended customization approach is to perform custom development on the dynamically linked libraries to add additional functionality.

During the solution discussion phase, I found that many colleagues were not very clear about what dynamic/static libraries mean, especially when functions with the same name exist, nor about what the compile-time, link-time, and runtime results would be; hence this article.

Basic Concepts

Program function libraries can be divided into the following types:

1. Static libraries: copied wholesale into the compilation target at compile time, typically existing as .a files
2. Shared libraries: loaded when the program starts and sharable between different programs, typically existing as .so files
3. Dynamically loaded libraries: loaded, called, and closed at runtime using the functions from dlfcn.h

Testing Same-name Functions

We use two .c files, test1.c and test2.c, each containing a function with the same name, void test():

```c
// test1.c
#include <stdio.h>

void test() {
    printf("call from test1.c");
}
```

```c
// test2.c
#include <stdio.h>

void test() {
    printf("call from test2.c");
}
```

And main.c, which contains the main function:

```c
// main.c
extern void test();

int main() {
    test();
}
```

Test 1: .o Object Files

Use the following commands to generate object files from test1.c and test2.c and build the executable:

```sh
gcc -c ./test1.c
gcc -c ./test2.c
gcc -o main ./test1.o ./test2.o ./main.c
```

Resulting in error:

```
./test2.o: In function `test':
test2.c:(.text+0x0): multiple definition of `test'
./test1.o:test1.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
```

As we can see, linking object files containing same-name functions in the same namespace will result in a multiple definition error.

Test 2: Static Libraries

Use the following commands to build the static libraries libtest1.a and libtest2.a:

```sh
gcc -c ./test1.c
gcc -c ./test2.c
ar crv libtest1.a test1.o
ar crv libtest2.a test2.o
```

Then we link and compile:

```sh
gcc -L. ./main.c -ltest1 -ltest2 -o main
```

Compilation succeeds without errors. Execution result:

```
$ LD_LIBRARY_PATH=. ./main
call from test1.c
```

Some might ask: “Why no error? I clearly linked two static libraries containing same-name functions into the same executable.”

To investigate why no error occurred, let's add the ld option -Wl,--verbose to see what exactly happens during linking. Re-running the compilation, we get:

```
...

attempt to open ./libtest1.so failed
attempt to open ./libtest1.a succeeded
(./libtest1.a)test1.o
attempt to open ./libtest2.so failed
attempt to open ./libtest2.a succeeded

...
```

We can see that in the final link the output binary only pulled in the test1.o member from libtest1.a and took nothing from libtest2.a. The linker's behavior is:

  1. The linker scans the libraries sequentially in link order.
  2. It finds libtest1.a first, sees that it provides the void test() needed by main, and links it in.
  3. By the time it scans libtest2.a, void test() is already satisfied by the symbol from libtest1.a, so nothing more is linked in.

A question on Stack Overflow also discusses this point.

If we use the ld parameter --whole-archive to forcibly link all of libtest1.a and libtest2.a, we see the same error as in Test 1:

```
$ gcc -L. ./main.c -Wl,--whole-archive -ltest1 -ltest2 -Wl,--no-whole-archive -o main
./libtest2.a(test2.o): In function `test':
test2.c:(.text+0x0): multiple definition of `test'
./libtest1.a(test1.o):test1.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
```

Test 3: Dynamic Libraries

Use the following commands to build the dynamic libraries libtest1.so and libtest2.so and the executable:

```sh
gcc -shared -fPIC -o libtest1.so test1.c
gcc -shared -fPIC -o libtest2.so test2.c
gcc -L. ./main.c -ltest1 -ltest2 -o main
```

Compilation succeeds without errors. Checking with ldd confirms that both libtest1.so and libtest2.so are indeed linked into the main executable. Execution result:

```
$ LD_LIBRARY_PATH=. ./main
call from test1.c
```

This shows that with dynamic linking, different libraries may contain same-name functions without breaking compilation. This follows from the nature of dynamic libraries: they are only loaded at runtime, and the load order follows the link order given at compile time, so symbols are resolved on a first-match basis.

We can also use LD_PRELOAD to load a chosen dynamic library ahead of all others, as shown below.
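
For example, reusing the binaries from Test 3, preloading libtest2.so should make its test() shadow the copy from libtest1.so (expected behavior under the Test 3 setup):

```
$ LD_PRELOAD=./libtest2.so LD_LIBRARY_PATH=. ./main
call from test2.c
```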

Applications of Same-name Functions

Some might question: Can it be used in daily work? The answer is definitely yes.

The simplest application scenario: if there is a function in an open-source library that I don't like, I can write my own version and, using the knowledge above, link it into the executable dynamically or statically to replace the version I don't like.

Common industrial applications include:

1. Library Replacement: the famous tcmalloc works this way. We link tcmalloc into the program, and as long as the tcmalloc library precedes libc in the search order, its versions replace the native memory management functions.
2. Mock Testing: Chen Shuo described in an article how to mock system calls in C++ unit tests. The "link seams" method exploits the fact that libc is usually dynamically linked, allowing system calls to be mocked in-process.

Why We Need Memory Alignment

  1. Modern processors do not read from and write to memory in byte-sized chunks. Instead, they access memory in two-, four-, eight-, 16-, or even 32-byte chunks, so accessing unaligned memory can incur significant overhead.
  2. All modern processors offer atomic instructions. These special instructions are crucial for synchronizing two or more concurrent tasks. For atomic instructions to perform correctly, the addresses you pass them must be at least four-byte aligned (to avoid memory accesses across pages); otherwise they may fail or, worse, silently corrupt data.
  3. Some instructions (like some AVX-512 instructions) are designed with memory alignment requirements, for speed.
  4. Modern compilers sometimes automatically pad structures, for backward compatibility and efficiency.
  5. Cache lines: the alignment of data may determine whether an operation touches one or two cache lines, and proper alignment reduces false sharing.

C++ in Practice

In most cases, C++ already handles memory alignment automatically. But sometimes we need finer control of the memory layout to achieve better performance. Moreover, as any overzealous C++ programmer would, we want to understand what happens behind the language so we can abuse it.

specify alignment requirement for a structure (on the stack)

An object, in C, is a region of data storage in the execution environment. Every object therefore has a size (determined with sizeof) and an alignment requirement (determined with alignof, since C11). Each basic type has a default alignment, meaning that it will, unless otherwise requested by the programmer, be aligned on a pre-determined boundary. The only notable alignment differences between an LP64 64-bit system and a 32-bit system are:

| type | 32-bit | 64-bit |
| --- | --- | --- |
| long | 4-byte | 8-byte |
| double | 8-byte on Windows; 4-byte on Linux (8-byte with the -malign-double option) | 8-byte |
| long long | 4-byte | 8-byte |
| long double | 8-byte with Visual C++; 4-byte with GCC | 8-byte with Visual C++; 16-byte with GCC |
| pointer | 4-byte | 8-byte |

Although the compiler normally allocates individual data items on aligned boundaries, data structures often have members with different alignment requirements. To maintain proper alignment the translator normally inserts additional unnamed data members so that each member is properly aligned. In addition, the data structure as a whole may be padded with a final unnamed member. This allows each member of an array of structures to be properly aligned.

Padding is only inserted when a structure member is followed by a member with a larger alignment requirement or at the end of the structure. By changing the ordering of members in a structure, it is possible to change the amount of padding required to maintain alignment. For example, if members are sorted by descending alignment requirements a minimal amount of padding is required. The minimal amount of padding required is always less than the largest alignment in the structure. Computing the maximum amount of padding required is more complicated, but is always less than the sum of the alignment requirements for all members minus twice the sum of the alignment requirements for the least aligned half of the structure members.

For example, here is a structure with members of various types, whose members total 8 bytes before padding:

```cpp
struct MixedData
{
    char Data1;
    short Data2;
    int Data3;
    char Data4;
};
```

After compilation the data structure will be supplemented with padding bytes to ensure a proper alignment for each of its members:

```cpp
struct MixedData  /* after compilation on a 64-bit x86 machine */
{
    char Data1;       /* 1 byte */
    char Padding1[1]; /* 1 byte: aligns the following 'short' on a 2-byte
                         boundary, assuming the structure starts at an
                         even address */
    short Data2;      /* 2 bytes */
    int Data3;        /* 4 bytes - largest structure member */
    char Data4;       /* 1 byte */
    char Padding2[3]; /* 3 bytes: makes the total structure size 12 bytes */
};
```
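
As suggested earlier, sorting the members by descending alignment removes the interior padding entirely; this sketch (same 64-bit x86 assumptions) shrinks the structure from 12 to 8 bytes:

```cpp
#include <iostream>

struct ReorderedData
{
    int Data3;   /* 4 bytes: largest alignment first */
    short Data2; /* 2 bytes */
    char Data1;  /* 1 byte */
    char Data4;  /* 1 byte */
};

int main()
{
    std::cout << sizeof(ReorderedData) << std::endl; // output 8: no padding needed
}
```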

Also, we can use #pragma pack(n) to specify the packing alignment for structure, union, and class members, where n becomes the new packing alignment value. In particular, #pragma pack(1) disables padding entirely, as sketched below.
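
A minimal sketch of the packed case, assuming the same MixedData members as above; with #pragma pack(1) the compiler inserts no padding, so the size drops back to the sum of the members:

```cpp
#include <iostream>

#pragma pack(push, 1) /* no padding inside this region */
struct PackedData
{
    char Data1;
    short Data2;
    int Data3;
    char Data4;
};
#pragma pack(pop)

int main()
{
    /* Note: members may now be misaligned; access can be slower, or even
     * faulting on some architectures, so pack sparingly. */
    std::cout << sizeof(PackedData) << std::endl; // output 8
}
```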

alignas

Since C++11, we can use alignas to specify the alignment requirement of a type or an object. If multiple alignas specifiers apply, the strictest (largest) alignment is chosen.

```cpp
#include <iostream>

struct alignas(16) Bar
{
    int i;                  // 4 bytes
    int n;                  // 4 bytes
    alignas(4) char arr[3];
    short s;                // 2 bytes
};

int main()
{
    std::cout << alignof(Bar) << std::endl; // output 16
}
```

memory alignment for heap memory allocation

The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or 16 on 64-bit systems). If we need a block whose address is a multiple of a higher power of two, we use aligned_alloc or posix_memalign (both declared in stdlib.h).

If you use a gcc/clang compiler supporting C++17 or above, you can use aligned_alloc to get a specific alignment.

```cpp
void *aligned_alloc( size_t alignment, size_t size );
```
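
A minimal usage sketch (C++17; note that the size is required to be a multiple of the alignment):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

int main()
{
    /* a 1 KiB block aligned to a 64-byte cache line */
    void *p = std::aligned_alloc(64, 1024);
    assert(reinterpret_cast<std::uintptr_t>(p) % 64 == 0);
    std::free(p);
}
```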

C++17 also has a new feature called aligned new to support allocation alignment:

```cpp
void* operator new  ( std::size_t count, std::align_val_t al );
```

Moreover, in C++17 (GCC >= 7, Clang > 5, MSVC >= 19.12) the standard allocators have been updated to respect a type's alignment, so containers can allocate memory that meets the alignment requirement.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <new>
#include <vector>
#include <immintrin.h>

class alignas(32) Vec3d {
    double x, y, z;
};

class foo {
    int x;
};

int main() {
    std::cout << sizeof(Vec3d) << std::endl;  // output 32
    std::cout << alignof(Vec3d) << std::endl; // output 32

    // specify align_val_t, but we need to call the destructor manually.
    auto p_aligned_type = new (std::align_val_t{32}) foo;
    p_aligned_type->~foo();
    ::operator delete(p_aligned_type, std::align_val_t{32});

    // using a container to allocate aligned memory.
    std::vector<__m256> vec(10);
    vec.push_back(_mm256_set_ps(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f));
    assert(reinterpret_cast<std::uintptr_t>(vec.data()) % alignof(__m256) == 0);
}
```

References

  1. Data structure alignment
  2. Purpose of memory alignment
  3. Data alignment: Straighten up and fly right
  4. Gallery of Processor Cache Effects
  5. Alignment
  6. Allocating Aligned Memory Blocks