3 changes: 2 additions & 1 deletion assets/contributors.csv
@@ -104,4 +104,5 @@ Alejandro Martinez Vicente,Arm,,,,
Mohamad Najem,Arm,,,,
Ruifeng Wang,Arm,,,,
Zenon Zhilong Xiu,Arm,,zenon-zhilong-xiu-491bb398,,
Zbynek Roubalik,Kedify,,,,
Zbynek Roubalik,Kedify,,,,
Yahya Abouelseoud,Arm,,,,
@@ -0,0 +1,19 @@
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Linux kernel profiling with Arm Streamline

Performance tuning is not limited to user-space applications—kernel modules can also benefit from careful analysis. [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) is a powerful software profiling tool that helps developers understand performance bottlenecks, hotspots, and memory usage, even inside the Linux kernel. This learning path explains how to use Arm Streamline to profile a simple kernel module.

### Why profile a kernel module?

Kernel modules often operate in performance-critical paths, such as device drivers or networking subsystems. Even a small inefficiency in a module can affect the overall system performance. Profiling enables you to:

- Identify hotspots (functions consuming most CPU cycles)
- Measure cache and memory behavior
- Understand call stacks for debugging performance issues
@@ -0,0 +1,71 @@
---
title: Build Linux image
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Build a debuggable kernel image

In this learning path we use [Buildroot](https://github.com/buildroot/buildroot) to build a Linux image for the Raspberry Pi 3B+ with a debuggable Linux kernel. We will profile both Linux kernel modules built out-of-tree and Linux device drivers built in the Linux source code tree.

1. Clone the Buildroot Repository and initialize the build system with the default configurations.

```bash
git clone https://github.com/buildroot/buildroot.git
cd buildroot
make raspberrypi3_64_defconfig
make menuconfig
make -j$(nproc)
```

2. Change Buildroot configurations to enable debugging symbols and SSH access.

```plaintext
Build options --->
[*] build packages with debugging symbols
gcc debug level (debug level 3)
[*] build packages with runtime debugging info
gcc optimization level (optimize for debugging) --->
System configuration --->
[*] Enable root login with password
(****) Root password # Choose root password here
Kernel --->
Linux Kernel Tools --->
[*] perf
Target packages --->
Networking applications --->
[*] openssh
[*] server
[*] key utilities
```

You might also need to change the default `sshd_config` file to match your network settings. To do that, set **System configuration → Root filesystem overlay directories** to a directory that contains your modified `sshd_config` file.

3. By default, the Linux kernel image is stripped of debug information. Because we need that information for profiling later, enable it in the kernel configuration:

```bash
make linux-menuconfig
```
```plaintext
Kernel hacking --->
-*- Kernel debugging
Compile-time checks and compiler options --->
Debug information (Rely on the toolchain's implicit default DWARF version)
[ ] Reduce debugging information #un-check
```

4. Now we can build the Linux image and flash it to the SD card to run it on the Raspberry Pi.

```bash
make -j$(nproc)
```

Building the Linux image takes some time. When the build completes, the output is in `<buildroot dir>/output/images/sdcard.img`.

For details on flashing the SD card image, see [this helpful article](https://www.ev3dev.org/docs/tutorials/writing-sd-card-image-ubuntu-disk-image-writer/).

Now that we have a target running Linux with a debuggable kernel image, we can start writing the kernel module that we want to profile.
@@ -0,0 +1,252 @@
---
title: Build out-of-tree kernel module
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Creating the Linux Kernel Module

We will now create an example Linux kernel module (a character device) that demonstrates a cache-miss issue caused by traversing a 2D array in column-major order. This access pattern is not cache-friendly, because each iteration jumps to a different row and skips over the neighboring elements in memory.

To build the Linux kernel module, start by creating a new directory, which we will call **example_module**, in any location of your choice. Inside this directory, add two files: `mychardrv.c` and `Makefile`.

**Makefile**

```makefile
obj-m += mychardrv.o
# Change this to your Buildroot output directory.
# Keep this comment on its own line: a trailing comment would leave a space at the end of the value.
BUILDROOT_OUT := /opt/rpi-linux/buildroot/output
KDIR := $(BUILDROOT_OUT)/build/linux-custom
CROSS_COMPILE := $(BUILDROOT_OUT)/host/bin/aarch64-buildroot-linux-gnu-
ARCH := arm64

all:
	$(MAKE) -C $(KDIR) M=$(PWD) ARCH=$(ARCH) CROSS_COMPILE=$(CROSS_COMPILE) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean
```

{{% notice Note %}}
Change **BUILDROOT_OUT** to the correct Buildroot output directory on your host machine.
{{% /notice %}}

**mychardrv.c**

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

// Using fixed major and minor numbers just for demonstration purposes.
// Major number 42 is for demo/sample uses according to
// https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
#define MAJOR_VERSION_NUM 42
#define MINOR_VERSION_NUM 0
#define MODULE_NAME "mychardrv"
#define MAX_INPUT_LEN 64

static struct cdev my_char_dev;

/**
* @brief Traverse a 2D matrix and calculate the sum of its elements.
*
* @size: The size of the matrix (number of rows and columns).
*
* This function allocates a 2D matrix of integers, initializes it with the sum
* of its indices, and then calculates the sum of its elements by accessing them
* in a cache-unfriendly column-major order.
*
* Return: 0 on success, or -ENOMEM if memory allocation fails.
*/
static int char_dev_cache_traverse(long size) {
int i, j;
long sum = 0;

int **matrix;

// Allocate rows
matrix = kmalloc_array(size, sizeof(int *), GFP_KERNEL);
if (!matrix)
return -ENOMEM;

// Allocate columns and initialize matrix
for (i = 0; i < size; i++) {
matrix[i] = kmalloc_array(size, sizeof(int), GFP_KERNEL);
if (!matrix[i]) {
for (int n = 0; n < i; n++) {
kfree(matrix[n]);
}
kfree(matrix);
return -ENOMEM;
}

for (j = 0; j < size; j++)
matrix[i][j] = i + j;
}

// Access in cache-UNFRIENDLY column-major order
for (j = 0; j < size; j++) {
for (i = 0; i < size; i++) {
sum += matrix[i][j];
}
}

pr_info("Sum: %ld\n", sum);

// Free memory
for (i = 0; i < size; i++)
kfree(matrix[i]);
kfree(matrix);

return 0;
}

/**
* @brief Reads the matrix size, as a decimal string, from user space.
*
*/
static ssize_t char_dev_write(struct file *file, const char __user *buff,
size_t length, loff_t *offset) {
(void)file;
(void)offset;

ssize_t ret = 0;
char *kbuf;
long size_value;

// Reject input that would not fit in the buffer with its terminating NUL
if (length >= MAX_INPUT_LEN)
return -EINVAL;

// Allocate kernel buffer
kbuf = kmalloc(MAX_INPUT_LEN, GFP_KERNEL);
if (!kbuf)
return -ENOMEM;

// Copy data from user space to kernel space
if (copy_from_user(kbuf, buff, length)) {
ret = -EFAULT;
goto out;
}
kbuf[length] = '\0';

// Convert string to long (Base 10)
ret = kstrtol(kbuf, 10, &size_value);
if (ret)
goto out;

// Call cache traversal function
ret = char_dev_cache_traverse(size_value);
if (ret)
goto out;

ret = length;

out:
kfree(kbuf);
return ret;
}

static int char_dev_open(struct inode *node, struct file *file) {
(void)file;
pr_info("%s is open - Major(%d) Minor(%d)\n", MODULE_NAME,
MAJOR(node->i_rdev), MINOR(node->i_rdev));
return 0;
}

static int char_dev_release(struct inode *node, struct file *file) {
(void)file;
pr_info("%s is released - Major(%d) Minor(%d)\n", MODULE_NAME,
MAJOR(node->i_rdev), MINOR(node->i_rdev));
return 0;
}

// File operations structure
static const struct file_operations dev_fops = {.owner = THIS_MODULE,
.open = char_dev_open,
.release = char_dev_release,
.write = char_dev_write};

static int __init char_dev_init(void) {
int ret;
// Register the fixed device number (major 42, minor 0)
ret = register_chrdev_region(MKDEV(MAJOR_VERSION_NUM, MINOR_VERSION_NUM), 1,
MODULE_NAME);
if (ret < 0)
return ret;

// Initialize cdev structure and add it to kernel
cdev_init(&my_char_dev, &dev_fops);
ret = cdev_add(&my_char_dev, MKDEV(MAJOR_VERSION_NUM, MINOR_VERSION_NUM), 1);

if (ret < 0) {
unregister_chrdev_region(MKDEV(MAJOR_VERSION_NUM, MINOR_VERSION_NUM), 1);
return ret;
}

return ret;
}

static void __exit char_dev_exit(void) {
cdev_del(&my_char_dev);
unregister_chrdev_region(MKDEV(MAJOR_VERSION_NUM, MINOR_VERSION_NUM), 1);
}

module_init(char_dev_init);
module_exit(char_dev_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Yahya Abouelseoud");
MODULE_DESCRIPTION("A simple char driver with cache misses issue");
```

The module above receives the size of a 2D array as a string through the `char_dev_write()` function, converts it to an integer, and passes it to the `char_dev_cache_traverse()` function. This function then creates the 2D array, initializes it with simple data, traverses it in a column-major (cache-unfriendly) order, computes the sum of its elements, and prints the result to the kernel log.
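
To make the difference between the two access patterns concrete, here is a minimal user-space sketch rather than kernel code. It assumes a single contiguous allocation instead of the per-row `kmalloc_array()` calls used in the module, and the helper names `matrix_create`, `sum_column_major`, and `sum_row_major` are hypothetical. Both traversals compute the same sum; only the order of memory accesses differs.

```c
#include <stdlib.h>

/* Allocate an n x n matrix as one contiguous block and fill it with i + j.
 * (Assumption: contiguous storage, unlike the per-row allocation in the module.) */
static int *matrix_create(size_t n)
{
    int *m = malloc(n * n * sizeof(*m));

    if (!m)
        return NULL;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            m[i * n + j] = (int)(i + j);
    return m;
}

/* Column-major walk: consecutive accesses are n elements apart in memory. */
static long sum_column_major(const int *m, size_t n)
{
    long sum = 0;

    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += m[i * n + j];
    return sum;
}

/* Row-major walk: consecutive accesses touch adjacent elements in memory. */
static long sum_row_major(const int *m, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += m[i * n + j];
    return sum;
}
```

In the kernel module, the equivalent change would be to swap the two `for` loops in the summation so that `i` is the outer index. How much this helps depends on the matrix size and the cache hierarchy of the target.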

## Building and Running the Kernel Module

1. To compile the kernel module, run `make` inside the `example_module` directory. This generates the output file `mychardrv.ko`.

2. Transfer the `.ko` file to the target using the `scp` command. In the following steps, we insert the module with `insmod`, create a character device node with `mknod`, and test the module by writing a size value (for example, 10000) to the device file while measuring the elapsed time with the `time` command.

```bash
scp mychardrv.ko root@<target-ip>:/root/
```

{{% notice Note %}}
Replace \<target-ip> with your own target IP address.
{{% /notice %}}

3. Log in to the target, then insert the module and create its device node:

```bash
ssh root@<target-ip>

# Run the following commands on the target device

insmod /root/mychardrv.ko
mknod /dev/mychardrv c 42 0
```

{{% notice Note %}}
42 and 0 are the major and minor numbers we chose in the module code above.
{{% /notice %}}

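4. The module logs a message each time the device is opened. After the device has been opened, for example by the write in the next step, running `dmesg` shows something like: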

```log
[12381.654983] mychardrv is open - Major(42) Minor(0)
```

5. To make sure it's working as expected you can use the following command:

```bash { output_lines = "2-4" }
time echo '10000' > /dev/mychardrv
# real 0m 38.04s
# user 0m 0.00s
# sys 0m 38.03s
```

The command above passes 10000 to the module, which specifies the size of the 2D array to be created and traversed. The **echo** command takes a long time to complete (around 38 seconds) due to the cache-unfriendly traversal implemented in the `char_dev_cache_traverse()` function.
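
The `echo` redirection above amounts to opening the device node and writing a decimal string followed by a newline. As a rough sketch of the same operation from a small user-space C program, the following helper could be used. The name `write_matrix_size` is hypothetical, and the path is passed in as a parameter so the helper can also be tried against an ordinary file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write `size` as a decimal string plus a newline to `path`.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t write_matrix_size(const char *path, long size)
{
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%ld\n", size);

    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* For the character device, this call returns only after the
     * module has finished traversing the matrix. */
    ssize_t written = write(fd, buf, (size_t)len);

    close(fd);
    return written;
}
```

On the target, calling `write_matrix_size("/dev/mychardrv", 10000)` from a `main()` function should trigger the same traversal as the `echo` command, assuming the device node was created as shown in the earlier step.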

With the kernel module built, the next step is to profile it using Arm Streamline. We will use it to capture runtime behavior, highlight performance bottlenecks, and help identify issues such as the cache-unfriendly traversal in our module.