SRIOV

On the compute host

1) Enable VT-d in BIOS

2) Check the kernel config (/boot/config-`uname -r`). The following options should be set (=y or =m):
CONFIG_PCI_IOV=y
CONFIG_PCI_STUB=m
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m
CONFIG_INTEL_IOMMU_DEFAULT_ON (if this is not set, intel_iommu=on must be passed on the kernel command line, as in step 3)
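A quick way to verify them all at once (a sketch; extend the grep pattern as needed):

# grep -E 'CONFIG_PCI_IOV|CONFIG_PCI_STUB|CONFIG_VFIO|CONFIG_INTEL_IOMMU' /boot/config-`uname -r`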

3) Enable intel_iommu=on in the bootloader.
Edit /etc/default/grub to add intel_iommu=on to the kernel command line, then reboot the node.

GRUB_CMDLINE_LINUX="console=ttyS0,9600 console=tty0 rootdelay=90 nomodeset intel_iommu=on"

# update-grub    # regenerates /boot/grub/grub.cfg
# reboot

4) On kernels newer than 3.7, create the VFs through sysfs.
To find the bus number (82 in this example), use lspci -n | grep <vendor:device>. So far the IDs seen are 8086:154d and 8086:10fb; there might be more.
#echo 4 > /sys/bus/pci/devices/0000:82:…/sriov_numvfs
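A consolidated sketch of this step, assuming a hypothetical PF at address 0000:82:00.0 with device ID 8086:154d (adjust for your NIC):

# lspci -n | grep 8086:154d                                  # find the PF bus address
# echo 4 > /sys/bus/pci/devices/0000:82:00.0/sriov_numvfs    # create 4 VFs
# ls -l /sys/bus/pci/devices/0000:82:00.0/ | grep virtfn     # new VFs appear as virtfnN links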

5) Run lspci -v to check that the VFs are visible on the compute host, then load pci_stub:
#modprobe pci_stub

readlink …/driver: pci_stub means the VF is configured for KVM passthrough to a VM; ixgbevf means no one is using it
…/net: shows the eth name of the VF
ls -l …/physfn: points back to the parent NIC (the PF)
…/numa_node: gives the socket number
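A small inspection sketch, using a hypothetical VF address 0000:82:10.0:

# VF=0000:82:10.0                              # hypothetical VF address
# readlink /sys/bus/pci/devices/$VF/driver     # pci_stub = assigned to a VM, ixgbevf = unused
# ls /sys/bus/pci/devices/$VF/net              # ethN name (only while bound to ixgbevf)
# ls -l /sys/bus/pci/devices/$VF/physfn        # symlink back to the parent PF
# cat /sys/bus/pci/devices/$VF/numa_node       # NUMA node / socket number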

#modprobe kvm
#modprobe kvm_intel

Add pci_stub, kvm, and kvm_intel to /etc/modules so they are loaded at boot.


Kernel-based Virtual Machines (KVM) with PCI passthrough

This note describes how to use KVM (Kernel-based Virtual Machine), and its PCI passthrough capability (where a PCI device can be assigned to a virtual machine).

PCI passthrough (the '-pcidevice' option) is supported from KVM-79 onward. The current top-of-tree is KVM-84. As the KVM-79 release notes indicate, a 2.6.28 kernel is required:

http://www.linux-kvm.com/content/kvm-79-released-pci-device-assignment-pci-device-hot-plug

There are three components to KVM:

  • The 2.6.28 kernel and the kvm.ko loadable module.
  • The architecture-specific KVM loadable module (kvm-intel.ko or kvm-amd.ko).
  • The KVM-modified QEMU userspace binary. At some point, the KVM changes will be merged into QEMU development.

For our Fedora 10 systems, I downloaded a prebuilt 2.6.28 kernel from here:

http://koji.fedoraproject.org/koji/buildinfo?buildID=79697

Note: although the release notes say that a 2.6.28 kernel is required for KVM, a 2.6.29 kernel is required for MSI support (which we are interested in). See the 'Virtualization' section of the 2.6.29 release notes:

http://kernelnewbies.org/Linux_2_6_29

After installing the kernel, I downloaded KVM-84 from here:

http://kvm.qumranet.com/kvmwiki/Downloads

There does exist a HOWTO document on this wiki page for installing and running KVM, but I found that their instructions for creating a root filesystem for your virtual machine were incorrect. The steps I followed are described here.

After unpacking the tarball, I did the following:

user% cd kvm-84
user% ./configure --prefix=/usr/local/kvm
user% make
Setting '--prefix' specifies that the qemu binaries will be installed under /usr/local/kvm/bin. You may need to install zlib, SDL, and kernel-devel on your system in order for the make to succeed.

Then, as root:

root% make install
root% modprobe kvm-intel (or kvm-amd if you are running on an AMD system).
You can verify the module is loaded:

% lsmod | grep kvm
kvm_intel 56424 0
kvm 164256 1 kvm_intel

Your CPU must support hardware virtualization (Intel VT-x or AMD SVM) for KVM to work; Intel VT-d (or an AMD IOMMU) is additionally required for PCI passthrough.

Now, we have to create a root filesystem. The simplest method is to use dd to create a sparse file:

% dd if=/dev/zero of=vdisk.img bs=1 seek=10G count=0
There does also exist a qemu-img tool, which can generate a QCOW-format disk image, but I did not try it; the sparse file worked fine for me.
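If you do want to try qemu-img, a minimal sketch (not what I used) would be:

% qemu-img create -f qcow2 vdisk.qcow2 10G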

Then, you can invoke the emulator to install an OS. Download a .iso for an OS to your system. Review the list of supported guest OSes here:

http://kvm.qumranet.com/kvmwiki/Guest_Support_Status

I used Debian Lenny on my system. Install the OS as follows:

root% /usr/local/kvm/bin/qemu-system-x86_64 -hda vdisk.img -cdrom ./debian-500-i386-netinst.iso -boot d -m 384

After completing the OS install, you can run the virtual machine as follows:

root% /usr/local/kvm/bin/qemu-system-x86_64 vdisk.img -m 384

'-m' specifies the amount of memory (in megabytes) given to the virtual machine. Note that you must be root in order to run qemu (due to permissions for accessing /dev/kvm and other files).

You can run QEMU from the vga console (through the ‘other’ KVM) as root, or through an ssh session with X forwarding:

% ssh -X ${REMOTE_HOST}
When I attempted to run QEMU from within a VNC session as root, I got the following error:

Invalid MIT-MAGIC-COOKIE-1 key
Invalid MIT-MAGIC-COOKIE-1 key
Could not initialize SDL - exiting
Sometimes I see X11 forwarding fail when running as root/sudo on the server (you 'ssh -X' to the machine as a regular user, but then switch to root to run qemu). One fix I've discovered is to copy ${HOME}/.Xauthority to /root; then you can forward X windows while running as root.
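A minimal sketch of that workaround (assuming the X cookie lives in your regular user's ${HOME}):

% ssh -X ${REMOTE_HOST}
% sudo cp ${HOME}/.Xauthority /root/
% sudo /usr/local/kvm/bin/qemu-system-x86_64 vdisk.img -m 384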

In order to boot a virtual machine with a PCI device assigned, first obtain the ID of your device (in the form "bus:device.function"). lspci will display PCI device information:

% lspci
% lspci -v (verbose mode)
% lspci -t (device IDs are displayed in tree form)
For example, if our device was assigned ID 02:00.0, we invoke qemu as follows:

root% /usr/local/kvm/bin/qemu-system-x86_64 vdisk.img -m 384 -pcidevice host=02:00.0
PCI passthrough has a limitation where the device to be passed must use a non-shared IRQ (we cannot share an IRQ between both the host and the guest), or the device must support MSI. In my system, I attempted to passthrough a Xilinx FPGA and both ports of a dual ethernet card.

% lspci
[…]
02:00.0 Ethernet controller: Intel Corporation Device 10c9 (rev 01)
02:00.1 Ethernet controller: Intel Corporation Device 10c9 (rev 01)
03:00.0 Memory controller: Xilinx Corporation Device 0007
[…]
Unfortunately, I was only able to passthrough the second ethernet port (02:00.1). ‘lspci -v’ showed us the IRQ information:

02:00.0 Ethernet controller: Intel Corporation Device 10c9 (rev 01)
Subsystem: Intel Corporation Device a03c
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at e1820000 (32-bit, non-prefetchable) [size=128K]
Memory at e1400000 (32-bit, non-prefetchable) [size=4M]
I/O ports at 3020 [size=32]
Memory at e18c4000 (32-bit, non-prefetchable) [size=16K]
Expansion ROM at e2000000 [disabled] [size=4M]
Capabilities:
Kernel driver in use: igb
Kernel modules: igb

02:00.1 Ethernet controller: Intel Corporation Device 10c9 (rev 01)
Subsystem: Intel Corporation Device a03c
Flags: fast devsel, IRQ 17
Memory at e1800000 (32-bit, non-prefetchable) [size=128K]
Memory at e1000000 (32-bit, non-prefetchable) [size=4M]
I/O ports at 3000 [size=32]
Memory at e18c0000 (32-bit, non-prefetchable) [size=16K]
Expansion ROM at e2400000 [disabled] [size=4M]
Capabilities:
Kernel modules: igb
03:00.0 Memory controller: Xilinx Corporation Device 0007
Subsystem: Xilinx Corporation Device 0007
Flags: bus master, fast devsel, latency 0, IRQ 10
Memory at e1b00000 (64-bit, non-prefetchable) [size=1K]
Capabilities:

The first ethernet port and the Xilinx FPGA shared IRQ 16 along with the USB hub device and the SATA controller. The second ethernet port used IRQ 17, which was not shared with any other device. None of the devices support MSI, so only the second ethernet port was assignable to the virtual machine. On some systems it may be possible to reassign IRQs in the BIOS, but our xc5-pc* machines do not support that functionality. Sometimes changing the PCI slot of the device can also change the IRQ assignment, but again, that is not the case with our machines: both PCIe slots share the same IRQ.
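To see at a glance which devices end up on the same interrupt line, a quick sysfs sketch (standard paths, nothing machine-specific assumed):

% for d in /sys/bus/pci/devices/*; do echo "$(basename $d) irq $(cat $d/irq)"; done | sort -k3 -n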

When we assigned the second ethernet port to the virtual machine, we saw the following in the virtual machine:

% lspci
[…]
00:06.0 Ethernet controller: Intel Corporation Device 10c9 (rev 01)
[…]


VM Performance tuning

Overview
Here is a description of how to reduce disturbance of the vswitch and of the virtual cores that handle dpdk. The disturbance can be:

HW interrupts in the host machine, or virtual HW interrupts in the guest machine.
Several virtual cores competing for the same physical core where dpdk executes. This means:
– two or more virtual cores execute on the same physical core, or

– two or more virtual cores execute on the corresponding hyperthread.

The virtual cores that execute dpdk should/must execute on the same socket as the vswitch.
No vmexit should occur on a core that executes dpdk; each vmexit causes the virtual core to be halted for a couple of microseconds.
CPU Cores
Cores 0,2,4,… belong to socket 0.

Cores 1,3,5,7,… belong to socket 1.

Cores 0 and 20 are hyperthread siblings, 2 and 22 are siblings, and so on.

root@node-3:~# cat /proc/cpuinfo | grep -E 'processor|physical id|core id' | more

processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 1
core id : 0
processor : 2
physical id : 0
core id : 1
processor : 3
physical id : 1
core id : 1

….

processor : 20
physical id : 0
core id : 0
processor : 21
physical id : 1
core id : 0
processor : 22
physical id : 0
core id : 1
processor : 23
physical id : 1
core id : 1

HW interrupts
The /proc/irq/<irq number>/smp_affinity file contains a CPU mask for HW interrupts. In our case we must make sure that the vswitch and dpdk cores do not get any HW interrupts. Assume that we run a dual-socket Ivy Bridge machine with 40 cores in total. We assign core 0 and core 20 (the hyperthread sibling of core 0) to handle HW interrupts. I suggest putting all the settings in a script file. This needs to be configured on all compute host machines.

#!/bin/sh

echo stop irqbalance
/etc/init.d/irqbalance stop
echo "Set HW interrupts to core 0 and 20 (hyperthread)"
dev=`grep 'eth' /proc/interrupts | cut -c1-4 | awk '{print $1}'`
echo $dev
for di in $dev; do
    echo 0001 > /proc/irq/$di/smp_affinity
done
Check with cat /proc/interrupts that all interrupts occur on core 0.
Core isolation
Make sure that virtual machines do not use cores 0 and 2. This means that 2 cores are allocated for the infrastructure. Not sure that this is needed, but it keeps our test simpler. Not sure either how many virtual cores are used for dpdk; assume therefore 4 cores as an example. Allocate cores 4,6,8,10 for the virtual cores running dpdk.

1) Set core affinity for all qemu threads so that they do not execute on cores 0,2,4,6,8,10 and 20,22,24,26,28,30 (the corresponding hyperthreads).

#! /bin/sh
pids=`pgrep qemu`
for pid in $pids; do
    threads=`ls /proc/$pid/task/`
    for thread in $threads; do
        taskset -cp 11-19,31-39,1,3,5,7,9,21,23,25,27,29 $thread
    done
done
2) Identify which qemu threads handle dpdk.
Stop traffic, run top on the host machine and press H to see the load on each thread. The qemu threads executing at 100% load are our virtual cores that execute dpdk. Note the pid of each qemu thread that executes dpdk.
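The same information can be obtained non-interactively; a sketch using per-thread CPU columns from ps:

ps -eLo pid,tid,pcpu,comm --sort=-pcpu | grep qemu | head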

3) Set affinity on the qemu threads that execute dpdk:

taskset -cp 4 <pid of first dpdk thread>
taskset -cp 6 <pid of second dpdk thread>
……

Cores 0 and 2 are now guaranteed not to be disturbed by virtual machines. Cores 4,6,8,10 are now reserved for dpdk in the guest machine, and no virtual cores will be allocated on the corresponding hyperthreads (cores 20,22,24,26,28,30). Further, dpdk now executes on socket 0.
Check that only a limited number of vmexits occur on the dpdk cores
Use ftrace to measure kvm events in the host machine:

#!/bin/bash

echo Clear trace buffer
echo 1 > /sys/kernel/debug/tracing/free_buffer

echo "set cpu mask for cores 4,6,8,10,12: 0x1550 (binary 0001010101010000)"
echo 1550 > /sys/kernel/debug/tracing/tracing_cpumask

echo increase trace buffer
echo 20000 > /sys/kernel/debug/tracing/buffer_size_kb

echo start trace
echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
sleep 1
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/kernel/debug/tracing/events/timer/enable
echo 0 > /sys/kernel/debug/tracing/events/kvm/enable

cat /sys/kernel/debug/tracing/trace_pipe > x
echo trace ended

sync

sleep 5

echo start analyze
grep -v start_test x > xx

events=`egrep -c kvm xx`

kvmevent=`awk '{print $5}' xx | sort -u`

func=`awk '{print $1}' xx | sort -u`

for o in $kvmevent; do
    val=`grep -c $o xx`
    printf "%26s %7d\n" $o $val
done

printf "%20s %7d\n" "Total events:" $events

echo "Functions"
for o in $func; do
    val=`grep -c $o xx`
    printf "%26s %7d\n" $o $val
done

VM exits cannot be avoided entirely, but they should be kept to a minimum. With no traffic the number of kvm_exit events should be around 100, and it must not increase when traffic starts.


Linux Device Driver Initialization

This article is a brief introduction to Linux PCI device driver initialization and its probing mechanism.

Introduction:

In Linux 3.5, a PCI device driver is either statically linked into the kernel or, more typically, loaded into the kernel as a loadable kernel module (LKM). When the module is registered with the kernel, it executes a callback function called xyz_init_module(). The driver's probe function xyz_probe() is called indirectly when the driver's xyz_init_module() function registers with the PCI bus, provided the required conditions are met in the kernel core layers and the PCI bus has been configured successfully.

The PCI initialization framework must adhere to the three logical sections mentioned below:

  • The device driver should provide a module initialization routine, xyz_init_module().
  • Secondly, it should provide a device initialization routine, xyz_probe(), also referred to as the driver probe method.
  • Third, the module initialization routine should call pci_register_driver() in order to register the PCI driver with the PCI core.

Note: I am referring to the Intel 10 Gigabit Ethernet driver (drivers/net/ethernet/intel/ixgbe/). All functions discussed are from ixgbe_main.c. The xyz string is replaced with ixgbe, so the module and device initialization routines are ixgbe_init_module() and ixgbe_probe() respectively.

Driver Register:

The ixgbe driver's ixgbe_init_module() function calls pci_register_driver(struct pci_driver *drv), passing a reference to a structure of type pci_driver. struct pci_driver is an important structure that all PCI drivers must have; it is initialized with the driver's name, a table of the PCI devices the driver can support, and callback routines for the PCI core subsystem.

The pci_driver structure, declared for each PCI driver:

struct pci_driver {
    struct list_head node;
    const char *name;
    const struct pci_device_id *id_table;   /* must be non-NULL for probe to be called */
    int (*probe)(struct pci_dev *dev, const struct pci_device_id *id); /* New device inserted */
    void (*remove)(struct pci_dev *dev);    /* Device removed (NULL if not a hot-plug capable driver) */
    int (*suspend)(struct pci_dev *dev, pm_message_t state); /* Device suspended */
    int (*suspend_late)(struct pci_dev *dev, pm_message_t state);
    int (*resume_early)(struct pci_dev *dev);
    int (*resume)(struct pci_dev *dev);     /* Device woken up */
    void (*shutdown)(struct pci_dev *dev);
    struct pci_error_handlers *err_handler;
    struct device_driver driver;
    struct pci_dynids dynids;
};

The driver's pci_driver structure has the important member fields listed below:

  • name – the name of the driver, which must be unique among all PCI drivers in the kernel. It appears under /sys/bus/pci/drivers (see the quick check after this list).
  • id_table (pci_device_id) – a table of device identification data listing the types of chips this driver supports.
  • probe – the address of the ixgbe_probe() function.
  • remove/suspend/resume/shutdown – addresses of the functions that the PCI core calls when the PCI device is removed/suspended/resumed/shut down respectively. Generally used by upper layers for power management.
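For example, once the ixgbe module is loaded, the registered driver name and the devices bound to it can be inspected from sysfs (a sketch; the output depends on your hardware):

% ls /sys/bus/pci/drivers/ixgbe      # bound devices appear as 0000:bb:dd.f symlinks
% modinfo -d ixgbe                   # one-line module description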

The pci_driver structure for the ixgbe driver is created by initializing its fields as below:

static struct pci_driver ixgbe_driver = {
    .name = ixgbe_driver_name,
    .id_table = ixgbe_pci_tbl,
    .probe = ixgbe_probe,
    .remove = __devexit_p(ixgbe_remove),
#ifdef CONFIG_PM
    .suspend = ixgbe_suspend,
    .resume = ixgbe_resume,
#endif
    .shutdown = ixgbe_shutdown,
    .err_handler = &ixgbe_err_handler
};

Hence, to register ixgbe's struct pci_driver with the PCI layer, the call pci_register_driver(&ixgbe_driver) initiates probing for the device in the underlying PCI core. The function returns 0 on success and a negative number on failure.

Probing:

Now let's understand how probing of the ixgbe driver takes place, through the sequence of function calls into the PCI core when the module initialization routine ixgbe_init_module() is called for the first time as the ixgbe driver is loaded.

  • ixgbe_init_module() calls pci_register_driver(&ixgbe_driver)
  • pci_register_driver() calls __pci_register_driver()
  • __pci_register_driver() binds the driver to PCI bus pci_bus_type
  • __pci_register_driver() calls driver_register()
  • driver_register() calls driver_find() to check if the driver is already registered.
  • If the driver is being registered for the first time, driver_register() calls bus_add_driver()
  • bus_add_driver() calls driver_attach()
  • The driver_attach() function tries to bind the driver to a device. It calls bus_for_each_device(), which invokes __driver_attach() as a callback for each device.
  • __driver_attach() calls driver_probe_device() only if the device doesn’t have a driver yet.
  • driver_probe_device() calls really_probe()
  • really_probe() calls the probe method of the PCI bus object, which is pci_device_probe()
  • pci_device_probe() calls __pci_device_probe()
  • __pci_device_probe() checks for a PCI device for this driver, calls pci_call_probe()
  • pci_call_probe() calls local_pci_probe()
  • local_pci_probe() calls probe method ixgbe_probe() for ixgbe.

Note: the above call graph may not be the same when no matching PCI device is plugged in.
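After a successful probe, the binding can also be confirmed from user space; a sketch, with 0000:02:00.0 as a hypothetical device address:

% lspci -k -s 02:00.0                                 # shows "Kernel driver in use: ixgbe"
% readlink /sys/bus/pci/devices/0000:02:00.0/driver   # points to .../bus/pci/drivers/ixgbe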

Conclusions:

I've presented both an abstract and a concrete view of Linux PCI network device driver initialization and its method of probing, considering Intel's ixgbe driver. We've gone through how a PCI driver is registered, right from module initialization through device initialization. In particular, the device probing call trace is described for a driver deployed as a loadable kernel module. In the next article we will cover what exactly the driver probe function sets up in order to achieve network packet processing.

The hyperlinks provided above are based on Linux 3.5.

 References:

  1. Chapter 12, LDD3 
  2. Chapter 14, LDD3
  3. Linux PCI Init

PCI Fundamentals

PCI Peripheral Addressing:

Usually a PCI peripheral device is identified by the numbers below:

  1. Bus number – 8 bits
  2. Device number – 5 bits
  3. Function number – 3 bits

root@linux:/root> lspci | grep PCI
80:03.2 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3c (rev 05)

The 16-bit hardware address of the PCI peripheral device can be derived from the lspci output above as shown:

8 bit bus number: 1000 0000
5 bit device number: 0 0011
3 bit function number: 010

Collectively the 16-bit HW address becomes 1000 0000 0001 1010 in binary, which is 0x801a in hexadecimal.
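The same packing can be checked with shell arithmetic, using the bus (0x80), device (0x03) and function (2) numbers from the example above:

root@linux:/root> printf '%04x\n' $(( (0x80 << 8) | (0x03 << 3) | 2 ))
801a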

PCI Peripheral Memory & I/O-Registers:

A PCI peripheral device essentially has two kinds of address regions, one for memory space and one for I/O ports. Both are unique per PCI device and are described by the configuration registers of that particular PCI device.

The configuration space of a PCI device is 256 bytes, as shown below. Operating system device driver software has APIs to access the PCI config space and determine the amount of memory and I/O space needed by the device through its configuration registers.

Configuration space:

root@linux:/root> lspci -s 80:03.2 -x -v
80:03.2 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3c (rev 05) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=80, secondary=86, subordinate=86, sec-latency=0
I/O behind bridge: 0000d000-0000dfff
Memory behind bridge: fb000000-fb0fffff
Capabilities: [40] Subsystem: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3c
Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
Capabilities: [90] Express Root Port (Slot+), MSI 00
Capabilities: [e0] Power Management version 3
Capabilities: [100] Vendor Specific Information: ID=0002 Rev=0 Len=00c
Capabilities: [110] Access Control Services
Capabilities: [148] Advanced Error Reporting
Capabilities: [1d0] Vendor Specific Information: ID=0003 Rev=1 Len=00a
Capabilities: [250] #19
Capabilities: [280] Vendor Specific Information: ID=0004 Rev=2 Len=018
Kernel driver in use: pcieport
00: 86 80 0a 3c 07 04 10 00 05 00 04 06 10 00 81 00
10: 00 00 00 00 00 00 00 00 80 86 86 00 d0 d0 00 00
20: 00 fb 00 fb f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 10 00

Accessing PCI Configuration space:

The Linux kernel provides interfaces to access the configuration registers through two low-level PCI read/write operations:

/* Low-level architecture-dependent routines */
struct pci_ops {
    int (*read)(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *val);
    int (*write)(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 val);
};

Kernel software uses the functions below in order to read the config space:

pci_read_config_byte(const struct pci_dev *dev, int where, u8 *val);
pci_read_config_word(const struct pci_dev *dev, int where, u16 *val);
pci_read_config_dword(const struct pci_dev *dev, int where, u32 *val);
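From user space, the same 256-byte configuration header can be dumped through sysfs; a sketch using the bridge from the example above:

root@linux:/root> hexdump -C /sys/bus/pci/devices/0000:80:03.2/config | head -4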

All PCI buses are recognized and set up at system boot time. Each bus is associated with a pci_bus structure; one member of interest is struct pci_ops, through which the Linux kernel performs the PCI read/write operations.

Secondly, the device's memory and I/O resources are enabled by the subroutine pci_enable_device() in Linux. Each PCI device has up to six Base Address Registers (BARs), which describe its memory-mapped and I/O port regions.
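The decoded BARs (and, for a bridge, its windows) can likewise be read from sysfs as start/end/flags triples; a sketch for the same example device:

root@linux:/root> cat /sys/bus/pci/devices/0000:80:03.2/resource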


Shared libraries on 64 Bit Linux system – Why huge Virtual Memory

For the same program compiled as both 32-bit and 64-bit, the virtual size of the 64-bit process is larger than that of the 32-bit one. To view a process's virtual memory mappings, read /proc/<pid>/maps. The output shows that the pages occupied by dynamic libraries are huge when compared to the 32-bit process.
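For example, assuming the 'crash' binary used later in this post is running, its libc mappings can be listed with:

$ grep libc /proc/`pidof crash`/maps
$ pmap -x `pidof crash` | grep libc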

So what makes 64-bit dynamic libraries take additional memory compared to 32-bit ones? Let's study the difference by taking the loading of libc.so as an example and looking at how the loader maps dynamic libraries. Below are strace outputs for both the 32-bit and 64-bit executables, which show the calls to mmap and mprotect.

The Linux strace output for the 64-bit and 32-bit process respectively is shown below:

esunboj@L9AGC12:~/32_64bit$ strace ./crash-x86-64
...
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3>\1\200\30\2"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1811128, ...}) = 0
mmap(NULL, 3925208, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =    
0x7fa354f8a000
mprotect(0x7fa35513f000, 2093056, PROT_NONE) = 0
mmap(0x7fa35533e000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE,
3, 0x1b4000) = 0x7fa35533e000
mmap(0x7fa355344000, 17624, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS,
-1, 0) = 0x7fa355344000
close(3)                                = 0
...

esunboj@L9AGC12:~/32_64bit$ strace ./crash
...
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\3\1000\226\1004"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1730024, ...}) = 0
mmap2(NULL, 1743580, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xfffffffff7546000
mprotect(0xf76e9000, 4096, PROT_NONE)   = 0
mmap2(0xf76ea000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE,  
3, 0x1a3) = 0xfffffffff76ea000
mmap2(0xf76ed000, 10972, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 
-1, 0) = 0xfffffffff76ed000
close(3)                                = 0
...

Closely observing both strace outputs, two things need to be investigated:

1. Each of them maps memory (mmap) 3 times, with 1 call to mprotect right after the first mmap.

2. Comparing the mprotect calls, the 64-bit and 32-bit processes have 2093056 B and 4096 B of region protected respectively.

In dl-load.c, the subroutine _dl_map_object_from_fd() maps the dynamic library's memory segments into the virtual address space with the required permissions, zero-fills the library's .bss section, and updates the link map structure. Here is part of the code from dl-load.c for further analysis:

struct link_map *
 _dl_map_object_from_fd ( )
{
 ...
  /* Scan the program header table, collecting its load commands. */
  struct loadcmd
   {
     ElfW(Addr) mapstart, mapend, dataend, allocend;
     off_t mapoff;
     int prot;
   } loadcmds[l->l_phnum], *c; // l is the link_map struct describing each object of the dynamic linker
  size_t nloadcmds = 0;
  bool has_holes = false;
  ...
  for (ph = phdr; ph < &phdr[l->l_phnum]; ++ph)
  switch (ph->p_type)
  {
  ...
  case PT_LOAD:
  ...
    c = &loadcmds[nloadcmds++];
    c->mapstart = ph->p_vaddr & ~(GLRO(dl_pagesize) - 1);
    c->mapend = ((ph->p_vaddr + ph->p_filesz + GLRO(dl_pagesize) - 1)
                     & ~(GLRO(dl_pagesize) - 1));
  ...
    if (nloadcmds > 1 && c[-1].mapend != c->mapstart)
        has_holes = true;
  ...
  }
  ...
    if (has_holes)
       __mprotect ((caddr_t) (l->l_addr + c->mapend),
          loadcmds[nloadcmds - 1].mapstart - c->mapend, PROT_NONE);
  ...
}

In the above code, l_phnum, used in the for statement, holds the number of entries in the ELF program header table; ideally each iteration maps one segment. When the PT_LOAD case hits for the first time, it is basically the .text/.rodata segment, which gets mmapped (1st mmap in strace); the second PT_LOAD segment represents the .data section and gets mapped (2nd mmap in strace). Before the second PT_LOAD segment is mapped, mapstart and mapend are recorded, referring to the start and end of the text mapping. In the next PT_LOAD iteration, if the previous segment's mapend does not equal the current (.data) segment's mapstart, then there is a hole between the two PT_LOAD segments (a gap between the .text and .data sections). Therefore, if there is such a hole between the memory regions, the loader protects it with null permissions (the mprotect call in strace), making it inaccessible. The protected region is 511 pages for the 64-bit process versus just 1 page for the 32-bit one, which accounts for the huge extra memory of 64-bit libraries.

Proof of the 64-bit inaccessible region: the objdump of libc.so below gives the virtual address (VA) statistics, rounded to page boundaries, as follows:

                 PT_LOAD(1st)            PT_LOAD(2nd)
mapstart VA   0x0000000000000000     0x00000000003b4000
mapend   VA   0x00000000001b5000     0x00000000003ba000

Here PT_LOAD(1st) mapend (0x00000000001b5000) is not equal to PT_LOAD(2nd) mapstart (0x00000000003b4000), resulting in a memory hole of 0x00000000001ff000 (2093056 bytes in decimal).

esunboj@L9AGC12:~/32_64bit$objdump -x -s -d -D /lib/x86_64-linux-gnu/libc.so.6 
Program Header: 
...
  LOAD off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**21
       filesz 0x00000000001b411c memsz 0x00000000001b411c flags r-x
  LOAD off    0x00000000001b4700 vaddr 0x00000000003b4700 paddr 0x00000000003b4700 align 2**21
       filesz 0x0000000000005160 memsz 0x0000000000009dd8 flags rw- 
...

On top of that, 64-bit text takes more instruction bytes than 32-bit. Similarly, pointers on 64-bit are 8 bytes, adding 4 more bytes each, and data structures are 8-byte aligned on 64-bit, making the mapped regions larger.

A simple size command on the binaries shows the difference between the 32-bit and 64-bit programs' memory regions:

esunboj@L9AGC12:~/32_64bit$ ls -lrt
total 10368
-rwxrwxrwx 1 esunboj ei 5758776 Oct 10 11:35 crash-x86-64
-rwxrwxrwx 1 esunboj ei 4855676 Oct 10 11:36 crash
esunboj@L9AGC12:~/32_64bit$ size crash
   text    data     bss     dec     hex filename
4771286   82468  308704 5162458  4ec5da crash
esunboj@L9AGC12:~/32_64bit$ size crash-x86-64 
   text    data     bss     dec     hex filename
5634861  121164 1623728 7379753  709b29 crash-x86-64