Jump once for x86_64 PLT

DJ: Direct jump
IJ: Indirect jump
PLT: Produce linkage table


main (int argc, char *argv[])
    puts ("Hello");
    puts ("World");
    return 0;

x86_64 PLT(DJ + IJ)

    callq   puts@plt
    jmpq    *0x2fe2(%rip) # .got.plt
                          # 1st call: point to 1:
                          # non-1st call: point to puts:
    pushq   $0x0          # symbol index
    jmpq    _dl_runtime_resolve

1st call:
main -> puts@plt -> _dl_runtime_resolve -> puts

non-1st call:
main -> puts@plt -> puts

x86_64 PLT(IJ)

    callq   *0x2fe2(%rip) # .got.plt
                          # 1st call: point to puts@plt
                          # non-1st call: point to puts:
    pushq   $0x0          # symbol index
    jmpq    _dl_runtime_resolve

1st call:
main -> puts@plt -> _dl_runtime_resolve -> puts

non-1st call:
main -> puts


Run GNOME on NanoPi M4 (RK3399)


/usr/bin/gnome-shell: symbol lookup error: /lib/libmutter-4.so.0: undefined symbol: gbm_bo_get_offset

Arch Linux
Refer: https://archlinuxarm.org/platforms/armv8/generic

Mali GPU user space drivers
Drivers: https://github.com/heiher/libmali-rk3399

git clone https://github.com/heiher/libmali-rk3399
cd libmali-rk3399
# Build gbm wrapper
# Install to system
sudo cp conf/mali.conf /etc/ld.so.conf.d/
sudo cp -rd lib /usr/lib/mali
# Update ld.so cache
sudo ldconfig

Force use GLESv2

# /etc/clutter-1.0/settings.ini

Fix crash when maximizing window
Patch: https://gitlab.gnome.org/GNOME/mutter/merge_requests/515

GPU device access permission

# /etc/udev/rules.d/50-mali.rules 
KERNEL=="mali0", MODE="0666"
sudo chmod 0666 /dev/mali0

Start gdm.


Run QEMU with hardware virtualization on macOS

在macOS上通过虚拟机运行其它操作系统,又不想用商业软件,那么开源的QEMU是一个比较好的选择。QEMU的功能支持还是比较全面的,除了功能以外,使用虚拟机软件的用户最关心的就是性能了,一个好消息是macOS 10.10+版本已经引人了硬件虚拟化支持框架,也就是Hypervisor.framework,另一个好消息是QEMU也已支持该框架,也就是hvf accelerator。

1. macOS 10.10+
2. Macports

已经使用过的用户可能已经发现,QEMU使用hvf accelerator并开启多核是有问题的呀。的确,QEMU使用hvf accelerator以单核运行时没有问题,当使用-smp参数指定多核时,很大概率上虚拟机硬件初始化都完成不了就死机了。
不过,好消息是该问题也已经修复了,导致这个问题的原因是hvf accelerator代码设计没有考虑到虚拟机启动后所有hvf vcpu都在并行执行指令,其中包括硬件初始化的I/O模拟操作,多个CPU同时对同一硬件执行初始化显然是不行的。

Patch (已经合并上游社区)

Install QEMU

cd ~
git clone https://github.com/hevz/macports
sudo vim /opt/local/etc/macports/sources.conf
# Add local repositories
file:///Users/[YOUR USER NAME]/macports
rsync://rsync.macports.org/macports/release/tarballs/ports.tar [default]
cd ~/macports
sudo port install qemu

Run Arch Linux
1. 下载Arch Linux安装ISO镜像。
2. 创建一个虚拟机磁盘镜像。
3. 开始安装新的系统。
4. 启动安装后的系统。

mkdir ~/system/images
qemu-img create -f qcow2 ~/system/images/arch.qcow2 40G
qemu-system-x86_64 -no-user-config -nodefaults -show-cursor \
    -M pc-q35-3.1,accel=hvf,usb=off,vmport=off \
    -cpu host -smp 4,sockets=1,cores=2,threads=2 -m 4096 \
    -realtime mlock=off -rtc base=utc,driftfix=slew \
    -drive file=~/system/images/arch.qcow2,if=none,format=qcow2,id=disk0 \
    -device virtio-blk-pci,bus=pcie.0,addr=0x1,drive=disk0 \
    -netdev user,id=net0,hostfwd=tcp::2200-:22 \
    -device virtio-net-pci,netdev=net0,bus=pcie.0,addr=0x2 \
    -device virtio-keyboard-pci,bus=pcie.0,addr=0x3 \
    -device virtio-tablet-pci,bus=pcie.0,addr=0x4 \
    -device virtio-vga,bus=pcie.0,addr=0x5 \
    -cdrom ~/archlinux-2019.01.01-x86_64.iso -boot d



Reordering on an Alpha processor

A very non-intuitive property of the Alpha processor is that it allows the following behavior:

Initially: p = & x, x = 1, y = 0

    Thread 1         Thread 2
  y = 1         |    
  memoryBarrier |    i = *p
  p = & y       |
Can result in: i = 0

This behavior means that the reader needs to perform a memory barrier in lazy initialization idioms (e.g., Double-checked locking) and creates issues for synchronization-free immutable objects (e.g., ensuring. that other threads see the correct value for fields of a String object).

Kourosh Gharachorloo wrote a note explaining how it can actually happen on an Alpha multiprocessor:
The anomalous behavior is currently only possible on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.

Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which causes an “invalidate y” to be sent to P2. This invalidate goes into the incoming “probe queue” of P2; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on P2. The invalidate is acknowledged right away at this point (i.e., you don’t wait for it to actually invalidate the copy in P2’s cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for read p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, P2 can derefence P to read the old value of y that is sitting in its cache (the inval y in P2’s probe queue is still sitting there).

How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of P, you do an MB which pulls in the inval to y for sure. And you can no longer see the old cached value for y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you setup the caching properly, P2 will likely have ample opportunity to service the messages (i.e., inval) in its probe queue before it receives the data reply for “read p”. Nonetheless, if you get into a situation where you have placed many things in P2’s probe queue ahead of the inval to y, then it is possible that the reply to p comes back and bypasses this inval. It would be difficult for you to set up the scenario though and actually observe the anomaly.

The above addresses how current Alpha’s may violate what you have shown. Future Alpha’s can violate it due to other optimizations. One interesting optimization is value prediction.

From: http://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html


用龙芯EJTAG硬件断点优化Linux ptrace watch性能

在MIPS标准的协处理器0(CP0)中定义一组硬件watchpoints接口,由于某些原因,龙芯3系列处理器并未实现,这就导致了在该架构Linux系统中用gdb watch只能使用软件断点,真心非常、非常慢。:(

好消息是龙芯3系列处理器是实现了MIPS EJTAG的,兼容2.61标准,那么能否利用MIPS EJTAG的硬件断点功能部件实现Linux ptrace的watchpoints功能呢?答案是肯定的。让我们一起看看具体的方法吧。

首先,我们需要更改BIOS中的异常处理函数,将EJTAG调试异常重新路由至Linux内核中处理,因为MIPS EJTAG异常处理程序的入口地址固定为0xbfc00480

         /* Debug exception */
         .align  7           /* bfc00480 */
         .set    push
         .set    noreorder
         .set    arch=mips64r2
         dmtc0   k0, CP0_DESAVE
         mfc0    k0, CP0_DEBUG
         andi    k0, 0x2
         beqz    k0, 1f
          mfc0   k0, CP0_STATUS
         andi    k0, 0x18
         bnez    k0, 2f
         mfc0    k0, CP0_EBASE
         ins     k0, zero, 0, 12
         addiu   k0, 0x480
         jr      k0
          dmfc0  k0, CP0_DESAVE
         la      k0, 0xdeadbeef
         dmtc0   k0, CP0_DEPC
         dmfc0   k0, CP0_DESAVE
         .set    pop

1. 将来自用户态的sdbbp指令触发的异常路由至地址 0xdeadbeef。
2. 将来自内核态的sdbbp指令触发的异常或是任意态的非sdbbp触发的异常路由至 ebase+0x480。

1. 实现 EJTAG watch 相关的 probe、install、read、clear 等操作,及合适的调试异常处理程序。
2. 实现 Linux ptrace watch 接口与 EJTAG watch 的对接。

See: https://github.com/heiher/linux-stable/commits/ejtag-watch-4.9


MIPS J类指令目标范围

MIPS 跳转指令共分为三类:基于PC的相对跳转、基于PC区域的相对跳转、基于寄存器的绝对跳转。其中基于 PC 区域的相对跳转也就是我们要说的 J 类指令。

J类指令有长达26位的指令 index 编码域,因为指令都是4字节对齐的,所有表示的范围是 256M(28位)。那么J类跳转的目标地址是如何计算的呢?

目标PC = 延迟槽指令PC的28位以上的高位 || (J类指令26位的立即数 << 2)


|: 265M 边界
j: j 指令位置
t: 可行的跳转目标位置


System V AMD64 ABI calling conventions

The calling convention of the System V AMD64 ABI is followed on Solaris, Linux, FreeBSD, Mac OS X, and other UNIX-like or POSIX-compliant operating systems. The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, and R9, while XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments. For system calls, R10 is used instead of RCX. As in the Microsoft x64 calling convention, additional arguments are passed on the stack and the return value is stored in RAX.

Registers RBP, RBX, and R12-R15 are callee-save registers; all others must be saved by the caller if they wish to preserve their values.

Unlike the Microsoft calling convention, a shadow space is not provided; on function entry, the return address is adjacent to the seventh integer argument on the stack.