Measure CPU clock cycles spent executing a function using the x86_64 RDTSCP instruction, from both kernel and user space