I want to study the behavior of the branch predictor in current CPUs. To be precise, I will control the branch predictor indirectly, by controlling the branches taken and therefore the pattern it has to predict. The code I am using is given below:

For frequent branch mispredictions (unsorted data)

Code:
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];
    
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            // Data is unsorted, so the branch outcome is close to random
            // and the predictor misses often.
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}
For maximum correct predictions (sorted data)

Code:
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    std::sort(data, data + arraySize);
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            // After sorting, the branch is not taken for the first half of the
            // array and taken for the second half, so the predictor is almost
            // always right.
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}
I want to measure the clock cycles consumed while the branch predictor is in play, in both cases: when it predicts correctly and when it fails. How can I do that? I expect inaccuracies from context switching, out-of-order execution, SpeedStep, TurboBoost, and the lack of synchronization of the TSCs between different cores (the last of which could perhaps be handled with CPU affinity). A rough sketch of the RDTSC-based timing I have in mind is below the list. Is there a way:

  1. To prevent context switching on one CPU core, while leaving the others under normal operation?
  2. To prevent out-of-order execution and/or pipelining while the code being profiled runs?
  3. To lock one CPU core to a fixed frequency while leaving the other cores untouched?


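For reference, this is roughly the timing harness I am thinking of. It is only a minimal sketch, assuming GCC-style inline assembly on x86-64 and a CPU that supports RDTSCP; the function names rdtsc_start/rdtsc_stop are my own. CPUID is used as a serializing fence around the measured region and RDTSCP waits for the preceding instructions to retire, but as far as I understand this only fences the measurement boundaries and does not disable out-of-order execution inside the region, hence question 2 above.

Code:
#include <cstdint>
#include <iostream>

// CPUID serializes the instruction stream, then RDTSC reads the time-stamp
// counter; this "start" sequence keeps earlier instructions out of the
// measured region.
static inline uint64_t rdtsc_start()
{
    uint32_t lo, hi;
    asm volatile("cpuid\n\t"
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)
                 :
                 : "%rbx", "%rcx", "memory");
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// RDTSCP waits for all earlier instructions to retire before reading the TSC;
// the trailing CPUID stops later instructions from being hoisted above it.
static inline uint64_t rdtsc_stop()
{
    uint32_t lo, hi;
    asm volatile("rdtscp"
                 : "=a"(lo), "=d"(hi)
                 :
                 : "%rcx", "memory");
    asm volatile("cpuid" : : : "%rax", "%rbx", "%rcx", "%rdx", "memory");
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main()
{
    uint64_t t0 = rdtsc_start();
    // ... the loop over data[] would go here ...
    uint64_t t1 = rdtsc_stop();
    std::cout << "cycles: " << (t1 - t0) << std::endl;
    return 0;
}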
I don't mind playing with assembly or diving into kernel space; however, patching the kernel is not an option.
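For point 1, the closest I have got in user space is pinning the process to a single core with sched_setaffinity, which at least keeps all TSC reads on the same core. A minimal sketch, assuming Linux (pin_to_core is my own helper, and the core number is an arbitrary choice):

Code:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for CPU_ZERO/CPU_SET and sched_setaffinity
#endif
#include <sched.h>
#include <cstdio>

// Pin the calling process to one core so the scheduler cannot migrate it
// between TSC reads.
static bool pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

int main()
{
    if (!pin_to_core(0))
        std::perror("sched_setaffinity");

    // ... timing code would run here ...
    return 0;
}

On its own this only keeps my process on that core; keeping everything else off it would presumably need something like the isolcpus= boot parameter or cpusets, which I believe counts as kernel configuration rather than kernel patching.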

In case it is relevant, I am using Fedora 21 x86-64 (kernel 3.17) on an Intel i5 M430.

Any help is appreciated. Thanks in advance.