The examples presented so far have exhibited an unfortunate trade off between code clarity and efficiency.
Indeed they have.
My observation is that this phenomena is not just unfortunate here but true in general. Getting things done quickly is likely to involve using techniques that are not obvious to those that don't know the theory behind it.
A very simple example might be the simple problem of summing a sequence of integers. The naive will start adding them up one by one, result = 1 + 2 + 3 + 4 ... N. They might write a loop to do that if they are learning programming. Someone with a little maths knowledge will take a short cut, result = N * (first + last) / 2. BOOM done.
A rather more complex example (there is a joke there, wait for it) is that of the Fourier Transform. Anyone who has learned about Fourier in school could probably write a discreet Fourier transform. It will of course be a slow as hell for large data sets. It took Cooley and Tukey to figure out a much faster way to do it in 1965 with the Fast Fourier Transform.
When I saw my first FFT code, in BASIC, I had no idea what it was doing, it seemed totally crazy what with that bit reversal thing going on and such. It was two or three decades before I found out how/why the FFT worked and could write one for myself.
All of this is before we talk about how to write any of these algorithms efficiently in any particular language. You can spend as much time as you like micro-optimizing the wrong algorithm with your hand crafted assembler or whatever, but you will never beat the smart algorithm that is orders of magnitude faster and written in Python!
And, as I think you are saying, the best algorithm for the job may well depend on the size of the data set. Or it may depend on the typical data you have in your application rather than some assumed random input.
When I was getting my head around the FFT I wrote a recursive solution first. I used that to develop my typical three nested loop iterative solution, which was much faster. Except I was amazed to find that when my data set exceed 4 million or so that recursive solution started to be the faster!
Then there is supercomputers and massively parallel systems. A strange case there is the problem of calculating the digits of PI. The current record number of digits calculated is 22,459,157,718,361. Calculated in 100 days or so.
You might think the worlds top supercomputers could tackle this during their lunch break. Turns out they can't. It was done on a souped up PC (4 x Xeon E7-8890 v3 @ 2.50 GHz (72 cores, 144 threads) 1.25 TB DDR4 RAM + 20 x 6 TB disk)
This optimization thing is complicated.
Meanwhile, I'll get back to my C++ thing....
Memory in C++ is a leaky abstraction .