I'm not exactly sure what you mean by command set. You refer to the same CPU architecture but then compare cycles for opcodes.
This is dependent on how the benchmark routines are coded. Using the same command set and routine (on the same CPU architecture) would indicate any efficiency in the number of cycles it takes for each opcode etc.
On the same architecture you'd run the same code and on a different architecture you'd run equivalent code.
Comparing clock cycles per instruction is the approach I've seen countless 6502 fans try to use.
This drastically skews the results in favor of the 6502 in most of the comparisons I've seen, because most of them don't deal with 16-bit manipulation, and when they do, its use is limited. The number of pointers and variables being manipulated is often limited as well. Done this way, the results can make the 6502 appear to be over 3 times faster than the Z80 at the same MHz.
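To make the 16-bit point concrete, here is a rough C sketch of what a 16-bit add costs a CPU that only has 8-bit arithmetic. The 6502 has no 16-bit add instruction, so it does the low byte, then the high byte plus carry, with loads and stores around each step, while the Z80 has 16-bit adds such as ADD HL,DE. The function names are mine and purely illustrative. This only models the steps involved and says nothing about actual cycle counts.

```c
#include <stdint.h>

/* Illustrative only: add two 16-bit values using nothing but 8-bit
   operations and an explicit carry, roughly the sequence of steps an
   8-bit-only ALU has to go through. */
uint16_t add16_via_8bit(uint16_t a, uint16_t b)
{
    uint8_t a_lo = (uint8_t)(a & 0xFF);
    uint8_t a_hi = (uint8_t)(a >> 8);
    uint8_t b_lo = (uint8_t)(b & 0xFF);
    uint8_t b_hi = (uint8_t)(b >> 8);

    /* Step 1: add the low bytes and capture the carry out of bit 7. */
    uint16_t lo_sum = (uint16_t)(a_lo + b_lo);
    uint8_t  carry  = (uint8_t)(lo_sum >> 8);
    uint8_t  lo     = (uint8_t)(lo_sum & 0xFF);

    /* Step 2: add the high bytes plus the carry; the result wraps mod 256. */
    uint8_t hi = (uint8_t)(a_hi + b_hi + carry);

    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

/* The same operation where the CPU can add 16 bits in one step. */
uint16_t add16_native(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

Both functions return the same value for every input. A benchmark that rarely touches 16-bit values never exercises the longer path, which is why counting cycles per instruction alone flatters the 8-bit-only design.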
On the other hand, you can't call a benchmark that always deals with 16 bits fair, because it will adversely impact 6502 results and favor CPUs that deal with 16 bits better. This is why I suggest a variety of benchmarks. I intentionally chose some on which the 6502 would perform very well and some on which CPUs that support 16 bits would perform well.
Many benchmarks for modern hardware do the same thing: they give a score for each category. I suggested several benchmarks involving bit manipulation, math (assuming identical algorithms and accuracy for floats) and other things an 8-bit CPU would realistically be asked to do. At least one benchmark written entirely in C is important because it can contrast differences in how well different CPUs support high-level languages.
The SIEVE results I found in the UCSD Pascal code are important because they were run on P-Machines optimized for each CPU, all running the exact same code. Each CPU has to perform exactly the same work. The results varied, but if you look at them, the speed ratio between the 6502 and the Z80 is more like 2:1 on average, or less if you take the fastest Z80 version. Some people will try to argue that their favorite CPU's virtual machine probably wasn't optimal, but emulating these P-Code instructions is pretty simple, so I doubt you'll get back to a 3:1 clock ratio through optimizing.
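For anyone who hasn't seen it, a sieve benchmark looks roughly like this in C. This is my own minimal sketch, not the UCSD Pascal listing, and the array size and the names (SIEVE_MAX, count_primes) are assumptions picked for illustration. The inner loop starts at i + i rather than i * i so the index stays small enough for a 16-bit int.

```c
#include <string.h>

/* Assumed array size for this sketch; small enough to fit in an
   8-bit machine's memory. */
#define SIEVE_MAX 8190u

static unsigned char composite[SIEVE_MAX + 1];

/* Count the primes in 2..limit with a plain Sieve of Eratosthenes.
   Limits above SIEVE_MAX are clamped to SIEVE_MAX. */
unsigned count_primes(unsigned limit)
{
    unsigned i, j, count = 0;

    if (limit > SIEVE_MAX)
        limit = SIEVE_MAX;

    /* Start with every number marked as "not known to be composite". */
    memset(composite, 0, sizeof composite);

    for (i = 2; i <= limit; i++) {
        if (!composite[i]) {
            count++;                      /* i is prime */
            for (j = i + i; j <= limit; j += i)
                composite[j] = 1;         /* strike out multiples of i */
        }
    }
    return count;
}
```

Calling count_primes(100) returns 25. For a benchmark you would call it a fixed number of times and time the whole run, with the same source on every machine, so the only things that differ are the CPU and the compiler or virtual machine underneath.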
One flaw in the UCSD Pascal SIEVE benchmark is that UCSD Pascal runs identically on every platform, and it was originally designed for mainframes or minicomputers. It could possibly be tuned to work better on 8-bit CPUs. I believe BASIC-09 for the CoCo offers a much greater speed advantage over Microsoft BASIC than Apple Pascal (modified UCSD) does, even though it uses a similar virtual machine.
You have to use the same algorithm but with common sense optimizations for each CPU. It would not make sense to draw lines on the screen and allow the code for one CPU to have special cases for different angles but not use the same optimizations for other CPUs.
If the benchmark routine for a different CPU architecture is written differently, however, it may not be optimal and the benchmark results would be flawed.