It is important to understand that an FPGA does not replicate the circuits, or anything even similar to them. It only replicates their function. Consider a basic digital logic gate, an inverter.
Software emulation is this:
A = not A or perhaps A = ~A
Real hardware is this:
FPGA is this:
Ahem ... I somewhat disagree.
Your NOT instruction example is actually a good setup.
To execute a NOT, a CPU has to explicitly load a memory location into a register, perform the NOT, mask as needed, and then write the result back to a memory location.
An FPGA just needs to connect a pin to one of the inputs of an LE's LUT, have the LUT configured as a NOT, and connect the output of that LE to an output pin ... given the right timing (FPGAs are synchronous beasts by far [and so are CPUs, for that matter]; the NOT circuit you attached is an async TTL design, which is hardly ever used standalone, btw). In the end I can take the FPGA, with appropriate voltage converters if need be, and use it to replace the actual NOT circuit. For the SW version I instead need to find a way to map the actual input to some interface, do the same for the output, and hope I don't have to switch too fast (maybe a hundred kHz, but hardly anything more than that).
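To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is only a model: `make_lut`, `not_lut`, `software_not` and the `memory` dictionary are made-up names for illustration, and a real LE's LUT typically has several inputs (commonly 4 to 6), not one.

```python
def make_lut(truth_table):
    """Model a LUT: the output is simply looked up in a fixed table."""
    def lut(*inputs):
        index = 0
        for bit in inputs:
            # Pack the input bits into a table index, first input = MSB.
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut

# A 1-input LUT "configured" as NOT: input 0 -> 1, input 1 -> 0.
not_lut = make_lut([1, 0])

# Hypothetical memory-mapped pins for the software version.
memory = {"pin_in": 1, "pin_out": 0}

def software_not(mem):
    """CPU-style NOT: load, invert, mask, store (four separate steps)."""
    reg = mem["pin_in"]    # load from memory into a "register"
    reg = ~reg             # bitwise NOT flips every bit of the word
    reg &= 1               # mask back down to the single bit we care about
    mem["pin_out"] = reg   # store the result back to memory
    return reg
```

Both give the same logical answer; the point is that the LUT is a fixed lookup wired between two pins, while the software path is a sequence of steps that each cost time.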
Now take an FPGA with, say, 5K LEs: they can all perform a NOT in parallel, at once, in sync with the clock at, say, 1 MHz. Then take your SW implementation on a single-core x86 at 5 GHz and assume the NOT + LOAD from memory + STORE to memory, plus whatever is needed to map to a real HW interface (if you use IN/OUT instructions it's dog slow, but ....), all takes a single clock cycle. (It won't, as memory accesses will be needed, and because this is real-world hardware, caching is useless and actually counterproductive.) Anyhow, even in the best of situations you'll have time to execute around 5B of those instructions in 1 sec, and that is about 5K NOTs at 1 MHz ... so it looks like you're on par, but you'll be sequential on those (one will fire at the beginning and one at the end of each 1 MHz period, as you work through the loop of 5K).
Now take a 35K LE FPGA and have the NOTs run at, say, 10 MHz, start accounting for real memory accesses, and you'll see that the CPU NOT instruction is anything but what an FPGA is doing. You can throw even 16 cores at it and still come up way short. You'd need 70 cores at 15 GHz (assuming load and store each take only 1 cycle) ... and even if you try to add pipelining you'll end up much slower, as data will take a while to come from memory and, again, you cannot cache because you're dealing with an externally controlled signal.
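For anyone who wants to check the back-of-the-envelope numbers, here is a small Python sketch. The function names are mine, and the model is deliberately idealized: every LE does one NOT per clock, and the CPU spends a fixed number of cycles per NOT (1 in the generous case, 3 for load + NOT + store at one cycle each), with no memory latency at all.

```python
def fpga_nots_per_sec(logic_elements, clock_hz):
    """Every LE performs one NOT per clock tick, all in parallel."""
    return logic_elements * clock_hz

def cpu_nots_per_sec(cores, clock_hz, cycles_per_not):
    """Each core performs one NOT every `cycles_per_not` cycles, sequentially."""
    return cores * clock_hz // cycles_per_not

# Case 1: 5K LEs at 1 MHz vs. one 5 GHz core, 1 cycle per NOT (best case).
case1_fpga = fpga_nots_per_sec(5_000, 1_000_000)
case1_cpu = cpu_nots_per_sec(1, 5_000_000_000, 1)

# Case 2: 35K LEs at 10 MHz vs. 70 cores at 15 GHz,
# 3 cycles per NOT (load, NOT, store at one cycle each).
case2_fpga = fpga_nots_per_sec(35_000, 10_000_000)
case2_cpu = cpu_nots_per_sec(70, 15_000_000_000, 3)

print(case1_fpga, case1_cpu)
print(case2_fpga, case2_cpu)
```

Under those assumptions the first pair comes out equal at 5 billion NOTs per second, and the second pair equal at 350 billion, which is where the 70 cores at 15 GHz figure comes from. Real memory latency only makes the CPU side worse.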
That said, you can simulate/emulate in SW what an FPGA would do on each clock cycle; it just won't be fast enough for real-time videogame playing.
There's SW (like SPICE/PSpice/HSPICE) that simulates transistors at the physics level to run ... well ... simulations, and it takes a long time to get results for any moderately complex circuit/netlist.
BTW the NOT circuit you chose to link is only one of dozens of possible designs (specifically a TTL, low-speed, no-Schottky-diode, totem-pole-output design), many of which are compatible with each other even if they do not share the same physical design or exact timing. The truth table is what actually matters (along with the physical voltage levels) .... so much so that if you look back at PALs (the grandfather of CPLDs and, through them, FPGAs) you'll see that this is actually how you'd program them: with Boolean equations rather than a table, but it amounts to the same thing.
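A quick Python sketch of the "truth table is what matters" point: three differently built inverters that are interchangeable at the logic level. All names here are illustrative, and this says nothing about voltage levels or timing, which still have to match in real hardware.

```python
from itertools import product

def truth_table(fn, n_inputs):
    """Enumerate fn's output for every combination of 0/1 inputs."""
    return [fn(*bits) for bits in product((0, 1), repeat=n_inputs)]

def not_direct(a):
    """Inverter expressed arithmetically."""
    return 1 - a

def nand(a, b):
    return 1 - (a & b)

def not_from_nand(a):
    """Inverter built by tying both NAND inputs together."""
    return nand(a, a)

def not_equation(a):
    """Inverter written as a Boolean equation, PAL-style: out = !a."""
    return int(not a)

tables = [truth_table(f, 1) for f in (not_direct, not_from_nand, not_equation)]
same = all(t == tables[0] for t in tables)
```

If `same` is true, the three are functionally equivalent even though their "designs" differ.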
... I personally think you're spending too much time and energy attempting to bend what an FPGA can/can't do to match your notion of SW emulation ... it's a waste of time. They do things very differently even if they attempt to achieve a similar goal, and cycle-exact emulation in SW is (as of today) still not possible in real time for anything but the slowest/oldest/simplest machines, while we can get bloody close with an FPGA for the 8-bit and 16-bit systems. The converse is also true: we have HLE for many advanced machines in SW, but FPGAs cannot get there in any meaningful way for now.
Both of them can only be as accurate as our knowledge of the devices/experiences we're trying to replicate, no question about it.
FPGAs promise to achieve higher accuracy to the actual hardware at the cycle level (as in, I can replace the old rig with an FPGA version and not notice [like flashcarts that simulate complex mappers do; it would be cheaper to slap an RPi and a few IO pins in there, but it just won't work]), again depending on how much we know about the HW. They also won't require tons of cores at very high speed ... so in principle they should be cheaper .... In practice this is not the case, as CPUs are currently so cheap thanks to volume.
SW emulation can reach a high level of fidelity, but it is bound by the max speed of the core it is using and by how many cores it can really use in parallel (notice that many emulators can't really run in multithreaded fashion at their core).