Simulation is best used to generate large quantities of data and then analyze that data in aggregate. Our optimization algorithm follows the trends in that data very closely.
It is possible to construct specific tests, like the one you ran, that make the algorithm look "off" - but the difference is well within a reasonable margin of error for any statistical model.
We discourage using small numbers of data points to compare gear, because simulation models are not perfect models of WoW. On top of that, it is impossible to measure the actual error of the simulation model relative to the real game. We know we are "close" - but exactly how close (1%? 2%? 5%?) cannot actually be determined. Given this unknown structural error, simulating two sets of gear that are very close in value and comparing just those two data points is not sufficient to determine which set of gear is actually going to be better in-game.
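To illustrate why a single sim per gear set is unreliable, here is a minimal sketch. All the numbers (DPS means, noise level) are made up for illustration and are not taken from any real simulator; the point is just that when the true difference between two sets is small relative to per-run noise, one run per set frequently picks the worse set.

```python
import random

# Hypothetical numbers, not from any real sim: two gear sets whose
# "true" average DPS differs by only 0.5%, with per-run noise typical
# of a combat simulation.
random.seed(42)

TRUE_DPS_A = 10000.0   # true mean DPS of gear set A
TRUE_DPS_B = 10050.0   # set B is truly 0.5% better
RUN_NOISE = 300.0      # standard deviation of a single sim run

def single_sim(true_dps):
    """One simulated 'run': the true mean plus random noise."""
    return random.gauss(true_dps, RUN_NOISE)

# Compare the sets using just one data point each, many times over.
trials = 10000
wrong = sum(1 for _ in range(trials)
            if single_sim(TRUE_DPS_A) > single_sim(TRUE_DPS_B))
frac = wrong / trials
print(f"Single-run comparison picks the worse set {frac:.0%} of the time")
```

With these assumed numbers, the single-run comparison is barely better than a coin flip, which is why two close data points can't settle which set is better.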
Simulation data should be thought of as showing us trends: all the combinations of gear that simulate to within 1, 2, or even 3% of the highest data point found are functionally equivalent in-game, assuming the player can play optimally in all cases. The goal of a gear optimizer is to find one of the sets of gear that falls in that top 1 or 2% of simulated results, out of the millions, billions, or quadrillions of possibilities. This ensures that in-game, we will have the best chance possible of doing optimal damage.
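The "functionally equivalent" idea above can be sketched as a simple filter. The gear set names and DPS values below are invented for illustration; the 2% threshold is one of the example tolerances mentioned above, not a fixed rule of any particular optimizer.

```python
# Hypothetical sketch: treat every candidate within 2% of the best
# simulated result as functionally equivalent in-game.
sim_results = {        # made-up (gear set -> simulated DPS)
    "set_a": 10210.0,
    "set_b": 10180.0,
    "set_c": 10055.0,
    "set_d": 9700.0,
}

best = max(sim_results.values())
equivalent = {name for name, dps in sim_results.items()
              if dps >= best * 0.98}
print(sorted(equivalent))  # -> ['set_a', 'set_b', 'set_c']
```

An optimizer that returns any member of `equivalent` has done its job; arguing over which member is "really" best is arguing inside the noise.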
So, the TLDR answer to your question is probably: your test simulation shows that BiB is functioning correctly. It has correctly chosen one of the sets of gear that the simulation data shows will most likely result in optimal performance.