I think that the point wasn't quite clear:
The variance between trials in in-game testing is so large that you can't get an accurate picture of how well the set bonus will perform. It's simply impossible to run enough trials in-game to get an accurate feel for something like this, so people have created simulators that can shrink the margin of error enough to make confident decisions.
Let's say that you go to a target dummy and do 100 6-minute sessions (so you fight the target dummy for 10 hours straight, one very, very boring afternoon). And let's say you have pretty good gear, such that fully buffed you could get up to the 250k DPS area. Across those 100 sessions, your DPS could swing by as much as 15,000 up or down from one fight to the next.
Now... if you tested on a target dummy and saw your DPS jump by 15k from one session to the next, wouldn't you say that something "really good" just happened? Nope. You just got lucky.
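To put rough numbers on that, here's a toy sketch in Python. It is not how any real simulator works. It just assumes each session's DPS is the 250k average plus bell-curve luck, with a made-up spread (SESSION_SD) of 5,000 so that a 15k swing is rare but possible. The function names and the spread are my own assumptions for illustration only.

```python
import random
import statistics

TRUE_DPS = 250_000   # from the example above: roughly 250k fully buffed
SESSION_SD = 5_000   # ASSUMPTION: session-to-session spread, not a measured value

def run_session(rng):
    """One pretend 6-minute session: the true average plus random luck."""
    return rng.gauss(TRUE_DPS, SESSION_SD)

def mean_dps(rng, n_sessions):
    """Average DPS over n_sessions pretend sessions."""
    return statistics.fmean([run_session(rng) for _ in range(n_sessions)])

def standard_error(n_sessions):
    """Typical error of the average after n_sessions (shrinks with sqrt(n))."""
    return SESSION_SD / n_sessions ** 0.5

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    sessions = [run_session(rng) for _ in range(100)]
    print("best minus worst of 100 sessions:", round(max(sessions) - min(sessions)))
    for n in (1, 100, 10_000):
        print(f"{n:>6} sessions -> average is good to about +/- {standard_error(n):,.0f} DPS")
```

Under those assumptions, a single session tells you almost nothing, 100 sessions (your 10-hour afternoon) still leaves the average uncertain by hundreds of DPS, and it takes something like 10,000 sessions to pin it down tightly. A simulator can run that many in very little time; you can't do it at a dummy.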