Over the years, I've seen A/B testing used in a multitude of ways, sometimes well but more often inappropriately. Here are some examples of how A/B testing should not be used, and why:
Using A/B testing as a substitute for innovation or design. A/B testing won't produce an iPhone! All A/B testing can do is optimize around a local maximum.
Trying to find a subtle difference with a sample size that's too small. You need statistical significance, and the smaller the difference in performance, the more users you'll need in your test to detect it.
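For a sense of scale, here's a back-of-the-envelope calculation of how many users each arm needs to detect a given lift in conversion rate, using the standard two-proportion z-test approximation. The 5.0% baseline and the lifts are invented numbers, purely for illustration:

```python
# Rough per-arm sample size needed to detect a small lift in conversion rate.
# The baseline and lift numbers below are made up purely for illustration.
from scipy.stats import norm

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed in each arm to tell p1 from p2
    with a two-sided z-test at the given alpha and power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

print(round(sample_size_per_arm(0.050, 0.055)))  # ~31,000 users per arm
print(round(sample_size_per_arm(0.050, 0.051)))  # ~750,000 users per arm
```

Cutting the detectable lift from 0.5 points to 0.1 points multiplies the required sample by roughly 25x, which is why chasing tiny differences with a small user base is a waste of time.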
Continuing to test after you're close enough to the local maximum. Some companies go nuts with A/B testing trying to find minute performance improvements when they should instead spend their time developing a next-generation solution (trying to find a new global maximum).
Trying to get an immediate result. A/B tests are sometimes run on existing customers, who are used to a different experience than the one you are testing. So, if you see a difference in performance between test and control, you can't tell whether it's attributable to the difference in design (which is what you are trying to test) or to the element of surprise. In the case of changing search algorithms, almost every tweak generates a lift... but only temporarily. It's not that the new algorithm is better, it's that it's new. So, you need to run the test long enough for this "newness effect" to wear off.
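One way to check for this is to look at the lift week by week and see whether it decays. A minimal sketch, assuming a results table with hypothetical week, variant, users, and conversions columns:

```python
# Check for a "newness effect": if the test-over-control lift shrinks week
# over week, the win may be novelty rather than a genuinely better design.
# The DataFrame schema (week, variant, users, conversions) is assumed.
import pandas as pd

def weekly_lift(results: pd.DataFrame) -> pd.Series:
    """Relative lift of the test variant over control, per week."""
    totals = results.groupby(["week", "variant"])[["conversions", "users"]].sum()
    rates = (totals["conversions"] / totals["users"]).unstack("variant")
    return (rates["test"] - rates["control"]) / rates["control"]

# A lift of +8% in week 1 that decays toward 0% by week 4 suggests the effect
# was mostly surprise, and the test needs to run longer before calling a winner.
```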
Measuring the wrong variable. A/B tests are often used to improve completion rates of multi-step flows: creating a new account, filling out a questionnaire, completing a survey, and so on. Let's say you're trying to generate mortgage leads by showing a banner ad, followed by a landing page, followed by a questionnaire, followed by an explicit agreement to be contacted by a sales rep. If you run A/B tests on the banner ad, you can't just measure the impact on click-through rate; you have to measure the impact on the final completion rate through the rest of the funnel. As an extreme example, you could probably double the click-through rate by putting a racy picture of a woman on the banner, but the users you pick up with the new ad were never going to complete the form in the first place, and you've just alienated (and are no longer getting clicks from) your target demographic of professional women. The racy ad looks better if you measure click-through rates but is clearly much worse if you measure the real target variable of quality leads.
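To make that concrete, here's a toy comparison with invented funnel counts for two hypothetical banner variants; the only point is that ranking by click-through rate and ranking by leads per impression can disagree:

```python
# Invented funnel counts for two hypothetical banners, purely for illustration:
# variant: (impressions, clicks, completed questionnaires, agreed-to-contact leads)
funnel = {
    "original": (100_000, 2_000, 400, 200),
    "racy":     (100_000, 4_000, 300, 100),
}

for name, (impressions, clicks, forms, leads) in funnel.items():
    ctr = clicks / impressions
    lead_rate = leads / impressions  # measure the whole funnel, not just the first step
    print(f"{name:>8}: CTR {ctr:.1%}, leads per impression {lead_rate:.2%}")

# The "racy" banner doubles click-through (4.0% vs 2.0%) but halves the real
# target variable, quality leads per impression (0.10% vs 0.20%).
```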
Transplanting into a fragile system. You can wreak havoc by transplanting a new component into an otherwise interdependent system. This happens in biological systems all the time. Transplant an organ that works far better than the failing one, and the immune system goes haywire. Introduce a new species to prey on a pest, and it can destroy the entire ecosystem. The same can happen with A/B testing. You might substitute one module for another, see good performance metrics, but throw off users' mental model. You might acquire more users but change your demographic completely. Measure the potential side effects, and don't commit to a modified product design based on the outcome of your A/B test until you're confident you haven't thrown your system into disarray.