If a problem only occurs randomly once in every N times on average, how many tests do I have to perform to be...
I know that you can never be 100% sure, but is there a method to determine an appropriate number of tests?
manual-testing intermittent-failures
7
@whoever voted to close - would you say this is primarily opinion based? or answerable with some math and probability?
– trashpanda
May 28 at 9:04
7
This is certainly not opinion based but related to statistical analysis. Although the question can be, and was, asked in math related SEs it is very relevant here and the context might be a bit different.
– Rsf
May 28 at 9:50
2
@JoãoFarias Of course, controlling said randomness can indeed be tricky. I like to look at the example of CHESS, which goes to great lengths to mock the scheduling algorithm of the OS to find multithreading bugs. Sometimes a statistical approach, while less satisfactory, can be more valuable from a business perspective.
– Cort Ammon
May 28 at 16:18
1
Possible duplicate of How can I be sure that rarely reproduced issue is fixed?
– Kevin McKenzie
May 28 at 17:51
6
As a developer, "this fails randomly" actually means "I haven't yet been able to pinpoint the set of circumstances that makes this fail, and I cannot spend more time investigating it". Likewise, "I've solved this random error" means "The error still happens, but you won't notice anymore because I've added logic to capture it and fix any wrong data or actions it caused".
– walen
May 29 at 8:43
edited May 28 at 8:51 by jonrsharpe
asked May 28 at 7:03 by Sam Hall
6 Answers
I'm going to take a different approach than statistics (though I think the other response answers your actual question more directly). Any time I've encountered "a problem that only happens some of the time" in either a QA or a support role, it's been an investigative exercise in narrowing down why the event happens irregularly or in what situations it occurs.
Some (but certainly not all) points of investigation may be:
- Specific accounts or data.
- Differences in hosts/environments the applications or services are running on.
- Different versions of the application or service running on different hosts.
- Certain days, dates, times of day or time zones.
- Certain users and their specific means of accessing the application (physical device, browser, network connection).
This sort of situation is where reproduction steps and other details from the person reporting the problem can be so valuable in resolving their issues. Telling a customer "your problem is fixed" when you're just making an educated guess can spiral in a negative direction if your experiment is based on incorrect assumptions. In my experience it's better to try to coach them about what information will help resolve their problem and how they can help you get it.
20
Can I add a really obscure one that troubled one company I worked at? "Whether the logged-in user has an odd or an even number of characters in their username". (Which meant that the developer didn't have a problem, and the first test user did.)
– Martin Bonner
May 28 at 17:12
That's definitely an interesting one. It's always a good challenge when it gets to that granular of a difference.
– Cherree
May 28 at 18:09
7
Back before the turn of the century, we had an issue with phone calls counting double when transit calls (routed through the country but originating and terminating in other countries) for some specific pair of countries had happened on exactly 20 days of a calendar month. That one was completely deterministic but hell to track down. (Root cause was a buggy database driver fetching rows in batches of 20)
– JollyJoker
May 29 at 7:38
1
So much this; apart from hardware failures (and buggy RAM does exist) most problems are deterministic. Even a race condition is just a fancy name for "A occurs before B some of the time". If you have flaky behavior, it means you haven't determined the conditions in which it happens in sufficient detail.
– Matthieu M.
May 29 at 11:22
1
Just another anecdotal example - we had a problem once that the customer reported happened sometimes and we couldn't reproduce. One amazing QA person we had managed to narrow it down to...network latency. She played with the network throttling options in Chrome (it's under the dev tools) and found that the problem was almost exclusively happening between some network speeds. On a fast network, you wouldn't see the problem, nor would you on a very slow network. It happened in a narrow band. It was still irregular but she found the optimum speeds where it happened around 80% of the time.
– VLAZ
May 30 at 11:22
You must perform the same test the same number of times on both the unfixed version and the supposedly fixed version. You have to show that:
- The test fails once in every N times randomly, on the unfixed version.
- The same test passes every time, or at least fails less often, on the fixed version.
You have to show that the only difference between 1 and 2 is the "fix" itself, not any external or environmental factors.
If you only perform the test on the new, fixed version, it could very well be the case that the bug was caused by an unrelated environmental factor that simply doesn't exist in your test now.
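One way to put a number on "fails less often" is a one-sided Fisher exact test on the two sets of runs. This is a swapped-in standard technique, not something the answer itself prescribes, and the function and parameter names below are made up for illustration. The sketch assumes the runs are independent and that nothing but the fix differs between the two versions.

```python
from math import comb


def p_value_fewer_failures(old_runs, old_fails, new_runs, new_fails):
    """One-sided Fisher exact test (illustrative helper, assumed names).

    Returns the chance of seeing `new_fails` or fewer failures among the
    new runs if the fix changed nothing, i.e. if all failures were spread
    at random over the old and new runs together.
    """
    total_runs = old_runs + new_runs
    total_fails = old_fails + new_fails
    total_passes = total_runs - total_fails

    # Number of ways to choose which of all the runs are the "new" ones.
    ways_all = comb(total_runs, new_runs)

    # Ways in which the new runs contain at most `new_fails` failures.
    ways_as_good = sum(
        comb(total_fails, i) * comb(total_passes, new_runs - i)
        for i in range(new_fails + 1)
    )
    return ways_as_good / ways_all


if __name__ == "__main__":
    # 6 failures in 12 runs on the unfixed version, 0 in 6 on the fixed one.
    print(p_value_fewer_failures(12, 6, 6, 0))
```

A small value means the improvement is unlikely to be luck alone. It says nothing about environmental differences between the two sets of runs, which still have to be ruled out separately.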
9
+1 for the principle of verifying that the old version still fails. It's way too easy for random bugs to randomly go away for days due to some external changes such as network speed.
– jpa
May 29 at 6:53
This is really nasty on less strict platforms like JavaScript. Depending on how fast your computer processes the scripts, race conditions between scripts that override each other's functions can occur, and the code can randomly call a different function depending on the network latency and each script's load time.
– Nelson
May 29 at 8:58
I suppose this answer could help you
You need to decide first at what probability you want to "detect" the problem.
This is a nice example of why theoretical knowledge is necessary even for testers.
The simplified version:
p is the probability of failure on a single run, 1/N in our case
then the probability of success on a single run is 1-p
and the probability of n successful runs in a row is (1-p)^n
so the probability of seeing at least one failure in n runs is 1-(1-p)^n
setting this equal to x, the probability with which you want to detect the problem, then solving for n and simplifying a bit assuming big enough N gives:
n ≈ −log(1−x)⋅N
(*) "log" here is the natural logarithm, which is sometimes written (for example on calculators) as ln(x) or loge(x)
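A minimal Python sketch of the large-N approximation, assuming independent runs and writing x for the probability with which you want to detect the problem and n for the number of runs. The function names are illustrative and not part of the answer.

```python
import math


def approx_tests_needed(N, x):
    """Large-N approximation: about -N * ln(1 - x) runs are needed to see
    at least one failure with probability x when the failure rate is 1/N."""
    return -N * math.log(1.0 - x)


def detection_probability(N, n):
    """Exact chance of at least one failure in n independent runs,
    each failing with probability 1/N."""
    return 1.0 - (1.0 - 1.0 / N) ** n


if __name__ == "__main__":
    # The multiplier of N depends only on the desired detection probability.
    for x in (0.90, 0.95, 0.99):
        print(f"x = {x:.2f}: about {approx_tests_needed(1, x):.2f} * N runs")
```

The multiplier grows without bound as x approaches 1, which is why no finite number of runs gives certainty.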
5
I would replace "probability" with "confidence". How confident do you want to be? Note that to be 100% confident you would need to execute an infinite number of tests, because -log(1-x) goes to infinity as x goes to 1.
– dzieciou
May 28 at 18:20
1
And all this is based on the assumption that test executions are independent: for instance, that the error does not accumulate and only manifest after n executions.
– dzieciou
May 28 at 18:21
This is actually a valid point @Makyen, I edited the answer
– Rsf
May 29 at 8:07
This answer would be greatly improved by defining x, which just shows up in the last bullet point out of nowhere.
– Gregor
May 31 at 1:54
While I agree with the other answers saying "dig deeper", to answer the actual math question in the title:
If the issue occurs completely at random with probability p, then the chance of it occurring at least once in n trials is 1-(1-p)^n. Setting this to x (your confidence that the issue has been fixed) and solving for n gives you
n = log(1-x)/log(1-p)
So for example, if your issue occurs 1 out of 4 times, and you want to be 95% sure it's fixed (meaning you'll incorrectly identify it as fixed 1 out of 20 times!!), then
p = 0.25
x = 0.95
n = log(0.05)/log(0.75) ≈ 10.4
so, rounding up, you'd need to run 11 trials.
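The calculation above can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that p is known exactly and that trials are independent.

```python
import math


def trials_for_confidence(p, x):
    """Smallest whole number of failure-free trials n such that
    1 - (1 - p)**n >= x, i.e. n = log(1 - x) / log(1 - p), rounded up."""
    if not (0.0 < p < 1.0 and 0.0 < x < 1.0):
        raise ValueError("p and x must both lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - x) / math.log(1.0 - p))


if __name__ == "__main__":
    # The example from the answer: fails 1 time in 4, 95% confidence wanted.
    print(trials_for_confidence(0.25, 0.95))
```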
The difference between my answer and @emrys57's is that mine assumes you know the probability, while theirs assumes you know some initial sequence of results. Presumably they should both give the same answer with a large enough initial sequence.
A problem has been observed that sometimes, in testing, produces errors. We don’t actually know the probability that it will produce an error on any given test run, because we can only do a finite number of tests on the broken system. If 1 represents a test pass, and 0 represents a test fail, we might have a sequence of results from multiple tests that looks like this:
011001010011
This represents 6 passes and 6 fails in 12 tests. It gives us an estimate Pf of the probability that a test fails on the broken system. We assume that this probability isn’t changing with time. However, there will be a large uncertainty in the actual value of Pf since we cannot do very many tests.
Following these tests, we make a repair. We hope that this fixes the problem, but we’re not sure. We make more tests of the repaired system. If we have fixed the problem, the value of Pf measured after the repair should be 0. If we have not fixed the system, Pf should be unchanged from its value before the repair.
If we run some tests after the repair and one fails, we immediately know the repair failed. The question is, if no tests fail after repair, does that mean the repair worked?
Consider the case where we have results
a zeroes (test fails)
b ones (test passes)
of which c ones (test passes) were observed after the repair
The number of ways of arranging all a + b results is
Ntot = (a + b)! /(b! * a!)
In the case where Pf does not change yet all the last c results are ones, we need to arrange b - c ones in the first a + b - c results. The number of different ways of arranging these first a + b - c results is
Nsuc = (a + b - c)! / ( (b - c)! * a! )
These patterns are those from all the Ntot possible patterns where the last c results are all ones.
If the change didn’t actually repair the system, but really left its state the same, the results we observe are a random collection of zeroes and ones produced by chance, depending on the probability Pf that a single experiment will return a zero. Given this, the chance that we will observe a pattern of a zeroes and b ones with the last c results all ones is
Cran = Nsuc/Ntot
Or
Cran = (a + b - c)! * b! * a! / ((a +b)! * (b - c)! * a!)
Or
Cran = (a + b - c)! * b! / ((a + b)! * (b - c)!)
Cran is the chance that we observe c test passes after we make a repair to the system, out of a total of a fails and b passes, but in fact we have changed nothing, and we see a sequence of passes after repair by happenstance.
As an example, with the pattern above before repair, and 6 ones (passes) after repair, Cran is slightly less than 5%. To reach a confidence of 99% that the repair has succeeded, you need 11 passes after repair.
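The Cran formula can be checked with a short Python sketch. The function names are illustrative, and `passes_needed` is an added convenience that simply searches for the smallest c. As in the answer, b counts all passes, including the c passes seen after the repair.

```python
from math import comb


def chance_of_lucky_passes(a, b, c):
    """Cran = Nsuc / Ntot: the chance that the last c results are all passes
    when a fails and b passes (b includes those c) are arranged at random."""
    if c > b:
        raise ValueError("c cannot exceed the total number of passes b")
    n_tot = comb(a + b, a)      # (a + b)! / (b! * a!)
    n_suc = comb(a + b - c, a)  # (a + b - c)! / ((b - c)! * a!)
    return n_suc / n_tot


def passes_needed(fails_before, passes_before, confidence):
    """Smallest number of consecutive passes after the repair for which
    Cran drops to 1 - confidence or below (assumed helper, not in the answer)."""
    if fails_before < 1 or not (0.0 < confidence < 1.0):
        raise ValueError("need at least one earlier failure and 0 < confidence < 1")
    c = 0
    while chance_of_lucky_passes(fails_before, passes_before + c, c) > 1.0 - confidence:
        c += 1
    return c


if __name__ == "__main__":
    # 6 fails and 6 passes before the repair, then 6 passes after it.
    print(chance_of_lucky_passes(6, 12, 6))
    print(passes_needed(6, 6, 0.99))
```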
There's a spreadsheet that can be copied to make this computation. I hope I have it all right, it is 15 years since I last worked this out. And, back then, it took me 15 years to first find the answer.
Let B mean "broken", F mean "fixed", E_k mean "error observed in k trials". You are saying that P(E_1|B)=1/N; the probability of seeing an error in a single observation, given that it's broken, is 1/N. Now, that itself is likely going to have some uncertainty, since likely your only way of measuring it will be by seeing how often it fails, and making an empirical estimate. However, if we take that as given, then applying Bayes' rule gives us that if our prior probability is P(B), then the posterior probability after k trials with no error is P(B|~E_k) = P(~E_k|B)P(B)/P(~E_k).
For P(~E_k|B), we have P(~E_k|B)= P(~E_1|B)^k = (1-P(E_1|B))^k = (1-1/N)^k = ((N-1)/N)^k. For large N and k, we can approximate that as e^(-k/N).
For P(~E_k), we have P(~E_k) = P(~E_k|B)P(B)+P(~E_k|F)P(F). P(~E_k|F) =1 (If we've fixed it, we're guaranteed to see no errors). And P(F) is just 1-P(B).
Plugging that all in, we have
P(B|~E_k) ~= e^(-k/N)P(B)/(e^(-k/N)P(B)+1-P(B))
This can be made easier to read by setting X = e^(-k/N), Y = P(B). Then we have
XY/(XY+1-Y)
or
1-(1-Y)/(XY+1-Y)
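A small Python sketch of the posterior XY/(XY+1-Y), using the exact value X = ((N-1)/N)^k rather than the e^(-k/N) approximation. The function name and the example prior are assumptions made for illustration.

```python
def posterior_broken(k, N, prior_broken):
    """P(B | ~E_k): probability the system is still broken after k
    error-free trials, given a broken system fails once in N on average."""
    if N <= 1 or k < 0 or not (0.0 <= prior_broken <= 1.0):
        raise ValueError("need N > 1, k >= 0 and a prior between 0 and 1")
    x = ((N - 1) / N) ** k  # P(~E_k | B), exact form
    y = prior_broken        # P(B)
    return x * y / (x * y + 1.0 - y)


if __name__ == "__main__":
    # Assumed example: fails 1 time in 4 when broken, 50/50 prior belief.
    for k in (0, 5, 11, 20):
        print(k, round(posterior_broken(k, 4, 0.5), 4))
```

With no trials the posterior equals the prior, and every additional clean trial lowers it.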
answered May 28 at 13:32 by Cherree
20
Can I add a really obscure one that troubled one company I worked at? "Whether the logged-in user has an odd or an even number of characters in their username". (Which meant that the developer didn't have a problem, and the first test user did.)
– Martin Bonner
May 28 at 17:12
That's definitely an interesting one. It's always a good challenge when it gets to that granular of a difference.
– Cherree
May 28 at 18:09
7
Back before the turn of the century, we had an issue with phone calls counting double when transit calls (routed through the country but originating and terminating in other countries) for some specific pair of countries had happened on exactly 20 days of a calendar month. That one was completely deterministic but hell to track down. (Root cause was a buggy database driver fetching rows in batches of 20)
– JollyJoker
May 29 at 7:38
1
So much this; apart from hardware failures (and buggy RAM does exist) most problems are deterministic. Even a race-condition is just a fancy name for A occurs before B some of the time. If you have a flaky behavior, it means you have determined the conditions in which it happens in sufficient detail.
– Matthieu M.
May 29 at 11:22
1
Just another anecdotal example - we had a problem once that the customer reported happened sometimes and we couldn't reproduce. One amazing QA person we had managed to narrow it down to...network latency. She played with the network throttling options in Chrome (it's under the dev tools) and found that the problem was almost exclusively happening between some network speeds. On a fast network, you wouldn't see the problem, nor would you on a very slow network. It happened in a narrow band. It was still irregular but she found the optimum speeds where it happened around 80% of the time.
– VLAZ
May 30 at 11:22
|
show 2 more comments
20
Can I add a really obscure one that troubled one company I worked at? "Whether the logged-in user has an odd or an even number of characters in their username". (Which meant that the developer didn't have a problem, and the first test user did.)
– Martin Bonner
May 28 at 17:12
That's definitely an interesting one. It's always a good challenge when it gets to that granular of a difference.
– Cherree
May 28 at 18:09
7
Back before the turn of the century, we had an issue with phone calls counting double when transit calls (routed through the country but originating and terminating in other countries) for some specific pair of countries had happened on exactly 20 days of a calendar month. That one was completely deterministic but hell to track down. (Root cause was a buggy database driver fetching rows in batches of 20)
– JollyJoker
May 29 at 7:38
1
So much this; apart from hardware failures (and buggy RAM does exist) most problems are deterministic. Even a race-condition is just a fancy name for A occurs before B some of the time. If you have a flaky behavior, it means you have determined the conditions in which it happens in sufficient detail.
– Matthieu M.
May 29 at 11:22
You must perform the same test an equal number of times on both the unfixed version and the supposedly fixed version. You have to show that:
1. The test fails about once in every N runs, at random, on the unfixed version.
2. The same test passes every time, or at least fails less often, on the fixed version.
You also have to show that the only difference between 1 and 2 is the "fix" itself, not any external or environmental factor.
If you only perform the test on the new, fixed version, it could very well be that the bug was caused by an unrelated environmental factor that simply doesn't exist in your test setup now.
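To make the comparison concrete, here is a minimal Python sketch of running the same check the same number of times against both builds. Everything in it is an assumption for illustration: `unfixed_build` and `fixed_build` are hypothetical stand-ins for whatever actually exercises your two versions, and the 25% failure rate is made up.

```python
import random
from typing import Callable, Dict


def count_failures(run_test: Callable[[], bool], runs: int) -> int:
    """Run the test `runs` times and count how often it reports a failure (False)."""
    return sum(1 for _ in range(runs) if not run_test())


def compare_versions(unfixed: Callable[[], bool],
                     fixed: Callable[[], bool],
                     runs: int) -> Dict[str, int]:
    """Run the same number of trials against both versions, in the same session."""
    return {
        "unfixed": count_failures(unfixed, runs),
        "fixed": count_failures(fixed, runs),
    }


# Hypothetical stand-ins: replace these with calls into the real builds.
rng = random.Random(0)  # seeded only so the sketch is repeatable


def unfixed_build() -> bool:
    return rng.random() >= 0.25  # pretend the bug shows up about 1 run in 4


def fixed_build() -> bool:
    return True  # pretend the fix works


result = compare_versions(unfixed_build, fixed_build, runs=200)
print(result)
```

If the unfixed build stops failing in this side-by-side run, that points at the environment rather than the fix.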
answered May 29 at 1:12
Double Vision Stout Fat Heavy
9
+1 for the principle of verifying that the old version still fails. It's way too easy for random bugs to randomly go away for days due to some external changes such as network speed.
– jpa
May 29 at 6:53
This is really nasty on less strict platforms like JavaScript. Depending on how fast your computer processes the scripts, race conditions between functions overriding each other can occur, and the code can randomly call a different function depending on the network latency and a particular script's load time.
– Nelson
May 29 at 8:58
I suppose this answer could help you
You need to decide first at what probability you want to "detect" the problem.
This is a nice example of why theoretical knowledge is necessary even for testers.
The simplified version:
- p is the probability of failure on a single try, 1/N in our case
- then the probability of success on a single try is 1-p
- and the probability of n successful tries in a row is (1-p)^n
- so the probability x of seeing at least one failure in n tries is x = 1-(1-p)^n
- extracting n and simplifying a bit, assuming a big enough N, gives: n ≈ −log(1−x)⋅N (*)
(*) "log" here is the natural logarithm, sometimes written (for example on calculators) as ln(x) or loge(x)
edited May 31 at 4:28
answered May 28 at 8:27
Rsf
5
I would replace "probability" with "confidence". How confident do you want to be? Note that to be 100% confident, you would need to execute an infinite number of tests, because -log(1-x) goes to infinity as x goes to 1.
– dzieciou
May 28 at 18:20
1
And all this is based on the assumption that test executions are independent. For instance, that the error does not accumulate and manifest as a result of the accumulation of n executions.
– dzieciou
May 28 at 18:21
This is actually a valid point @Makyen, I edited the answer
– Rsf
May 29 at 8:07
This answer would be greatly improved by defining x, which just shows up in the last bullet point out of nowhere.
– Gregor
May 31 at 1:54
While I agree with the other answers saying "dig deeper", to answer the actual math question in the title:
If the issue occurs completely at random with probability p, then the chance of it occurring at least once in n trials is 1-(1-p)^n. Setting this to x (your confidence that the issue has been fixed) and solving for n gives you
n = log(1-x)/log(1-p)
So for example, if your issue occurs 1 out of 4 times, and you want to be 95% sure it's fixed (meaning you'll incorrectly identify it as fixed 1 out of 20 times!!), then
p = 0.25
x = 0.95
n = log(0.05)/log(0.75) ≈ 10.4
so you'd need to run 11 trials.
The difference between my answer and @emrys57's is that mine assumes you know the probability, while theirs assumes you know some initial sequence of results. Presumably they should both give the same answer with a large enough initial sequence.
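The worked example above can be reproduced with a few lines of Python. This is only a sketch: `trials_needed` is a name made up here, and it assumes the trials are independent with a known, constant p.

```python
import math


def trials_needed(p: float, x: float) -> int:
    """Smallest whole n with 1 - (1 - p)**n >= x, i.e. n = log(1-x)/log(1-p) rounded up."""
    if not (0 < p < 1 and 0 < x < 1):
        raise ValueError("p and x must both lie strictly between 0 and 1")
    return math.ceil(math.log(1 - x) / math.log(1 - p))


n = trials_needed(p=0.25, x=0.95)
chance_of_lucky_streak = (1 - 0.25) ** n  # chance an unfixed bug still passes n times
print(n, round(chance_of_lucky_streak, 4))
```

Rounding up matters: with one trial fewer, the chance of a lucky all-pass streak on a still-broken system is above the 5% you were willing to accept.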
edited May 28 at 20:41
answered May 28 at 20:34
BlueRaja - Danny Pflughoeft
A problem has been observed that sometimes, in testing, produces errors. We don’t actually know the probability that it will produce an error on any given test run, because we can only do a finite number of tests on the broken system. If 1 represents a test pass, and 0 represents a test fail, we might have a sequence of results from multiple tests that looks like this:
011001010011
Representing 6 passes and 6 fails in 12 tests. This gives us a guess Pf of the probability of a failure of a test for the broken system. We assume that this probability isn’t changing with time. However, there will be a large uncertainty in the actual value of Pf since we cannot do very many tests.
Following these tests, we make a repair. We hope that this fixes the problem, but we’re not sure. We make more tests of the repaired system. If we have fixed the problem, the value of Pf measured after the repair should be 0. If we have not fixed the system, it should be the same as the value of Pf before the repair.
If we run some tests after the repair and one fails, we immediately know the repair failed. The question is, if no tests fail after repair, does that mean the repair worked?
Consider the case where, over all tests before and after the repair, we have results
a zeroes (test fails)
b ones (test passes)
of which the last c ones (test passes) came after the repair
The number of ways of arranging all a + b results is
Ntot = (a + b)! /(b! * a!)
In the case where Pf does not change yet all the last c results are ones, we need to arrange b - c ones in the first a + b - c results. The number of different ways of arranging these first a + b - c results is
Nsuc = (a + b - c)! / ( (b - c)! * a! )
These patterns are those from all the Ntot possible patterns where the last c results are all ones.
If the change didn’t actually repair the system, but really left its state the same, the results we observe are a random collection of zeroes and ones produced by chance, depending on the probability Pf that a single experiment will return a zero rather than a one. Given this, the chance that we will observe a pattern of a zeroes and b ones with the last c results all ones is
Cran = Nsuc/Ntot
Or
Cran = (a + b - c)! * b! * a! / ((a + b)! * (b - c)! * a!)
Or
Cran = (a + b - c)! * b! / ((a + b)! * (b - c)!)
Cran is the chance that we observe c test passes after we make a repair to the system, out of a total of a fails and b passes, but in fact we have changed nothing, and we see a sequence of passes after repair by happenstance.
As an example, with the pattern above before repair and 6 ones (passes) after repair (so a = 6, b = 12, c = 6), Cran is slightly less than 5%. To reach a confidence of 99% that the repair has succeeded, you need 11 passes after repair.
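For anyone who prefers code to a spreadsheet, here is a small Python sketch of the same computation. It writes Cran as a ratio of binomial coefficients, which is algebraically the same as the factorial form above. The function names are made up for this example, and `math.comb` needs Python 3.8 or later.

```python
from math import comb


def chance_by_luck(a: int, b: int, c: int) -> float:
    """Cran = Nsuc / Ntot = C(a+b-c, a) / C(a+b, a).

    a: total fails, b: total passes (including those after the repair),
    c: passes recorded after the repair.
    """
    return comb(a + b - c, a) / comb(a + b, a)


def passes_needed(fails_before: int, passes_before: int, confidence: float) -> int:
    """Smallest number of consecutive post-repair passes with Cran <= 1 - confidence."""
    c = 0
    while chance_by_luck(fails_before, passes_before + c, c) > 1 - confidence:
        c += 1
    return c


# The example pattern: 6 fails and 6 passes before the repair.
cran_after_six = chance_by_luck(a=6, b=12, c=6)
needed_for_99 = passes_needed(fails_before=6, passes_before=6, confidence=0.99)
print(round(cran_after_six, 4), needed_for_99)
```

Note that `passes_needed` simply tries c = 0, 1, 2, ... in turn; the counts involved are small enough that nothing cleverer seems necessary.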
There's a spreadsheet that can be copied to make this computation. I hope I have it all right, it is 15 years since I last worked this out. And, back then, it took me 15 years to first find the answer.
edited May 28 at 17:44
answered May 28 at 17:05
emrys57
Let B mean "broken", F mean "fixed", and E_k mean "an error is observed in k trials" (so ~E_k means no error is observed in k trials). You are saying that P(E_1|B)=1/N; the probability of seeing an error in a single observation, given that it's broken, is 1/N. Now, that itself is likely going to have some uncertainty, since your only way of measuring it will probably be to see how often it fails and make an empirical estimate. However, if we take that as given, then applying Bayes' rule gives us that if our prior probability of being broken is P(B), then the posterior probability P(B|~E_k) is P(~E_k|B)P(B)/P(~E_k).
For P(~E_k|B), assuming the trials are independent, we have P(~E_k|B) = P(~E_1|B)^k = (1-P(E_1|B))^k = (1-1/N)^k = ((N-1)/N)^k. For large N and k, we can approximate that as e^(-k/N).
For P(~E_k), we have P(~E_k) = P(~E_k|B)P(B)+P(~E_k|F)P(F). P(~E_k|F) = 1 (if we've fixed it, we're guaranteed to see no errors), and P(F) is just 1-P(B).
Plugging that all in, we have
P(B|~E_k) ~= e^(-k/N)P(B)/(e^(-k/N)P(B)+1-P(B))
This can be made easier to read by setting X = e^(-k/N), Y = P(B). Then we have
XY/(XY+1-Y)
or
1-(1-Y)/(XY+1-Y)
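Here is a minimal sketch of that posterior in Python. The function and parameter names are my own, and the prior P(B) is something you have to supply yourself; the second helper simply steps k upward until the posterior falls to a target you choose.

```python
import math


def posterior_broken(k: int, n: float, prior_broken: float, exact: bool = True) -> float:
    """P(B | no error in k trials), given P(E_1|B) = 1/n and prior P(B).

    exact=True uses X = (1 - 1/n)^k; exact=False uses the approximation e^(-k/n).
    """
    x = (1 - 1 / n) ** k if exact else math.exp(-k / n)
    y = prior_broken
    return x * y / (x * y + 1 - y)


def trials_needed(n: float, prior_broken: float, target: float) -> int:
    """Smallest k such that P(B | no error in k trials) <= target (exact form)."""
    if not (n > 1 and 0 < prior_broken < 1 and 0 < target < 1):
        raise ValueError("need n > 1 and prior_broken, target strictly between 0 and 1")
    k = 0
    while posterior_broken(k, n, prior_broken) > target:
        k += 1
    return k


# Made-up inputs: fails 1 time in 20 when broken, 50/50 prior, want at most 1% doubt.
k = trials_needed(20, 0.5, 0.01)
print(k, posterior_broken(k, 20, 0.5))
```

Note how strongly the answer depends on the prior: if you were already fairly sure the fix worked, far fewer clean runs are needed to reach the same posterior.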
answered May 28 at 20:56
Acccumulation