Saturday, August 25, 2012

Risk Based Testing, Strategies for Prioritizing Tests against Deadlines

Often, all the activities that precede test execution are delayed. This means testing has to be done under severe pressure. Quitting the job, delaying delivery, or testing badly are all out of the question. The real answer is a prioritization strategy in order to do the best possible job with limited resources.
Which parts of the system require the most attention? There is no unique answer, and decisions about what to test have to be risk-based. There is a trade-off between the resources used in testing and the remaining risk after testing. There are also possibilities for stepwise release. The general strategy is to test some important functions and features that hopefully can be released, while delaying others.
First, one has to test what is most important in the application. This can be determined by looking at the visibility of functions, their frequency of use, and the possible cost of failure. Second, one has to test where the probability of failure is high, i.e. where one is likely to find the most trouble. This can be determined by identifying especially defect-prone areas in the product. Project history gives some indication, and product measures such as complexity give more. Combining both yields a list of areas to test more and areas to test less.
After test execution has started and some defects have been found, these defects may be a basis for refocusing testing. Defects clump together in defect-prone areas, and they are symptoms of typical trouble the developers had. Thus, a defect leads to the conclusion that there are more defects nearby, and that there are more defects of the same kind. During the latter part of test execution, one should therefore focus on areas where defects have already been found, and one should generate more tests aimed at the types of defect detected before.
Disclaimer: The ideas in this paper are not verified for use with safety critical software. Some of the ideas may be useful in that area, but due consideration is necessary. The presented ideas mean that the tester is taking risks, and the risks may or may not materialize in the form of serious failures.
Introduction
The scenario is as follows: You are the test manager. You made a plan and a budget for testing. Your plans were, as far as you know, reasonable and well founded. When the time to execute the tests approaches, the product is not ready, some of your testers are not available, or the budget is simply cut. You can argue against these cuts and ask for more time, but that does not always help. You have to do what you can with a smaller budget and time frame. Resigning is not an option: you have to test the product as well as possible, and you have to make sure it works reasonably well after release. How do you survive?
There are several approaches, using different techniques and attacking different aspects of the testing process. All of them aim at finding as many defects as possible, and as serious defects as possible, before product release. The different chapters of this paper present these approaches. At the end of the paper, some ideas are given that should help prevent the pressured scenario described above.
In this paper we are talking about the higher levels of testing: integration, system and acceptance test. We assume that developers have done some basic level of testing of every program (unit testing). We also assume the programs and their designs have been reviewed in some way. Still, most of the ideas in this paper are applicable if nothing has been done before you take over as the test manager. It is, however, easier if you know some facts from earlier quality control activities such as design and code reviews and unit testing.
1. The bad game
You are in a bad game with a high probability of losing: you will lose either way, by testing badly or by requiring more time to test. After bad testing you will be the scapegoat for lack of quality; after reasonable testing you will be blamed for the late release. A good scenario illustrating the trouble is the Y2K projects. Testing may have been done at the last minute, and the deadline was fixed. In most cases, trouble was found during design or testing, and system owners were glad that the problems were found. In most cases, nothing bad happened after January 1st, 2000. In many cases, managers then concluded that the testing resources had been wasted. But there are options. Throughout this paper I will use Y2K examples to illustrate the major points.
How to get out of the game?
You need a creative solution: you have to change the game. You need to inform management about the impossible task you have, in such a way that they understand it. You need to present alternatives. They need a product going out of the door, but they also need to understand the RISK.
One strategy is to find the right quality level. Not all products need to be free of defects. Not every function needs to work. Sometimes you have considerable room to lower the target quality level, which means you can cut down testing in less important areas.
Another strategy is priority: Test should find the most important defects first. Most important means often "in the most important functions". These functions can be found by analyzing how every function supports the mission, and checking which functions are critical and which are not. You can also test more where you expect more defects. Finding the worst areas in the product soon and testing them more will help you find more defects. If you find too many serious problems, management will often be motivated to postpone the release or give you more time and resources. Most of this paper will be about a combination of most important and worst areas priority.
A third strategy is making testing cheaper in general. One major issue here is automation of test execution. But be cautious: Automation can be expensive, especially if you have never done it before or if you do it wrong! However, experienced companies are able to automate test execution with no overhead compared to manual testing.
A fourth strategy is getting someone else to pay. Typically, this someone else is the customer: you release a lousy product and the customer finds the defects for you. Many companies have applied this. For the customer this game is horrible, as they have no alternative, and it is doubtful whether it is a good strategy for long-term success. A better "someone else" is the developers rather than the testers: you may require the product to fulfill certain entry criteria before you test. Entry criteria can include certain reviews having been done, a minimum level of test coverage in unit testing, and a certain level of reliability. The problem is that you need high-level support in order to be able to enforce this. Entry criteria tend to be skipped if the project comes under pressure and organizational maturity is low.
The last strategy is prevention, but that only pays off in the next project, when you, as the test manager, are involved from the project start on.
2. Understanding necessary quality levels
Software is embedded in the larger, more complex business world. Quality must be considered in that context (8).
The relentless pursuit of quality can dramatically improve the technical characteristics of a software product. In some applications - medical instruments, railway-signaling applications, air-navigation systems, industrial automation, and many defense-related systems - the need to provide a certain level of quality is beyond debate. But is quality really the only or most important framework for strategic decision making in the commercial marketplace?
Quality thinking fails to address many of the fundamental issues that most affect a company's long-term competitive and financial performance. The real issue is which level of quality produces the best financial performance.
You have to be sure which qualities and functions are important. Fewer defects do not always mean more profit! You have to research how quality and financial performance interact. Examples of such approaches include the concept of Return on Quality (ROQ) used in corporations such as AT&T (9). ROQ evaluates prospective quality improvements against their ability to also improve financial performance. Be aware also of approaches like Value Based Management. Avoid fanatically pursuing quality for its own sake.
Thus, more testing is not always needed to ensure product success!
Example from the Y2K problem: It may be acceptable that a product fails to work on February 29, 2000. It may also be acceptable that it sorts records wrong if they are blended with 19xx and 20xx dates. But it may be of immense importance that the product can record and process orders after 1 Jan 2000.
3. Priority in testing the most important and worst parts of the product
Risk is the product of the damage and the probability of that damage occurring. The way to assess risk is outlined in figure 1 below. Risk analysis assesses the damage by looking at consequences during use and usage frequency, and determines the probability of failure by looking at how defects are introduced.
Figure 1: Risk definition and structure
Testing is always a sample. You can never test everything, and you can always find more to test. Thus you will always need to make decisions about what to test and what not to test, what to do more or less. The general goal is to find the worst defects first, the ones that NEED TO BE FIXED BEFORE RELEASE, and to find as many such defects as possible.
This means the defects must be important. The problem with most systematic test methods, like white box testing, or black box methods like equivalence partitioning, boundary value analysis or cause-effect graphing, is that they generate too many test cases, some of which are less important (17). A way to lessen the test load is finding the most important functional areas and product properties. Finding as many defects as possible can be improved by testing more in bad areas of the product. This means you need to know where to expect more defects.
When dealing with all the factors we look at, the result will always be a list of functions and properties with an associated importance. In order to make the final analysis as easy as possible, we express all the factors on a scale from 1 to 5. Five points are given for "most important" or "worst", i.e. generally for something having higher risk that we want to test more; one point is given to less important areas. (Other publications often use weights 1 through 3.)
The details of the computation are given later.
3.1. Determining damage: What is important?
You need to know the possible damage resulting from an area to be tested. This means analyzing the most important areas of the product. In this section, a way to prioritize this is described. The ideas presented here are not the only valid ones. In every product, there may be other factors playing a role, but the factors given here have been valuable in several projects.
Important areas can either be functions or functional groups, or properties such as performance, capacity, security etc. The result of this analysis is a list of functions and properties or combination of both that need attention. I am concentrating here on sorting functions into more or less important areas. The approach, however, is flexible and can accommodate other items.
Major factors include:
  • Critical areas (cost and consequences of failure)
You have to analyze the use of the software within its overall environment. Analyze the ways the software may fail. Find the possible consequences of such failure modes, or at least the worst ones. Take into account redundancy, backup facilities and possible manual check of software output by users, operators or analysts. Software that is directly coupled to a process it controls is more critical than software whose output is manually reviewed before use. If software controls a process, this process itself should be analyzed. The inertia and stability of the process itself may make certain failures less interesting.
Example: The subscriber information system for a Telecom operator may uncouple subscriber lines - for instance if 31-12-99 is used as «indefinite» value for the subscription end date. This is a critical failure. On the other hand, in a report, the year number may be displayed as blanks if it is in 2000, which is a minor nuisance.
Output that is immediately needed during working hours is more critical than output that could be sent hours or days later. On the other hand, if large volumes of data to be sent by mail are wrong, just the cost of re-mailing may be horrible. The damage may be classified into the classes described below, or quantified as a money value, whichever seems better. In systems with large variation in damage it is better to use damage as an absolute money value, and not to classify it into groups.
A possible hierarchy for grouping damage is the following:
A failure would be catastrophic (3)
The problem would cause the computer to stop, maybe even lead to crashes in the environment (stopping the whole country, business or product). Such failures may cause large financial losses or even damage to human life. An example would be the gross uncoupling of all subscribers to the telephone network on a special date.
Failures leading to losing the license, i.e. the authorities closing down the business, are part of this class. Serious legal consequences may also belong here.
The last kind of catastrophic failures is endangering the life of people.
A failure would be damaging (2)
The program may not stop, but data may be lost or corrupted, or functionality may be lost until the program or computer is restarted. An example is equipment that will not work just around midnight on 31 December.
A failure would be hindering (1)
The user is forced into workarounds or more cumbersome actions to reach the same results.
A failure would be annoying (0)
The problem does not affect functionality, but rather makes the product less appealing to the user or customer. However, the customer can live with the problem.
  • Visible areas
The visible areas are areas where many users will experience a failure if something goes wrong. Users include not only the operators sitting at a terminal, but also end users looking at reports, invoices, or the like, or depending on the service delivered by the product that includes the software. A factor to take into account under this heading is also the forgiveness of the users, i.e. their tolerance of problems. This relates to the importance of different qualities, see above.
Software intended for untrained or naive users, especially software intended for use by the general public, needs careful attention to the user interface. Robustness will also be a major concern. Software which directly interacts with hardware, industrial processes, networks etc. will be vulnerable to external effects like hardware failure, noisy data, timing problems etc. This kind of software needs thorough validation, verification and retesting in case of environment changes.
An example for a visible area is the functionality in a phone switch, which makes it possible to make a call. Less visible areas are all the value-added services like call transfer.
One factor in visibility is the possible loss of faith by customers, i.e. longer-term damage that means longer-term loss of business because customers may avoid the company's products.
  • Usage frequency

Damage is dependent on how often a function or feature is used.
Some functions may be used every day, others only a few times. Some functions may be used by many users, some by few. Give priority to the functions used often and heavily. The number of transactions per day may help in finding priorities.
One possibility for leaving out some areas is to cut functionality that is going to be used seldom, i.e. only once per quarter, half-year or year. Such functionality may be tested after release, before its first use. A possible strategy for Y2K testing was to test leap-year functionality in January and February 2000, and then again during December 2000 and in 2004.
Sometimes this analysis is not quite obvious. In process control systems, for example, certain functionality may be invisible from the outside. In modern object oriented systems, there may be a lot of central libraries used everywhere. It may be helpful to analyze the design of the complete system.
A possible hierarchy is outlined here (from (3)):
Unavoidable (3)
An area of the product that most users will come in contact with during an average usage session (e.g. startups, printing, saving).
Frequent (2)
An area of the product that most users will come in contact with eventually, but maybe not during every usage session.
Occasional (1)
An area of the product that an average user may never visit, but that deals with functions a more serious or experienced user will need occasionally.
Rare (0)
An area of the product that most users will never visit, and that is reached only through very uncommon sequences of actions. Critical failures, however, are still of interest.
An alternative method to use for picking important requirements is described in (1).
Importance can be classified using a scale from one to five. However, in some cases this does not sufficiently reflect the real variation. In that case, it is better to use real values, such as the cost of damage and the actual usage frequency.
3.2. Failure probability: What is (presumably) worst
The worst areas are the ones having most defects. The task is to predict where most defects are located. This is done by analyzing probable defect generators. In this section, some of the most important defect generators and symptoms for defect prone areas are presented. There exist many more, and you have to always include local factors in addition to the ones mentioned here.
  • Complex areas
Complexity is maybe the most important defect generator. More than 200 different complexity measures exist, and research into the relation of complexity and defect frequency has been done for more than 20 years. However, no predictive measures have until now been generally validated. Still, most complexity measures may indicate problematic areas. Examples include long modules, many variables in use, complex logic, complex control structure, a large data flow, central placement of functions, a deep inheritance tree, and even subjective complexity as understood by the designers. This means you may do several complexity analyses, based on different aspects of complexity and find different areas of the product that might have problems.
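Some of these complexity aspects can be approximated cheaply with tooling. The following is a minimal sketch (not from the original paper) that ranks source files by one crude proxy, the number of branching constructs per file; the directory name and the "top ten" cut-off are arbitrary examples.

```python
# Rough, illustrative sketch: rank Python source files by a crude complexity
# proxy (number of branching constructs per file). "src" is a placeholder.
import ast
from pathlib import Path

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)

def branch_count(source: str) -> int:
    """Count branching constructs as a very rough complexity indicator."""
    return sum(isinstance(node, BRANCH_NODES) for node in ast.walk(ast.parse(source)))

def rank_by_complexity(root: str) -> list[tuple[str, int]]:
    scores = []
    for path in Path(root).rglob("*.py"):
        try:
            scores.append((str(path), branch_count(path.read_text())))
        except SyntaxError:
            continue  # skip files that do not parse
    return sorted(scores, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for module, score in rank_by_complexity("src")[:10]:
        print(f"{score:4d}  {module}")
```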
  • Changed areas
Change is an important defect generator (13). One reason is that changes are subjectively understood as easy, and thus not analyzed thoroughly for their impact. Another reason is that changes are done under time pressure and the analysis is not completed. The result is side-effects. Advocates for modern system design methods, such as the Cleanroom process, argue that debugging during unit test does more harm than good to quality, because the changes introduce more defects than they repair.
In general, there should exist a protocol of the changes done. This is part of the configuration management system (if something like that exists). You may sort the changes by functional area, or by some other criterion, and find the areas which have had exceptionally many changes. These areas may either have had a bad design from the start, or have ended up with a bad design after the original design was eroded by the many changes.
Many changes are also a symptom of badly done analysis (5). Thus, heavily changed areas may not correspond to user expectations.
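If the change protocol lives in a version control system, a rough change count per area can be extracted automatically. A minimal sketch, assuming a Git repository and treating the first path segment of each changed file as its "functional area" (both are assumptions for illustration):

```python
# Illustrative sketch: count recorded changes per area from Git history,
# assuming the top-level directory of a file roughly maps to a functional area.
import subprocess
from collections import Counter

def changes_per_area(repo: str = ".") -> Counter:
    log = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    areas = Counter()
    for line in log.splitlines():
        if line.strip():
            areas[line.split("/")[0]] += 1  # crude mapping: first path segment
    return areas

if __name__ == "__main__":
    for area, count in changes_per_area().most_common(10):
        print(f"{count:5d}  {area}")
```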
  • Impact of new technology, solutions, methods
Programmers using new tools, methods and technology experience a learning curve. In the beginning, they may generate many more faults than later. Tools include CASE tools, which may be new in the company, or new in the market and more or less unstable. Another issue is the programming language, which may be new to the programmers, or Graphical User Interface libraries. Any new tool or technique may give trouble. A good example is the first project with a new type of user interface. The general functionality may work well, but the user interface subsystem may be full of trouble.
Another factor to consider is the maturity of methods and models. Maturity means the strength of the theoretical basis or the empirical evidence. If software uses established methods, like finite state machines, grammars, relational data models, and the problem to be solved may be expressed suitably by such models, the software can be expected to be quite reliable. On the other hand, if methods or models of a new and unproven kind, or near the state of the art are used, the software may be more unreliable.
Most software cost models include factors accommodating the experience of programmers with the methods, tools and technology. This is as important in test planning as it is in cost estimation.
  • Impact of the number of people involved
The idea here is the thousand monkeys' syndrome. The more people are involved in a task, the larger is the overhead for communication and the chance that things go wrong. A small group of highly skilled staff is much more productive than a large group of average qualification. In the COCOMO (10) software cost model, this is the largest factor after software size. Much of its impact can be explained from effort going into detecting and fixing defects.
Areas where relatively many, or less qualified, people have been employed may be singled out for more testing.
Care should be taken in that analysis: Some companies (11) employ their best people in more complex areas, and less qualified people in easy areas. Then, defect density may not reflect the number of people or their qualification.
A typical case is the program developed by lots of hired-in consultants without thorough follow-up. They may work in very different ways. During testing, it may be found that everyone has used a different date format, or a different time window.
  • Impact of turnover
If people quit the job, new people have to learn the design constraints before they are able to continue that job. As not everything may be documented, some constraints may be hidden for the new person, and defects result. Overlap between people may also be less than desirable. In general, areas with turnover will experience more defects than areas where the same group of people has done the whole job.
  • Impact of time pressure
Time pressure leads to people taking short-cuts. People concentrate on getting the job done, and they often try to skip quality control activities, optimistically assuming that everything will go fine. Only in mature organizations does this optimism seem to be kept under control.
Time pressure may also lead to overtime work. It is well known, however, that people lose concentration after prolonged periods of work. This may lead to more defects. Together with short-cuts in applying reviews and inspections, this may lead to extreme defect densities.
Data about time pressure during development can best be found by studying time lists, project meeting minutes, or by interviewing management or programmers.
  • Areas which needed optimizing
The COCOMO cost model mentions shortage of machine and network capacity and memory as one of its cost drivers. The problem is that optimization needs extra design effort, or that it may be done by using less robust design methods. Extra design effort may take resources away from defect removal activities, and less robust design methods may generate more defects.
  • Areas with many defects before
Defect repair leads to changes, which lead to new defects, and defect-prone areas tend to persist. Experience shows that defect-prone areas in a delivered system can be traced back to defect-prone areas in reviews and in unit and subsystem testing. Evidence in studies (5) and (7) shows that modules that had faults in the past are likely to have faults in the future. If defect statistics from design and code reviews and from unit and subsystem testing exist, priorities can be chosen for later test phases.
  • Geographical distribution

If people working together on a project are not co-located, communication will suffer. This is true even at a local level. Here are some ideas that have proven valuable in assessing whether geography may have a detrimental effect on a project:
  • People having their offices in different floors of the same building will not communicate as much as people on the same floor.
  • People sitting more than 25 meters apart may not communicate enough.
  • A common area in the workspace, such as a common printer or coffee machine, improves communication.
  • People sitting in different buildings do not communicate as much as people in the same building.
  • People sitting in different labs communicate less than people in the same lab.
  • People from different countries may have difficulties, both cultural and with the language.
  • If people reside in different time zones, communication will be more difficult. This is a problem in outsourced software development.
In principle, geographical distribution is not dangerous in itself. The danger arises when people separated by a large distance have to communicate, for example because they work on a common part of the system. You have to look for areas where the software structure implies the need for good communication between people, but where these people have geography against them.
  • History of prior use

If many users have used software before, an active user group can be helpful in testing new versions. Beta testing may be possible. For a completely new system, a user group may need to be defined, and prototyping may be applied. Typically, completely new functional areas are most defect-prone because even the requirements are unknown.
  • Local factors
Examples include looking at who did the job, who does not communicate well with whom, who is new to the project, which department has recently been reorganized, which managers are in conflict with each other, where prestige is involved, and many more factors. Only imagination sets the boundaries. The message is: you have to look out for possible local factors beyond the ones discussed here.
  • One general factor to consider
This paper is about high level testing. Developers test before this. It is reasonable to have a look at how developers have tested the software before, and what kind of problems they typically overlook. Analyze the unit test quality. This may lead to a further tailoring of the test case selection methods (17).
Looking at these factors gives an estimate of the relative fault proneness of the areas to be tested. However, using this alone will normally over-value some areas; typically, larger components will be tested too much. Thus, a correction factor should be applied: the functional size of the area to be tested. That is, the total weight of an area will be "fault proneness / functional volume". This factor can be found early from function point analysis, or from counting code lines if they are available.
Figure 2: Failure Probability
What should you do if you do not know anything about the project, and the defect generators above cannot be applied?
You have to run a test. A first rough test should find the defect-prone areas, and the next test then concentrates on them. The first test should cover the whole system but be very shallow: it should only cover typical business scenarios and a few important failure situations, but it should touch all of the system. You can then find where there was the most trouble, and give priority to those areas in the next round of testing. The next round does deep and thorough testing of the prioritized areas.
This two-phase approach can always be applied, in addition to the planning and prioritizing done before testing. Chapter 4 explains more of this.
3.3. How to calculate priority of test areas
The general method is to assign weights, and to calculate a weighted sum for every area of the system. Test where the result is highest!
For every factor chosen, assign a relative weight. You can do this in very elaborate ways, but that takes a lot of time. Most often, three weights are good enough; values may be 1, 3, and 10 (1 for "factor is not very important", 3 for "factor has normal influence", 10 for "factor has very strong influence").
For every factor chosen, you assign a number of points to every product requirement (every function, functional area, or quality characteristic). The more important the requirement is, or the more alarming a defect generator seems to be for the area, the more points. A scale from 1 to 3 or 5 is normally good enough. Assigning the points is done intuitively.
The number of points for a factor is then multiplied by its weight, giving a weighted number of points between 1 and 50. These weighted numbers are summed up separately for damage (impact) and for probability of failure, and the two sums are then combined into a risk figure. Since many intuitive point assignments follow a roughly logarithmic scale (each extra point corresponding to about a factor of ten), the risk calculation should in that case ADD the weighted sums for probability and damage. If the points instead follow a roughly linear scale, the risk calculation should MULTIPLY the probability and damage sums. Users of this method should check which case applies to them! Testing can then be planned by assigning the most tests to the areas with the highest number of points.
An example (functional volume being equal for the different areas):
Area to test                         Business criticality   Visibility   Complexity   Change frequency   RISK
Weight                               3                      10           3            3
Order registration                   2                      4            5            1                  46*18
Invoicing                            4                      5            4            2                  62*18
Order statistics                     2                      1            3            3                  16*18
Management reporting                 2                      1            2            4                  16*18
Performance of order registration    5                      4            1            1                  55*6
Performance of statistics            1                      1            1            1                  13*6
Performance of invoicing             4                      1            1            1                  22*6
The table suggests that the function «invoicing» is the most important to test, followed by «order registration» and the performance of order registration. The factor that has been chosen as the most important is visibility.
Computation is easy, as it can be programmed using a spreadsheet. A more detailed case study is published in (4). A spreadsheet is on http://home.c2i.net/schaefer/testing/riskcalc.hqx (Binhex file, save to disk, decompress, open with Excel)
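The same computation is also easy to sketch in a few lines of code. The following simply restates the numbers from the table above, multiplying the weighted damage and probability points and optionally dividing by functional volume as discussed in section 3.2; it is an illustration, not the author's spreadsheet.

```python
# Sketch of the weighted-risk computation from the table above.
# Damage factors: business criticality (weight 3) and visibility (weight 10).
# Probability factors: complexity (weight 3) and change frequency (weight 3).
WEIGHTS = {"criticality": 3, "visibility": 10, "complexity": 3, "changes": 3}

AREAS = {
    # area: (criticality, visibility, complexity, changes)
    "Order registration": (2, 4, 5, 1),
    "Invoicing": (4, 5, 4, 2),
    "Order statistics": (2, 1, 3, 3),
    "Management reporting": (2, 1, 2, 4),
    "Performance of order registration": (5, 4, 1, 1),
    "Performance of statistics": (1, 1, 1, 1),
    "Performance of invoicing": (4, 1, 1, 1),
}

def risk(criticality, visibility, complexity, changes, volume=1.0):
    damage = criticality * WEIGHTS["criticality"] + visibility * WEIGHTS["visibility"]
    probability = complexity * WEIGHTS["complexity"] + changes * WEIGHTS["changes"]
    # Points are treated here as a linear scale, so damage and probability are
    # multiplied; divide by functional volume so large areas are not over-weighted.
    return damage * probability / volume

for area, points in sorted(AREAS.items(), key=lambda kv: risk(*kv[1]), reverse=True):
    print(f"{risk(*points):6.0f}  {area}")
```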
A word of caution: The assignment of points is intuitive and may be wrong. Thus, the number of points can only be a rough guideline. It should be good enough to distinguish the high-risk areas from the medium and low risk areas. That is its main task. This also means you don't need to be more precise than needed for just this purpose. If more precise test prioritization is necessary, a more quantified approach should be used wherever possible. Especially the possible damage should be used as is, with its absolute values and not a translation to points. An approach is described in (18).
4. Making testing more effective
More effective testing means finding more defects, and more important defects, in the same amount of time.
The strategy to achieve this is to learn from experience and adapt testing.
First, the whole test should be broken into four phases:
  • test preparation
  • pre-test
  • main test
  • after-test.
Test preparation sets up the areas to test, the test cases, test programs, databases and the whole test environment. Setting up the test environment in particular can cause a lot of trouble and delay. It is generally easy to install the program itself and the correct operating system and database system; problems more often occur with the middleware, i.e. the connection between software running on a client and software running on different servers. Care should be taken to specify all aspects of the test environment thoroughly, and dry runs should be held in order to ensure that the test can be run when it is time to do it. In a Y2K project, care was taken to ensure that licenses were in place for machine dates after 1999, and that the licenses allowed resetting of the machine date. Another focus area was ensuring that included third-party software was Y2K compliant.
The pre-test is run after the software under test is installed in the test lab. This test contains just a few test cases running typical day to day usage scenarios. The goal is to test if the software is ready for testing at all, or totally unreliable or incompletely installed. Another goal may be to find some initial quality data, i.e. find some defect prone areas to focus the further test on.
The main test consists of all the pre-planned test cases. They are run, failures are recorded, defects found and repaired, and new installations of the software made in the test lab. Every new installation may include a new pre-test. The main test takes most of the time during a test execution project.
The after-test starts with every new release of the software. This is the phase where optimization should occur. Part of the after-test is regression testing, in order to find possible side-effects of defect repair. But the main part is a shift of focus.
The type of defects may be analyzed. A possible classification is described in (14). In principle, every defect is a symptom of a weakness of some designer, and it should be used to actively search for more defects of the same kind.
Example: In a Y2K project, it was found that sometimes programs would display blank instead of zeroes in the year field in year 2000. A scan for the corresponding wrong code through many other programs produced many more instances of the same problem.
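A minimal sketch of such a scan, assuming the defect type can be expressed as a textual pattern; the pattern below (two-digit year formatting) and the directory name are invented for the example.

```python
# Illustrative sketch: once a defect type is understood, scan the rest of the
# code base for the same pattern. The regular expression is a made-up example
# (two-digit year formatting), and "src" is a placeholder directory.
import re
from pathlib import Path

SUSPECT = re.compile(r"""strftime\(["']%y["']\)""")

def scan(root: str, pattern: re.Pattern = SUSPECT):
    for path in Path(root).rglob("*.py"):
        for number, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                yield path, number, line.strip()

if __name__ == "__main__":
    for path, number, line in scan("src"):
        print(f"{path}:{number}: {line}")
```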
Another approach is to concentrate more tests on the kinds of defects found most often, as these might be more common in the code. The problem, however, is that such defects might appear common simply because the test was designed to find that kind of defect. Careful analysis is needed. Generally, apply an abstraction of every defect found as a checklist for further testing or analysis.
The location of defects may also be used to focus testing. If an area of code has especially many failures, that area is a candidate for even more testing (7, 13). During the analysis, however, care should be taken to ensure that a high number of defects in an area is not simply caused by especially high test coverage in that area.
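A sketch of such a location analysis, assuming failures have been recorded with an area and a severity; the file layout and column names are assumptions for the example.

```python
# Illustrative sketch: tally recorded failures per functional area so that the
# remaining test effort can be refocused. Assumes a defect log exported as CSV
# with columns "area" and "severity" (an invented layout).
import csv
from collections import Counter

def defects_per_area(defect_log: str, minimum_severity: int = 2) -> Counter:
    areas = Counter()
    with open(defect_log, newline="") as handle:
        for row in csv.DictReader(handle):
            if int(row["severity"]) >= minimum_severity:
                areas[row["area"]] += 1
    return areas

if __name__ == "__main__":
    # Caution: areas that were tested more will naturally report more defects.
    for area, count in defects_per_area("defects.csv").most_common():
        print(f"{count:4d}  {area}")
```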
5. Making testing cheaper
A viable strategy for cutting budgets and time usage is to do the work in a more productive and efficient way. This normally involves applying technology. In software, not only technology, but also personnel qualifications seem to be ways to improve efficiency and cut costs. This also applies in testing.
Automation
There exist many test automation tools. Tools catalogues list more tools for every new edition, and the existing tools are more and more powerful while not costing more (12). Automation can probably do most in the area of test running and regression testing. Experience has shown that more test cases can be run for much less money, often less than a third of the resources spent for manual testing. In addition, automated tests often find more defects. This is fine for software quality, but may hit the testers, as the defect repair will delay the project... Still, such tools are not very popular, because they require an investment into training, learning and building an infrastructure at start. Sometimes a lot of money is spent in fighting with the tool. For the productivity improvement, nothing general can be said, as the application of such tools is too dependent on platforms, people and organization. Anecdotal evidence prevails, and for some projects automation has had a great effect.
An area where test is nearly impossible without automation is stress, volume and performance testing. Here, the question is either to do it automatically or not to do it at all.
Test management can also be improved considerably using tools for tracking test cases, functions, defects and their repairs. Such tools are now more and more often coupled to test running automation tools.
In general, automation is interesting for cutting testing budgets. You should, however, make sure you are well organized, and you should keep the cost of startup and tool evaluation outside your project. Tools help only if you have a group of people who already know how to use them effectively and efficiently. Bringing in tools at the last moment has little potential to pay off and can do more harm than good.
The people factor - a few good people versus many who don't know the job
The largest obstacle to an adequate testing staff is ignorance on the part of management. Some of them believe that "development requires brilliance, but anybody can be a tester."
Testing requires skill and knowledge. Without application knowledge, your testers do not know what to look for, and you get shallow test cases which do not find defects. Without knowledge about common errors, the testers do not know how to make good test cases. Good test cases, i.e. test cases that have a high probability of finding errors if there are errors, are also called «destructive test cases». Again, without this knowledge testers do not find defects. Without experience in applying test methods, people will spend a lot of unnecessary time working out all the details in a test plan.
If testing has to be cheap, the best approach is to get a few highly experienced specialists to collect the test candidates, and to have highly skilled testers improvise the test instead of working it out on paper. Skilled people will be able to work from a checklist, and pick equivalence classes, boundary values, and destructive combinations by improvisation. Unskilled people will produce a lot of paper and still end up with a less destructive test. A method for this is called "exploratory testing".
The test people must be at least as smart as the developers, equally good designers, and have an equal understanding of the functionality of the system. One option is to let the Function Design Team Leader become the System Test Team Leader as soon as functional design is complete. Pre-sales, documentation, training, product marketing and/or customer support personnel should also be included in the test team. This provides early knowledge transfer (a win-win for both development and the other organizations) and more resources than are available full-time. Test execution requires lots of bodies that don't need to be there all of the time, but need to keep a critical and informed eye on the software. You probably also need full-time testers, but not as many as you would use in the peak testing period. Full-time test team members are good for test design and execution, but also for building or implementing testing tools and infrastructure during less busy times.
If an improvised test has to be repeated, there is a problem. But modern test automation tools can be run in a capture mode, and the captured test may later be edited for documentation and rerunning purposes.
The message is: get highly qualified people for your test team!
6. Cutting testing work
Another way of cutting costs is to get rid of part of the task. Get someone else to pay for it or cut it out completely!
Who pays for unit testing?
Often, unit testing is done by the programmers and never turns up in any official testing budget. The problem is that unit testing is often not really done. Test coverage tool vendors often report that without their tools, 40 - 50% of the code is never unit tested. Many defects then survive until the later test phases. This means later test phases have to test better, and they are overloaded and delayed by finding all the defects which could have been found earlier.
As a test manager, you should require higher standards for unit testing! This is in line with modern "agile" approaches to software development. Unit tests should also be automated and rerun every time units are changed or integrated.
What about test entry criteria?
The idea is the same as in contracts with external customers: if the supplier does not meet the contract, the supplier gets no acceptance and no money. Problems occur when there is only one supplier and when there is no tradition of requiring quality. Both conditions are true in software. But entry criteria can be applied if the test group is strong enough. Criteria range from the most trivial to the advanced. Here is a small collection of what makes life in testing easier (a sketch of how one such criterion might be checked automatically follows the list):
  • The system delivered to integration or system test is complete
  • It has been run through static analysis and defects are fixed
  • A code review has been done and defects have been corrected
  • Unit testing has been done to the accepted standards (near 100% statement coverage, for example)
  • Any required documentation is delivered and is of a certain quality
  • The units compile and can be installed without trouble
  • The units should have passed some functional test cases (smoke test).
  • Really bad units are sorted out and have been subjected to special treatment like extra reviews, reprogramming etc.
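A minimal sketch of checking the unit test coverage criterion automatically, assuming a Cobertura-style XML report (for example from coverage.py's "coverage xml" command); the file name and threshold are examples.

```python
# Illustrative sketch: gate on one entry criterion, a statement-coverage
# threshold for unit testing. Assumes a Cobertura-style XML report
# (e.g. from "coverage xml"); file name and threshold are examples.
import sys
import xml.etree.ElementTree as ET

THRESHOLD = 0.95  # "near 100% statement coverage" is the stated ambition

def statement_coverage(report: str = "coverage.xml") -> float:
    root = ET.parse(report).getroot()
    return float(root.get("line-rate", 0.0))

if __name__ == "__main__":
    rate = statement_coverage()
    print(f"statement coverage: {rate:.1%}")
    sys.exit(0 if rate >= THRESHOLD else 1)  # non-zero exit blocks the delivery
```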
You will probably not be allowed to require all of these criteria, and you may not be allowed to enforce them. But you may turn projects into a better state over time by applying entry criteria. If every unit is reviewed, statically analyzed and unit tested, you will have far fewer problems to fight with later.
Less documentation
If a test is designed "by the book", it will take a lot of work to document. Not all of this is needed. Tests may be coded in a high-level language and may be self-documenting. A test log made by a test automation tool may do the job. Qualified people may be able to make a good test from checklists, and even repeat it. Check exactly which documentation you will need, and prepare no more. Most important are a test plan with a description of what is critical to test, and a test summary report describing what has been done and the risk of installation.
Cutting installation cost - strategies for defect repair
Every defect delays testing and incurs extra cost. You have to rerun the actual test case, try to reproduce the defect, document as much as you can, probably help the designers with debugging, and in the end install a new version and retest it. This extra cost is impossible for a test manager to control, as it depends completely on system quality, and it is normally not budgeted for either. Still, this cost will occur. Here is some advice on how to keep it low.
When to correct a defect, when not?
Every installation of a defect fix means disruption: installing a new version, initializing it, retesting the fix, and retesting the whole. These tasks can be minimized by installing many fixes at once, which means waiting for defect fixes. On the other hand, if the defect fixes themselves are wrong, this strategy leads to more work in debugging the new version, because the offending fix is not that easy to find. There will be an optimum, dependent on system size, the probability of introducing new defects, and the cost of installation. For a good description of practical test exit criteria, see (2). Here are some rules for optimizing the defect repair work:
Rule 1: Repair only important defects!
Rule 2: Change requests and small defects should be assigned to the next release!
Rule 3: Correct defects in groups! Normally, install a new version only after blocking failures are found.
Rule 4: Use an automated "smoke test" to test any corrections immediately.
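As an illustration of Rule 4, a smoke test can be a handful of automated end-to-end checks run after every new installation; the scenarios below are placeholders, not checks from any real project.

```python
# Illustrative smoke test sketch: a few end-to-end checks run after every new
# installation in the test lab. The check bodies are placeholders; real checks
# would drive the application under test.
import sys

def check_startup() -> bool:
    ...  # e.g. start the application and verify it reaches the main screen
    return True

def check_order_registration() -> bool:
    ...  # e.g. register a minimal order and verify it is stored
    return True

def check_invoicing() -> bool:
    ...  # e.g. produce an invoice for the test order and verify the total
    return True

SMOKE_CHECKS = [check_startup, check_order_registration, check_invoicing]

if __name__ == "__main__":
    failed = [check.__name__ for check in SMOKE_CHECKS if not check()]
    for name in failed:
        print(f"SMOKE FAILURE: {name}")
    sys.exit(1 if failed else 0)  # any failure rejects the correction
```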
7. Strategies for prevention
The starting scenario for this paper is the situation where everything is late and where no professional budgeting has been done. In most organizations, there are no experience data and no serious attempt to really estimate the costs of development, testing, and error cost in maintenance. Without experience data there is no way to argue about the cost of reducing a test.
The imperatives are:
  • You need a cost accounting scheme
  • You need to apply cost estimation based on experience and models
  • You need to know how test quality and maintenance trouble interact
Measure:
  • Size of project in lines of code, function points etc.
  • Percentage of work used in management, development, reviews, test preparation, test execution, and rework
  • Amount of rework during first three or six months after release
  • Fault distribution, especially causes of user detected problems.
  • Argue for testing resources by weighting possible reductions in rework before and after delivery against added testing cost.
Papers showing how such cost and benefit analysis can be done, using retrospective analysis, have been published in several ESSI projects run by Otto Vinter from Bruel&Kjær (6). A different way to prevent trouble is incremental delivery. The general idea is to break up the system into many small releases. The first delivery to the customer is the least commercially acceptable system, namely, a system which does exactly what the old one did, only with new technology. From the test of this first version you can learn about costs, error contents, bad areas etc. and then you have an opportunity to plan better.
8. Summary
Testing in a situation where management cuts both budget and time is a bad game. You have to endure and survive this game and turn it into a success. The general methodology for this situation is not to test everything a little, but to concentrate on high risk areas and the worst areas.
Priority 1: Return the product as fast as possible to the developers, with a list of the most serious deficiencies you can find.
Priority 2: Make sure that, whenever you stop testing, you have done the best testing in the time available!
References
(1) Joachim Karlsson & Kevin Ryan, "A Cost-Value Approach for Prioritizing Requirements", IEEE Software, Sept. 1997
(2) James Bach, "Good Enough Quality: Beyond the Buzzword", IEEE Computer, Aug. 1997, pp. 96-98
(3) Risk-Based Testing, STLabs Report, vol. 3 no. 5 (info@stlabs.com)
(4) Ståle Amland, "Risk Based Testing of a Large Financial Application", Proceedings of the 14th International Conference and Exposition on TESTING Computer Software, June 16-19, 1997, Washington, D.C., USA.
(5) Taghi M. Khoshgoftaar, Edward B. Allen, Robert Halstead, Gary P. Trio, Ronald M. Flass, "Using Process History to Predict Software Quality", IEEE Computer, April 1998
(6) Several ESSI projects, about improving testing, and improving requirements quality, have been run by Otto Vinter. Contact the author at otv@delta.dk.
(7) Ytzhak Levendel, "Improving Quality with a Manufacturing Process", IEEE Software, March 1991.
(8) "When the pursuit of quality destroys value", by John Favaro, Testing Techniques Newsletter, May-June 1996.
(9) "Quality: How to Make It Pay," Business Week, August 8, 1994
(10) Barry W. Boehm, Software Engineering Economics, Prentice Hall, 1981
(11) Magne Jørgensen, 1994, "Empirical studies of software maintenance", Thesis for the Dr. Scient. degree, Research Report 188, University of Oslo.
(12) Lots of test tool catalogues exist. The easiest accessible key is the Test Tool FAQ list, published regularly on Usenet newsgroup comp.software.testing. More links on the author’s Web site.
(13) T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, R. M. Flass, "Using Process History to Predict Software Quality", IEEE Computer, April 1998
(14) IEEE Standard 1044, A Standard Classification of Software Anomalies, IEEE Computer Society.
(15) James Bach, «A framework for good enough testing», IEEE Computer Magazine, October 1998
(16) James Bach, "Risk Based Testing", STQE Magazine,6/1999, www.stqemagazine.com
(17) Nathan Petschenik, "Practical Priorities in System Testing", in "Software- State of the Art" by DeMarco and Lister (ed), Sept. 1985, pp.18 ff
(18) Heinrich Schettler, "Precision Testing: Risikomodell Funktionstest" (in German), to be published.