The third stage draws on the emerging test automation technologies and methods that the respondents to the WQR 2016-17 survey [1] foresaw using in the coming years: 

  • Test design automation - the ability to generate automatically synthetic test cases, with high fault detection potential (efficacy), high fault detection rate (efficiency), and high coverage.
  • Robotic automation - the ability of the test automation engine to perform autonomously the test tasks humans assign and control. Once configured, the test engine can be left to carry out the work by itself, without humans in the loop.
  • Cognitive automation -  the combination of intelligence-led and knowledge-based automatic search, detection, and diagnosis of failures. 
  • Test data automation - the ability to generate and manage the executable format of the test cases automatically.
  • Machine learning - machine learning is everywhere, but its application to the testing challenge is not precisely defined: it could be used to learn data for testing purposes from the development and the operation processes. 
  • Predictive analysis - predicting where and when the failures are the most likely to happen.
  • Self-remediation - the ability of an automated ops manager to detect and diagnose failures at production time, isolate the faulty component, receive and validate the system upgrade, and install it without service interruption.
  • Test environment virtualization - the ability to allocate/deallocate and scale up and down the computing resources needed by the system under test and the test system.

We now move on to the fourth stage and return to the first question.

Stage 4. What is test automation?

Recently, analysts and professionals have 

  • written about “the broken promise of test automation” [2], 
  • asked questions such as: “If automation worked so well, we must question why it is used so little” [3],
  • reported that 38% of the respondents of a LogiGear survey rolled out a software automation testing framework that failed [6], 
  • concluded that “automation is currently under-exploited in QA and testing” [4].

Note that analysts and professionals have been talking about test automation for at least 15 years. So, what's wrong and what's going on with test automation? Test automation is made of 'test' and 'automation.' So, what are the tasks that carry out the testing job? Among them, what are those that it is possible to automate (we consider that all are worthy of being automated)? What is the exact meaning of test task automation? Can we establish a hierarchy of test automation levels? What tasks can attain what levels? 

If we examine the test job, even superficially, a distinction appears between clerical, cognitive, and creative tasks. Generally speaking, a task, however complicated to carry out, for which there exists a set of prescriptions and rules that can be applied mechanically without any specific knowledge or skill can be considered a clerical task. A task for which such a set doesn't exist or exists only partially, and that needs a dose of problem-solving ability to be carried out, is a creative task. A task is cognitive if it requires knowledge and skills to be carried out. Clerical tasks, even the most complicated ones, are certainly easier to automate than cognitive and creative tasks. For clerical tasks, we consider that the term ‘mechanization’ is more appropriate than ‘automation.’ In fact, the tools currently available in the QA & Testing market implement the bare mechanization of the clerical test tasks, such as:

  • the production of message templates from a machine-readable specification of the interface,
  • the batch transmission of coded test inputs to an atomic SUT, 
  • the collection and logging of the SUT outcomes,
  • the creation of skeletons of virtual, simulated components connected upstream and downstream from the SUT, for integration test in isolation.
Note that the production of the message contents and the implementation of the upstream and downstream components' behaviors are not clerical tasks. Conversely:
  • the design and coding of test cases,
  • the implementation and configuration of the test system components and their binding to the system under test,
  • the arbitration of the SUT outcomes and the drafting of the test reports,
  • the prioritization of the test case runs and the planning and monitoring of test sessions,
are cognitive and creative tasks and require knowledge of the system under test, skills in test methods and tools, and problem-solving capabilities.

The design and coding of test cases

Nowadays, the use of production data for testing is widespread. The enabling condition for using production data for testing is the availability of the interaction graph traces from your production system. In other words, you must instrument the production system to trace exchanges in such a way that they can be organized in directed acyclic graphs that represent the explicit causal relationships, within the SUT components, between received and sent messages. This effort is relatively easy to accomplish with an atomic system but tricky with a distributed one. 

It is easy to recognize that there is a fundamental difference between testing an atomic system taken in isolation and testing a distributed system.

Testing an atomic system in isolation is both white-box testing - observing and evaluating the execution paths of the internal code, if the source code is available - and black-box testing - observing and evaluating its behavior at its interfaces. 

Testing a distributed system is gray-box testing - observing and evaluating at the same time the behavior of all the components of the distributed system at their interfaces. 

Some people consider that when you have adequately tested all the SUT components in isolation, with the help of simulated downstream and upstream components (see below), gray-box testing of the distributed system is superfluous. This position is fundamentally flawed. 

It is now generally accepted that a distributed system is made of loosely coupled components interacting via messages sent at the interface endpoints in compliance with exchange protocols. The important point is "loosely coupled": unintended tight coupling between the components of the distributed system cannot be discovered without gray-box testing (this point will be clearer below). Because gray-box testing is difficult to carry out, some consider that it is useless.

When you get your production system traces, you have to arbitrate them, that is to say, evaluate whether they are compliant with the expected behavior. Afterwards, among the traced interaction graphs that are functionally compliant, you select some of them according to some criteria (see below). Then you identify the stimuli, namely the first interactions that trigger each selected interaction graph. These stimuli are your test inputs, and the other interactions of the traced graph are the representation of the correct system 'response' (remember that you have selected the interaction graphs that are compliant with the system expected behavior). The stimulus and its response constitute a test case. The job is not over: when testing distributed stateful systems, for each chosen interaction path, you also have to record and include in the test case the initial states of the SUT components in which the interaction path occurs. So your production system has to log the initial state records for each trace. In practice, you have to systematically log the states before the operations invoked in the interaction graph occur. Moreover, you have to correlate the state records to the interaction path traces. The enterprise is intricate: to sum up, you can use production data for testing if the production system has been designed and implemented in compliance with this requirement. 
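The identification of the stimulus in a traced interaction graph can be sketched as follows. This is a minimal illustration, not a description of any existing tool: the `Interaction` record, its `caused_by` field, and the sample trace are hypothetical, and the sketch assumes that the production system already records, for each sent message, the received messages that caused it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One traced message exchange (hypothetical trace record)."""
    msg_id: str
    sender: str
    receiver: str
    payload: str
    caused_by: tuple = ()  # ids of the traced messages that caused this one


def split_stimulus_and_response(trace):
    """Split one traced interaction graph into (stimulus, response).

    The stimulus is the only interaction that has no cause inside the trace;
    all the other interactions form the expected response.
    """
    ids = {i.msg_id for i in trace}
    roots = [i for i in trace if not any(c in ids for c in i.caused_by)]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one stimulus, found {len(roots)}")
    stimulus = roots[0]
    response = [i for i in trace if i is not stimulus]
    return stimulus, response


# Hypothetical trace of a balance query; records may arrive out of order.
trace = [
    Interaction("m2", "front", "accounts", "getBalance(42)", caused_by=("m1",)),
    Interaction("m1", "client", "front", "GET /balance/42"),
    Interaction("m3", "accounts", "front", "balance=100", caused_by=("m2",)),
    Interaction("m4", "front", "client", "200 OK 100", caused_by=("m3",)),
]
stimulus, response = split_stimulus_and_response(trace)
print(stimulus.msg_id, [i.msg_id for i in response])
```

The sketch deliberately ignores the state records: correlating them with the trace is the hard part discussed above.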

Suppose that you got your test case suite. When you want to run it, first you have to install and configure a copy of the system (the system under test) in a staging environment - generally speaking an upgraded version of the system in production. For each test case you want to run, you have to configure the SUT component initial states, inject the test input into the SUT, and compare the resulting trace with the SUT response in the test case. 
Note that this approach works for stateless transactions on stateful systems (for instance, queries on the state of a bank account). For transactions that modify the SUT components' states (such as a withdrawal from a bank account), it would be better to also record the SUT components' states after the transaction and include these records in the test cases. A test case becomes a quadruple (stimulus, SUT response, SUT initial state, SUT final state) that allows a double check of the response and of the final state. This approach reduces the risk of false negatives (the bank account system replying that the withdrawal has been carried out when it hasn't) and quasi-false positives (the bank system replying that the withdrawal has not occurred when it has). 
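As an illustration of the quadruple, here is a minimal sketch in which the verdict depends on both the response and the final state. The `RecordedCase` type, the `ToyBank` system, and its `load_state`/`dump_state` operations are assumptions made for the example; a real SUT may not expose its state this easily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedCase:
    """Test case as a quadruple (all field names are illustrative)."""
    stimulus: tuple            # (operation, account, amount)
    expected_response: str
    initial_state: dict
    expected_final_state: dict


class ToyBank:
    """Hypothetical stateful SUT, used only for illustration."""

    def __init__(self):
        self.balances = {}

    def load_state(self, state):
        self.balances = dict(state)

    def dump_state(self):
        return dict(self.balances)

    def handle(self, stimulus):
        operation, account, amount = stimulus
        if operation == "withdraw" and self.balances.get(account, 0) >= amount:
            self.balances[account] -= amount
            return "OK"
        return "REFUSED"


def run(case, sut):
    """Configure the initial state, inject the stimulus, check response AND final state."""
    sut.load_state(case.initial_state)
    response = sut.handle(case.stimulus)
    response_ok = response == case.expected_response
    state_ok = sut.dump_state() == case.expected_final_state
    return "pass" if response_ok and state_ok else "fail"


case = RecordedCase(("withdraw", "acc-1", 30), "OK", {"acc-1": 100}, {"acc-1": 70})
print(run(case, ToyBank()))
```

A system that answers "OK" without actually debiting the account would pass a response-only check but fail this one, which is the false negative described above.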
The test suite thus constituted can be used for regression test on system upgrades that do not modify the system function. Remember that regression test is intended to check that an implementation fix does not produce undesirable side effects - possibly ‘far’ from the fix, in a component of a distributed system not directly connected. 

The constitution of a test suite from production traces and logs is not straightforward, but it avoids the explicit design and coding of 'artificial' test cases from scratch, which is a complicated endeavor. It can be applied reasonably to the test of an atomic system, where traces and state records can be constituted without much effort. Conversely, the test of distributed systems is practically out of reach of the currently available tools and technology.  
The bad news is that, in any case, businesses and administrations are forced to abandon production data for testing now, and this for two reasons. 

The first reason is that European citizens' data will be shielded by the EU General Data Protection Regulation (GDPR) from the 25th of May 2018, and fines for failing to keep them appropriately secure are severe. The overwhelming majority of business applications and services of European and non-European businesses deal with these sensitive data. Unfortunately, the mere anonymization (masking names and identifiers) of personal data does not protect them adequately. According to a well-known paper, 87% of the population in the United States has reported characteristics that likely make them unique based only on (i) 5-digit ZIP code, (ii) gender, and (iii) date of birth [5]. Masking names and identifiers (such as the social security number) is useless for anonymizing the vast majority of individuals, who are fully identifiable by a set of demographics and correlated personal data that does not include names and identifiers. 
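The weakness of masking can be checked on any data set by counting how many records remain unique on the remaining attributes. The following sketch assumes a list of already masked records with hypothetical `zip`, `gender`, and `dob` fields; the sample values are invented for illustration and are not taken from [5].

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records that are the only one with their combination of quasi-identifiers."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Invented records: names and identifiers are already masked (absent).
masked = [
    {"zip": "02139", "gender": "F", "dob": "1961-07-31"},
    {"zip": "02139", "gender": "M", "dob": "1961-07-31"},
    {"zip": "02139", "gender": "F", "dob": "1975-02-11"},
    {"zip": "02140", "gender": "M", "dob": "1980-12-01"},
    {"zip": "02140", "gender": "M", "dob": "1980-12-01"},
]
print(unique_fraction(masked, ("zip", "gender", "dob")))  # 3 of the 5 records are unique
```

Every record counted as unique can potentially be re-identified by joining it with another data set that carries the same three attributes together with a name.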

Some US companies supply tools equipped with data analysis and machine learning techniques to accelerate and partially automate the process of deriving test cases from production journals. These tools are not able to cope with the central problem of data protection. It is obviously interesting to trace and analyze the usage patterns of a system in the production environment, but not for spreading protected data. 

The second reason is that testing with production data lacks effectiveness, efficiency, and coverage. Effectiveness is the ability to discover defects. Efficiency is the ability to discover defects early in the test session. Coverage is the ability to cover many, ideally all, of the interaction paths. In fact, the test cases that reproduce the most frequently used paths in production exhibit reduced fault detection potential and limited coverage. Software is not hardware: the most frequently used interaction paths are probably the most reliable ones and become more reliable the more they are used, because their defects have already been detected and corrected, and new ones are fixed as soon as they appear. Errors often reside, sometimes with catastrophic consequences, in the least visited interaction paths. Above all, test cases derived from production data do not have clear objectives of fault discovery, detection, and diagnosis.

To sum up, because it is difficult or even impossible to guarantee the data protection, and testing with production data is not effective, efficient and adequate regarding coverage, the only sustainable alternative is the synthesis of focused, artificial test cases. 

In the synthesis of artificial test inputs, there are two different steps: (i) their design, and (ii) their coding in an executable format. The tools currently available in the Testing and QA market can at most supply message skeletons, and the tester is in charge of designing the content of the input, the expected interaction graph triggered by the input, and the initial and possibly final states of a distributed system. Then, the tester must code the test input in its executable format, configure the initial state of the system under test, transmit the test input, and arbitrate the traces of the system call-graph and, possibly, the final states of the system. The approach of manual synthesis of artificial test cases is called script-based testing.

The design of artificial test inputs is a cognitive and creative endeavor that is difficult to automate. Moreover, if the design result is expressed in a machine-readable format, the coding of the designed test input in the executable format could be realized mechanically. Otherwise, if the design result is informal, the coding task cannot really be distinguished from the design. Why is designing artificial test inputs a cognitive task? Because it requires an in-depth knowledge of the business logic and the technical implementation of the SUT.

Why is it a creative task? Because testing is looking for, detecting, and diagnosing faults, which is an uncertain enterprise. Consider functional testing, i.e., testing the compliance of the system behavior with the system functional specifications. The most exhaustive functional testing campaign does not prove the absence of functional defects, it shows only their presence when it discovers failures. The number of possible test inputs being practically unlimited, test design objectives are: (i) the highest test effectiveness rate (the relative amount of the designed test inputs that reveal a new fault with respect to the totality of test inputs), and (ii) the highest test coverage (the number of different system behavior patterns that are triggered by the designed test input suite). 
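The two design objectives can be made concrete with a small computation. This is only a sketch under stated assumptions: it supposes that, after a test session, we know for each input which faults it revealed and which behavior patterns it triggered; the function names and the sample data are illustrative.

```python
def effectiveness_rate(revealed_faults_per_input):
    """Share of the test inputs that reveal at least one fault not revealed by an earlier input."""
    seen = set()
    revealing_inputs = 0
    for faults in revealed_faults_per_input:
        if faults - seen:          # this input reveals something new
            revealing_inputs += 1
        seen |= faults
    return revealing_inputs / len(revealed_faults_per_input)


def coverage(triggered_patterns_per_input):
    """Number of different behavior patterns triggered by the whole suite."""
    return len(set().union(*triggered_patterns_per_input))


# Invented session: four inputs, run in this order.
revealed = [{"F1"}, set(), {"F1"}, {"F2", "F3"}]
patterns = [{"p1", "p2"}, {"p2"}, {"p3"}, {"p1"}]
print(effectiveness_rate(revealed), coverage(patterns))
```

In this invented session only the first and the fourth inputs reveal a new fault, so half of the design effort produced no new information.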

Script-based testing raises several problems. First of all, we have already argued that it requires from the tester an in-depth knowledge of the business logic and the technical implementation of the SUT. Furthermore, it is a code-intensive activity (coding the inputs, configuring and initializing the system state) whose result is cumbersome to maintain when developers are actively working on the application. Last but not least, it is an error-prone endeavor that can produce a frustrating number of false positives and negatives as the application inevitably continues to change. In fact, application complexity increases faster than test teams and tools can keep up, and testing is the number one challenge when putting in place a real continuous DevOps process.

Because test design and coding are time-consuming and expensive, the number of produced test cases is perforce small. The smaller the test suite, the more challenging the test suite quality regarding effectiveness (the fault discovery potential), coverage, and correctness (absence of false positives and negatives). The tester must have not only knowledge of the business logic and the technical implementation of the system under test, but also sophisticated skills in testing methods and techniques, to design and code a small but high-quality and focused test suite.

The configuration and implementation of the test system components and their binding to the system under test

Test run requires the configuration of what is called a test harness, namely the collection of the test system components that are (i) able to interact with the SUT when running a test case, and (ii) coordinated by a test system monitor that pilots the test case run.

A test harness for a terminal atomic system (not having interactions with downstream services) is relatively easy to put in place with the test drivers available in the QA & Testing tool market. A test harness for a non-terminal atomic system is more complicated to set up, and some tools propose virtualization techniques for the downstream services that allow testing the system in isolation. A test harness for gray-box testing of a distributed system has to put in place probes on the exchange channels between the SUT components, for which there is little or no support in the QA & Testing market. 

An ideal test harness proposes three kinds of functional test components:

  • drivers - that implement ‘clients’ of the SUT,
  • stubs - that simulate downstream services connected to the SUT,
  • intercepting proxies - that implement probes on the communication channel between SUT components.

A driver is a test component that can send the test inputs to the SUT and, if the interaction is two-way, collect the SUT reply. A smart driver arbitrates the SUT reply (see below). A stub is a test component that can receive a message from the SUT and, if the interaction is two-way, send back to the SUT a stereotyped reply. A smart stub can arbitrate the received message. Note that a stub must configure a port to expose an interface endpoint, and the port must be reachable by the SUT.

Drivers and stubs, when correctly implemented, allow the integration test of an atomic component of a distributed system when the other components are not or only partially available. Smart drivers and stubs enable the automation of the test run. In this sense, these test components are 'virtual' or fake components. Note that the use of the 'virtualization' term in this context is essentially different from its use in the operating system and cloud computing domains. Some of the tools available in the QA and Testing market exhibit 'virtualization' functions, that consist of the automatic set up of driver and stub skeletons. The tester must fill these skeletons with the code that engenders the appropriate test component behavior. The development of this code is not a clerical task, especially if it is able to arbitrate the received message (smart test component).
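The roles of a smart driver and a smart stub can be sketched in a few lines. To keep the example self-contained, the components are wired in-process as plain callables instead of network endpoints, so the port configuration mentioned above is left out; the `SmartDriver` and `SmartStub` classes and the price service are hypothetical.

```python
class SmartStub:
    """Simulates a downstream service: records each message, arbitrates it, sends a canned reply."""

    def __init__(self, canned_reply, expected_message=None):
        self.canned_reply = canned_reply
        self.expected_message = expected_message
        self.received = []
        self.verdicts = []

    def __call__(self, message):
        self.received.append(message)
        if self.expected_message is not None:
            self.verdicts.append("pass" if message == self.expected_message else "fail")
        return self.canned_reply


class SmartDriver:
    """Plays a client of the SUT: sends the test input and arbitrates the reply."""

    def __init__(self, sut):
        self.sut = sut

    def run(self, test_input, expected_reply):
        reply = self.sut(test_input)
        return "pass" if reply == expected_reply else "fail"


def make_price_service(tax_service):
    """Hypothetical non-terminal SUT: asks a downstream service for the tax rate."""
    def price_service(request):
        rate = tax_service({"op": "getRate", "country": request["country"]})
        return {"total": round(request["net"] * (1 + rate["rate"]), 2)}
    return price_service


stub = SmartStub({"rate": 0.2}, expected_message={"op": "getRate", "country": "FR"})
driver = SmartDriver(make_price_service(stub))
print(driver.run({"net": 100.0, "country": "FR"}, {"total": 120.0}), stub.verdicts)
```

Even in this toy setting, the expected message and the canned reply have to be designed by someone who knows the SUT, which is why filling the skeletons is not a clerical task.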

Intercepting proxies are test components needed for end-to-end gray-box testing of distributed systems. In fact, gray-box testing is stimulating the behavior of the distributed SUT by sending a message to one of its components and observing and evaluating the message exchanges between these components and with the environment that are triggered by the stimulus. 

An intercepting proxy can observe an exchange between two distributed SUT components by capturing the message sent by the initiator of the exchange and, in a two-way interaction, the addressee's reply. A smart intercepting proxy can arbitrate the message and its possible response.
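A minimal sketch of an intercepting proxy follows. As an assumption made for brevity, the channel between the two SUT components is modeled as an in-process call rather than a network connection; the `InterceptingProxy` class and the `orders` and `inventory` components are hypothetical.

```python
class InterceptingProxy:
    """Sits on the channel between two components and logs every exchange it forwards."""

    def __init__(self, initiator, addressee, target, log):
        self.initiator = initiator
        self.addressee = addressee
        self.target = target    # the real addressee component
        self.log = log          # shared trace, one entry per observed message

    def __call__(self, message):
        self.log.append((self.initiator, self.addressee, "request", message))
        reply = self.target(message)   # forward unchanged
        self.log.append((self.addressee, self.initiator, "reply", reply))
        return reply


def inventory(message):
    """Hypothetical downstream SUT component."""
    return {"available": message["qty"] <= 5}


def make_orders(inventory_channel):
    """Hypothetical upstream SUT component that calls the inventory."""
    def orders(message):
        check = inventory_channel({"op": "reserve", "qty": message["qty"]})
        return {"status": "accepted" if check["available"] else "rejected"}
    return orders


log = []
orders = make_orders(InterceptingProxy("orders", "inventory", inventory, log))
reply = orders({"op": "order", "qty": 3})
print(reply, len(log))
```

The log produced by the proxies is the raw material of gray-box arbitration: it exposes the internal exchange that a driver placed in front of `orders` could never see.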

Nowadays, tools that support gray-box test of distributed systems are not proposed in the QA and Testing market. 

The implementation and the configuration of the test harness can be considered a cognitive and creative task because it requires knowledge of the SUT topology, business logic, and technical implementation. Even with the support of a testing tool with virtualization functions, and even without considering test arbitration, it is a code-intensive endeavor that demands sharp development skills. 

The arbitration of the SUT outcomes and the drafting of the test reports

The arbitration of the SUT ‘response’ is the evaluation of the SUT actual behavior, the production of the evaluation results, called the test verdicts, and their compilation in a compact report. 

Research on testing uses the metaphor of the oracle to state the problem. The oracle is an ideal agent which, interrogated about the SUT behavior, answers with a motivated verdict. 

Within black-box and gray-box testing, the behavior of the SUT, atomic or distributed, is observed and evaluated at its interfaces. A complete evaluation process shall cover all the aspects of this behavior: 

  • interlocution aspect,
  • syntactic aspect, 
  • protocol aspect, 
  • semantic aspect, 
  • pragmatic aspect, 
  • temporal aspect. 

The interlocution analysis aims at establishing that the intended addresser has sent the message that reaches the interface endpoint. This aspect must be taken into account within the gray-box test of distributed systems as a plain functional concern - that the identity of the actual addresser is the expected one. Note that, to limit the complexity of the argumentation, we do not consider the security concerns about the addresser identification, authentication (is the addresser who it claims to be?) and authorization (is the addresser authorized to invoke the operation conveyed by the message?). Nor do we consider the test of the provisions, if any, for the other security aspects of the interaction, such as:

  • the confidentiality of the message,
  • the integrity of the message against malicious or accidental modifications, and
  • the non-repudiation of the message reception.

These additional aspects must be taken into account when testing the correct implementation of strong security requirements. Likewise, other aspects related to message reliability must be considered when testing the correct implementation of strict robustness requirements.

The syntactic analysis evaluates the well-formedness of the message, namely its compliance with the expected schema and format. The protocol analysis tries to ascertain whether the message type (i.e., the type of the operation whose invocation is conveyed by the message) is compliant with the expected one. The semantic analysis seeks to determine whether the message content matches the expected one. The pragmatic analysis attempts to decide whether the effects of the execution of the operation invoked by the message are compliant with the expected ones. Note that this analysis is problematic within black-box and gray-box testing, because the internal states of an atomic SUT or of the components of a distributed SUT are, in principle, inaccessible. The temporal analysis aims at evaluating the temporal relationship between the moment of the message reception (if it occurs) and its expected reception time window. Note that the timeout is an ambiguous signal - the time window has expired without reception of the message - that means either that the message will never come or that it merely hasn't arrived yet. Nevertheless, the definition and monitoring of a time window for each message to be received from the SUT, even if not stated in the system specifications, is necessary for the continuity of the test session.

The arbitration of a test run shall take into account, for each outcome that is produced by the SUT in the course of the test run, all the aspects cited above. If we use the oracle metaphor, the oracle shall be able to answer interrogations about all these mentioned aspects.
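To show what an aspect-by-aspect verdict could look like, here is a minimal sketch of an automated arbiter for a single message. It rests on several assumptions that are not in the text above: messages are JSON objects with `operation` and `content` fields, the observation carries the sender identity and the elapsed time, and the pragmatic aspect is left out because the SUT states are supposed to be inaccessible.

```python
import json

REQUIRED_FIELDS = {"operation", "content"}  # assumed message schema


def verdict(ok):
    return "pass" if ok else "fail"


def arbitrate(observed, expected):
    """Return one verdict per aspect for a single observed message.

    observed: None on timeout, else {"sender", "raw", "elapsed_s"}
    expected: {"sender", "operation", "content", "window_s"}
    """
    if observed is None:
        # Timeout: nothing can be said about the other aspects.
        verdicts = dict.fromkeys(
            ("interlocution", "syntactic", "protocol", "semantic"), "inconclusive")
        verdicts["temporal"] = "fail"
        return verdicts

    verdicts = {"interlocution": verdict(observed["sender"] == expected["sender"])}
    try:
        body = json.loads(observed["raw"])
    except json.JSONDecodeError:
        body = None
    well_formed = isinstance(body, dict) and REQUIRED_FIELDS.issubset(body)
    verdicts["syntactic"] = verdict(well_formed)
    if well_formed:
        verdicts["protocol"] = verdict(body["operation"] == expected["operation"])
        verdicts["semantic"] = verdict(body["content"] == expected["content"])
    else:
        # A malformed message cannot be checked any further.
        verdicts["protocol"] = verdicts["semantic"] = "inconclusive"
    verdicts["temporal"] = verdict(observed["elapsed_s"] <= expected["window_s"])
    return verdicts


expected = {"sender": "accounts", "operation": "balanceReply",
            "content": {"balance": 100}, "window_s": 2.0}
observed = {"sender": "accounts", "elapsed_s": 0.4,
            "raw": '{"operation": "balanceReply", "content": {"balance": 100}}'}
print(arbitrate(observed, expected))
```

The sketch only compares the observation with an expectation that somebody has already produced; producing that expectation for every message of every test case is where the cognitive effort lies.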

It is evident that the arbitration task is highly cognitive. Today, arbitration is mainly a human endeavor. The test run traces and logs are submitted to human experts who play the oracle and try to evaluate all the aspects cited above for each message exchange. This task is knowledge-intensive, requires keen attention and problem-solving capabilities, and is time-consuming and error-prone. Arbitration errors become more significant as system architectures grow in complexity. Repeated false positives - signaling a failure where there is no fault - and false negatives - not detecting a failure when it occurs - destroy the trust in the testing process of all the involved actors (development, operation, management). 

The arbitration automation, which is also called the 'oracle problem' in the test research, is one of the principal challenges of full test automation. The currently available tools and technologies in the QA & Testing market propose minimal support for the solution of the oracle problem.

Last but not least, a technical aspect that must be taken into account to warrant the trustworthiness of the testing process is the test run (execution and arbitration) repeatability. All things being equal - the same system build, the same environment, etc. - a repeated test case run must produce the same outcome and the same verdict. A derived requirement is that the SUT states that precede and follow a test run must be identical. Test repeatability is the typical feature that is easy to enunciate but difficult to implement.
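The repeatability requirement can be expressed as a simple check. The sketch below is an illustration under assumptions: the SUT exposes its state as a `state` attribute, and a test case is a function that returns the pair (outcome, verdict); the `Account` class and both test functions are hypothetical.

```python
import copy


class Account:
    """Hypothetical stateful SUT."""

    def __init__(self, balance):
        self.state = {"balance": balance}

    def withdraw(self, amount):
        if self.state["balance"] < amount:
            return "REFUSED"
        self.state["balance"] -= amount
        return "OK"


def withdraw_without_cleanup(sut):
    outcome = sut.withdraw(60)
    return outcome, "pass" if outcome == "OK" else "fail"


def withdraw_with_cleanup(sut):
    saved = copy.deepcopy(sut.state)
    try:
        return withdraw_without_cleanup(sut)
    finally:
        sut.state = saved   # restore the state that preceded the run


def is_repeatable(sut, test_case, runs=3):
    """True if every run leaves the state unchanged and yields the same outcome and verdict."""
    baseline = copy.deepcopy(sut.state)
    results = []
    for _ in range(runs):
        results.append(test_case(sut))
        if sut.state != baseline:
            return False
    return all(result == results[0] for result in results)


print(is_repeatable(Account(100), withdraw_with_cleanup),
      is_repeatable(Account(100), withdraw_without_cleanup))
```

Without the cleanup, the second run would start from a balance of 40 and the withdrawal would be refused, so the same test case would yield a different verdict on the same build. In a distributed system, the restore step has to reach the state of every component, which is what makes repeatability difficult to implement.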

The sequel in a forthcoming post...


[1] World Quality Report 2016-17 - 

[2] W. Ariola, The broken promise of test automation, SDTimes, June 16th, 2017. 

[3] What is Wrong with Automated Testing Today? - 

[4] World Quality Report 2017-18 -

[5] Sweeney, L. (2000). Simple demographics often identify people uniquely. Health (San Francisco), 671, 1-34.