Extracting the DNA of icons

We present a way to measure the quality of icons using a large-scale online test. With this data in hand, we look for ideas about what influences the quality of icon sets.


Over the last few weeks we ran a series of tests on different icon sets with one question in mind: which aspects of an icon set have what impact on its usability? Preliminary results were calculated individually for Breeze, Crystal, Elementary, Faenza, Humanity, Nitrux, Nuvola, Oxygen, Tango and Treepata.

Figure 1: Overview of all icons used in the studies.

Table 1 below shows some descriptive statistics. Familiarity ranges from 1 (“Use it daily”) to 5 (“Never seen it before”). Age was recorded in the ranges <18, 18-30, 31-40, 41-50, 51-60, and >60; the median is reported.

The total number of participants is remarkable: 6239, resulting in more than 70,000 icon-term assignments. This gives a very good data basis for all evaluations, even though we again did not reach a sufficient number of women. Thanks a lot to everyone who participated!

Table 1: Icon set descriptives
Icon set | Participants [n] | Familiarity [x̄] | Age [x̃] | Females [%]

If you want to analyze the data yourself you may download the raw data and our R script:


In this article we first discuss how to derive a quality indicator from the results of these tests. We then apply it to an overview of all tested icons and ask you for your ideas.


Icon design is a tricky task. Designers want icons to be beautiful, creative, coherent, and unique. Usability demands a simple and unambiguous design. Translators localize tooltips and might add additional context to the referred function. At the same time, icons are the most prominent visual reference for an application. They are the primary indicator for users when looking for a particular function. This makes it imperative to test icons in order to assure their quality.

Many guidelines support designers in creating icons. For example, the Microsoft Styleguide defines precisely what perspective, source of light, shade, saturation, size, and level of detail Aero icons should have. All these guidelines primarily serve the creation of a consistent look and feel. But an icon rarely stands on its own. The integration into an icon set therefore needs to be taken into consideration, whereby homogeneity (the icons of a set should belong together) and heterogeneity (within a set the risk of mix-ups should be minimized) become relevant. Alongside homogeneity, heterogeneity, and a graphically appropriate design, the most important premise for unambiguous symbols is a distinct association between the underlying function and its visual depiction. This relation is called a metaphor.

Holloway & Bailey (1995) conclude that developers should never test their own icons. Their study compared the results of 10 software developers and 10 university students for icon recognition and preferences. There were 54 icons and 15 concepts, and each concept had two to four representative icons. First, participants attempted to match each icon with one of the 15 product concepts. Next, the participants were asked to pick the best icon from the ones specifically designed to represent each concept. The students correctly recognized more icons (M = 34.7) than the developers (M = 27.8), t(18) = 2.1, p < .05. Using product developers rather than representative users can therefore result in incorrect decisions in icon usage.

Whether or not an icon is good is often assessed by qualitative tests. The material is presented to subjects who decide how well the depiction fits the term, how much they like it, and so on. A well-known test is the Multiple Index Approach (ETSI, 1993), which collects data on hits, false alarms, missing values, subjective certainty, suitability, and set and individual preference in three tests. All indicators are finally taken into the evaluation. An early test with different icon sets for the Xerox workstation (Bewley et al., 1983) also included the response time and related this information to the accuracy of descriptions at first sight. Again, the successive tests were analyzed individually.


  • Bewley, W.L., Roberts, T.L., Schroit, D. & Verplank, W.L. (1983): Human Factors Testing in the Design of Xerox’s 8010 “Star” Office Workstation, CHI’83 Proceedings
  • European Telecommunications Standards Institute (1993): The Multiple Index Approach (MIA) for the evaluation of pictograms. ETSI Technical Report Human Factors. ETR070, DTR/HF-1010B
  • Holloway, J.B. & Bailey, J.H. (1995): Don’t Use a Product’s Developers for Icon Testing. In: Tauber, J.M. (1996): Proceedings of the Conference on Human Factors in Computing Systems (CHI’96), 309-310.


Our goal is to establish an icon test that can be run online with a large number of participants and that yields quantitative results based on false assignments, missing answers, and response time.


The users’ task in the icon tests was to assign a term like Save to the appropriate pictogram. Each of the 13 icons in the different icon sets represented one of the 13 terms in the test. The icons and terms were taken from the action section. The dependent variables are the association between icon and term and the response time.


Evaluation usually starts with a careful inspection of the raw data. In our case this includes the following steps:

  1. Assigning terms to images:
    Every test gets its own icon-term association using a unique id. This association is recoded to comparable values.
  2. Rejecting artifacts (Art):
    Some measured values might be out of range and should be excluded from evaluation. We treat a response time above one hour as an artifact.
  3. Detecting aborted sessions (Abo):
    Because online tests are not controlled by a test lead, some people start the test but do not finish it. If the test was aborted we do not use the data.
  4. Detecting click throughs (Cli):
    Some people click through the whole test, maybe to read all terms, but do not select any icon. Such data could heavily bias the error rate. We only include data from subjects who took at least 50 % of the test.
  5. Finding outliers (Out):
    The common method is to take the interquartile range, i.e. the difference between the 25 % and 75 % quantiles of the distribution, multiply it by 1.5, and add the result to the 75 % quantile (or subtract it from the 25 % quantile). We apply this method to the upper border only and set the lower limit to a fixed 100 ms. Values outside these thresholds are not used for further calculations.

None of the preprocessing steps changes the raw data; each one only flags the affected item for later filtering.
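The flagging steps above can be sketched roughly as follows. This is a minimal Python illustration, not our R script: the data layout (`sessions`, `finished`, `items`, `rt_ms`, `choice`) is hypothetical, "took at least 50 % of the test" is read here as "selected an icon for at least half of the terms", and the outlier threshold is assumed to be computed only on items that passed the earlier steps.

```python
from statistics import quantiles

ARTIFACT_MS = 60 * 60 * 1000   # response times above one hour count as artifacts
LOWER_LIMIT_MS = 100           # fixed lower limit for plausible response times


def upper_fence(times_ms):
    """Return Q3 + 1.5 * IQR for a list of response times."""
    q1, _, q3 = quantiles(times_ms, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def flag_items(sessions):
    """Flag items in place instead of deleting anything.

    `sessions` is a list of dicts with the hypothetical keys 'finished'
    (bool) and 'items' (list of dicts with 'rt_ms' and 'choice', where
    'choice' is None if no icon was selected).
    """
    for s in sessions:
        answered = sum(1 for it in s["items"] if it["choice"] is not None)
        aborted = not s["finished"]
        # Assumption: fewer than half of the terms answered = click-through.
        click_through = answered < 0.5 * len(s["items"])
        for it in s["items"]:
            it["art"] = it["rt_ms"] > ARTIFACT_MS
            it["abo"] = aborted
            it["cli"] = click_through
            it["out"] = False

    # Assumption: outlier limits are derived from otherwise clean items only.
    clean = [it for s in sessions for it in s["items"]
             if not (it["art"] or it["abo"] or it["cli"])]
    if len(clean) >= 2:
        upper = upper_fence([it["rt_ms"] for it in clean])
        for it in clean:
            it["out"] = it["rt_ms"] < LOWER_LIMIT_MS or it["rt_ms"] > upper
    return sessions
```

Keeping the flags as separate columns makes it easy to count how many items each step removes, as in Table 2.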

Table 2: Number of items removed by preprocessing steps. Abbreviations explained above.
Icon set | Art | Abo | Cli | Out

Aggregate results

Last but not least, the correct associations have to be identified by separating the hits from false assignments and misses (Table 3).

Table 3: Number of items for evaluation.
Icon set | False assign [%] | Miss [%] | Hits [%] | Total [n]

In the next step we calculate the average response time for hits (correct associations) and the error rate, which is the share of missing or false associations among all participants. For instance, if an icon was mistaken by 20 out of 500 participants and 5 participants did not associate it with any item, the error rate is 0.05 (i.e. 5 %; 25/500).
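As a small illustration of this aggregation, here is a Python sketch. The function name and the input layout (a list of `(chosen_term, rt_ms)` pairs per icon, with `None` for a miss) are assumptions made for the example.

```python
def summarize_icon(responses, correct_term):
    """Return (error_rate, mean_hit_rt) for one icon.

    `responses` is a list of (chosen_term, rt_ms) tuples; chosen_term is
    None when the participant did not assign the icon to any term.
    """
    hit_times = [rt for term, rt in responses if term == correct_term]
    misses = sum(1 for term, _ in responses if term is None)
    false_assignments = len(responses) - len(hit_times) - misses

    # Misses and false assignments both count as errors.
    error_rate = (false_assignments + misses) / len(responses)
    # The response time is averaged over hits only.
    mean_hit_rt = sum(hit_times) / len(hit_times) if hit_times else None
    return error_rate, mean_hit_rt
```

With 475 hits, 20 false assignments, and 5 misses this reproduces the error rate of 0.05 from the example above.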


The first diagram shows a scatter plot of response time vs. error rate for all icons in all sets, including a linear fit with confidence intervals.

Figure 2: Relation between response time and error rate (average results per test and icon).

The response time and the error rate are clearly related: the longer it takes to find the appropriate icon, the more mistakes occur (R²=0.74, F(1,128)=360.87, p<.001).
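For readers who want to redo this kind of fit on their own data, the following Python sketch computes slope, intercept, R², and the F statistic of a simple linear regression. It is a generic textbook calculation, not an excerpt from our R script, and the function name is made up for the example.

```python
def linear_fit_stats(x, y):
    """Least-squares fit y = intercept + slope * x with R^2 and F(1, n-2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    # For one predictor, F = R^2 / (1 - R^2) * (n - 2); undefined for a perfect fit.
    f_stat = r_squared / (1 - r_squared) * (n - 2)
    return slope, intercept, r_squared, f_stat
```

With 130 data points (10 sets with 13 icons each) there are 128 residual degrees of freedom, and an R² of 0.74 corresponds to an F value of about 364, which agrees with the reported 360.87 up to rounding of R².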

Quality Indicator

If error rate and response time were interchangeable, i.e. if the correlation were 1.0, we could just use one of the parameters to predict the other. But they correlate highly (R²=0.74), not perfectly. So we assume that both are valid predictors of an underlying factor, icon quality.

For example: of two icons that are well associated and have comparably low error rates, the one with the higher response time should be rated worse than the one with the lower response time. The same is true for icons with the same response time but different error rates.

The key to obtaining a quality indicator that takes both sources of variance into account is to calculate the vector from the origin: the longer the vector, the worse the quality of the icon.

Figure 3: Values after standardization, arrows illustrate the calculation of the indicator.

So here is what we do:

  1. Standardization by z transformation:
    To make the response time comparable with the error rate we z-transform both vectors, i.e. value minus average, divided by the standard deviation. After this operation each variable has an average of zero and a standard deviation of 1.
  2. Ceiling:
    Because standardization produces negative values, all data are shifted by an offset. We determine it by taking the minimum over error rate and response time and rounding it up.
  3. Vectorization:
    The distance of each resulting point from the origin is calculated as the square root of e² plus t². To improve readability of the results, we subtract the former ceiling value from the result.


Formula 1: Calculation of the quality indicator: e and t are z-transformed, offset, and vectorized.
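The three steps can be sketched in Python as follows. This is one possible reading of the description, not the exact formula from our script: the sample standard deviation is assumed for the z-transformation, and the "ceiling" is assumed to be the magnitude of the smallest z-value over both variables, rounded up to the next integer. The absolute values produced depend on that choice of offset, so the sketch is meant to illustrate the ordering of icons rather than to reproduce the published scale.

```python
import math
from statistics import mean, stdev


def z_transform(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def quality_indicator(error_rates, response_times):
    """Return one indicator value per icon; larger means worse."""
    e = z_transform(error_rates)
    t = z_transform(response_times)

    # Assumed reading of the ceiling step: shift by the rounded-up magnitude
    # of the smallest z-value so that no coordinate is negative.
    offset = math.ceil(abs(min(e + t)))

    # Length of the vector from the origin to the shifted point,
    # with the offset subtracted again for readability.
    return [math.hypot(ei + offset, ti + offset) - offset
            for ei, ti in zip(e, t)]
```

An icon that is both slower and more error-prone than another always gets the larger value, which is the property the indicator is built on.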

Using this formula we get values from 1 (perfect) to 10 (worst); theoretically, other values are possible. The results are shown in the next figure.

Figure 4: Calculated quality indicator.

Icons that use well-established metaphors, like the scissors for Cut, all get values close to one, and icons experimenting with different metaphors can be clearly differentiated. Icons that often get mixed up, like Undo/Redo or Copy/Paste, score worse on the quality indicator, as one would expect.


Based on ten surveys, each with 13 icons from a different set and together gathering more than 50,000 valid answers, we were able to improve our indicator for the quality of icons. The new calculation is based on carefully validated error rates and response times. The indicator is computed as the distance of a point from a standardized origin. The results are consistent with what we know from previous studies.

This method still has some limitations. The most obvious is the problem of the “perfect” icon. With the current standardization every test set would produce one “perfect” icon, since average and standard deviation are calculated from the sample itself. So in the future we will have to find some sort of universal constants that make the calculation comparable between different tests. Such a constant probably needs to be adjusted for the set size (the more items a participant has to scan, the longer it takes).

Taking our data, the lowest average response time in all 10 tests is 2663 ms. Divided by the set size of 13, this yields a constant of roughly 200 ms per item (2663/13 ≈ 205). This value corresponds to psycho-physiological findings on the perception of simple objects. The current standard deviation over all 130 values is 1223 ms, which could be used as the second constant.
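How such constants might replace the per-sample mean and standard deviation is sketched below in Python. This is only an illustration of the idea: the function name is hypothetical, and using 200 ms per item times the set size as the reference point is an assumption about how the constants would be plugged in.

```python
def standardize_response_time(rt_ms, set_size, per_item_ms=200.0, sd_ms=1223.0):
    """Standardize a response time against fixed constants.

    Assumption: the reference time is per_item_ms * set_size, and sd_ms is
    the standard deviation observed over all 130 icons in our tests.
    """
    return (rt_ms - per_item_ms * set_size) / sd_ms
```

With fixed constants, an icon only scores as "perfect" if it actually reaches the reference time, regardless of which other icons happen to be in the same test.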

Additionally, the adjustment that ensures only positive values needs a fixed value. Because the ceiling is derived from the smallest values in the sample, one icon always ends up near the origin and would again be treated as perfect. We have no good idea how to solve this issue, except to always include perfect icons.

Please use the comments if you can help us to further improve our indicator or if you have ideas for a good design for a next icon test series.

Extracting the DNA of icon design

Now that we have found a way to measure the quality of icons, it is time to find out more about them. It is easy to see that the metaphor is the deciding factor for a single icon. Icons fail if they use the wrong one. But there are icon sets that score worse on average even if only icons with the same metaphor are taken into account.

And here we would like to get some crowd inspiration from you. This whole study is meant for exploration and inspiration, not so much for confirmation. So please help us find an answer to the question that still drives us:

What – beyond the metaphors – makes an icon set better than another one?

Just to give you some ideas:

How about abstract vs. explicit icon sets? What effect do 3D, the use of color, or the line weight have on the quality of the set?

We are happy to hear your ideas. To make it a bit easier for you, here is a grayscale map of all icons we tested.

Figure 5: Matrix with gray scaled highlighting of the calculated index.