The quest for auditory direct manipulation: The sonified Towers of Hanoi

Fredrik Winberg & Sten-Olof Hellström

CID, CENTRE FOR USER ORIENTED IT DESIGN

CID-112, ISSN 1403-0721, Department of Numerical Analysis and Computer Science, KTH


Report number: CID-112

ISSN number: ISSN 1403-0721 (print), 1403-073X (Web/PDF)

Publication date: September 2000

E-mail of author: [email protected], [email protected]

Reports can be ordered from:
CID, Centre for User Oriented IT Design
NADA, Department of Numerical Analysis and Computer Science
KTH (Royal Institute of Technology)
SE-100 44 Stockholm, Sweden
Telephone: +46 (0) 8 790 91 00
Fax: +46 (0) 8 790 90 99
E-mail: [email protected]
URL: http://cid.nada.kth.se


Paper presented at the 3rd International Conference on Disability, Virtual Reality and Associated Technologies, ICDVRAT 2000, Sardinia, Italy, 23-25 September 2000

The quest for auditory direct manipulation: The sonified Towers of Hanoi

Fredrik Winberg (1) and Sten Olof Hellström (1,2)

(1) Centre for User Oriented IT-Design, Royal Institute of Technology, Lindstedtsvägen 5, SE-10044 Stockholm, SWEDEN

(2) Department of Music, City University, London EC1V 0HB, UK

[email protected], [email protected]

ABSTRACT

This paper presents a study of an auditory version of the game Towers of Hanoi. The goal of the study was to investigate the nature of continuous presentation and what it could mean when implementing auditory direct manipulation. We also wanted to find out whether it was possible to make an auditory interface that met the requirements of a direct manipulation interface. The results showed that it was indeed possible to implement auditory direct manipulation, but that using Towers of Hanoi as the underlying model restricted the possibilities of scaling the auditory space. The results also showed that, with a limited set of objects, the nature of continuous presentation was not as important as how the user interacts with the auditory space.

Keywords: Auditory interface, direct manipulation, sonification model, blind users

1. INTRODUCTION

The use of computers today is very dependent on the user's sight. Information is presented visually, and sound is primarily used as very primitive cues for important visual information. This may not cause many problems today for a blind computer user using a screen reader, given that all non-textual information has some sort of alternative description linked to it (which of course is not true, but for the sake of argument we will assume that it is). But what about the other benefits that a graphical user interface gives a sighted user?

Representing information using speech synthesis or Braille is a very linear form of presentation and has more in common with old text-based interfaces such as MS-DOS than with modern graphical user interfaces such as Windows or MacOS. And what about next-generation interfaces, where the standard desktop environment is replaced by something completely different? Why should blind computer users still have to struggle with text-based interaction?

1.1 Direct manipulation

Direct manipulation is a fundamental concept within HCI (human-computer interaction) and is based on the following properties:

Continuous representation of the object of interest.

Physical actions or labelled button presses instead of complex syntax.

Rapid, incremental, reversible operations whose impact on the object of interest is immediately visible.

(Shneiderman, cited in Hutchins, Hollan, & Norman, 1985)

This means, for example, that to move a file you do not type a command on the keyboard or choose from a list of actions; you simply point the mouse at the file, grab it by pressing the mouse button, drag it to where you want it and drop it by releasing the button. Another important feature of direct manipulation is that it relies on recognition rather than recall: menus, for example, help the user remember a command's name instead of forcing the user to memorise the exact name and syntax of the command of interest.

Direct manipulation has been very influential in today's graphical user interfaces and will influence the way we interact with computers for a long time.


1.2 Screen readers

In present screen readers for blind computer users, direct manipulation as well as a number of other important features of the graphical user interface are missing. Mynatt (1997) has summarised five goals for screen reader interface design:

Access to functionality. The screen reader should at least give the user access to the same functions as are presented in the graphical user interface. In a graphical user interface, most functions are represented as pull-down menus. The screen reader should give the blind computer user access to this functionality.

Iconic representation of interface objects. The screen reader has to be able to recognise and present the same information as is communicated by the visual appearance of the interface objects, such as picture, size and colour. For example, the picture of a trash can on an icon in MacOS symbolises that the icon represents the trash can and that it is a suitable place to throw things one wants to get rid of. The shape of the icon tells the user whether the trash can is empty or has things in it.

Spatial arrangement. The spatial arrangement of the graphical objects also conveys information that helps the user in structuring and working with many tasks at once. The screen reader should offer this functionality.

Constant or persistent presentation. Visual information is not time dependent in the same way as audio is. Visual information exists in physical space and can be obtained and reviewed at any time; this is not the case for audio information. The screen reader should support this kind of temporal independence.

Direct manipulation. The screen reader should give the user the same powerful means of interaction as direct manipulation does.

If auditory direct manipulation is to be implemented, all of these goals have to be met.

Auditory direct manipulation is rather uncharted territory in both research and development, provided that we mean real direct manipulation and not just interacting directly or almost directly with interface objects. In the GUIB project, for example (GUIB Consortium, 1995), the work was concerned with giving the blind computer user a more direct way of interacting with interface objects, but it did not deal with direct manipulation itself. Other work has been done on complex auditory interfaces (see for example Gaver, Smith, & O'Shea, 1991), but most of these have involved monitoring tasks where the focus has been on the display of information rather than the interaction with it (Saue, 2000).

2. GENERAL GOALS

The two questions we want to address are:

Is auditory direct manipulation at all possible?

Is auditory direct manipulation at all interesting, or do we have to seek other paradigms for interaction with an auditory interface?

In order to answer these questions we have implemented three different audio-only, non-visual versions of the game "Towers of Hanoi" (see for example Ball, 1939). We also performed two user studies on these three versions. The goal of the studies was to investigate the first principle of direct manipulation, continuous presentation, and what this could mean in an auditory interface.

The three different levels of continuous presentation under study are the parallel, serial and overlapping presentation modes. The first extreme case is when all sounds keep repeating simultaneously, the parallel presentation. The other extreme is when there is no overlap at all and the sounds are played in sequence, the serial presentation. Finally, we implemented a mixture of these two with a slight overlap, the overlapping presentation (see Sonification model for a more detailed description of these).

3. TOWERS OF HANOI

The game "Towers of Hanoi" consists of three towers on which a number of differently sized discs are placed. Initially all discs are on the leftmost tower, ordered by size with the smallest disc on top. The goal is to move all the discs to the rightmost tower. You are only allowed to move one disc at a time, and this disc has to be on top of a tower. You can move this disc to any of the three towers as long as you do not place a larger disc on top of a smaller one. The game can be played with as many discs as you want without using more than three towers. The number of moves needed to complete the game increases rapidly when discs are added: solving the game for $n$ discs takes $2^n - 1$ moves. This means that three discs take 7 moves, eight discs take 255 moves and sixty-four discs take about $1.8 \cdot 10^{19}$ moves to complete.
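As an illustration of the move count, the following minimal Python sketch generates an optimal solution recursively and checks its length against $2^n - 1$. The function name `hanoi_moves` and the tower labels are illustrative assumptions and are not taken from the software used in the study.

```python
def hanoi_moves(n, source="left", target="right", spare="middle"):
    """Return an optimal list of (from_tower, to_tower) moves for n discs."""
    if n == 0:
        return []
    # Park the n-1 smaller discs on the spare tower, move the largest disc,
    # then move the n-1 smaller discs on top of it.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


if __name__ == "__main__":
    for discs in (3, 4, 8):
        moves = hanoi_moves(discs)
        assert len(moves) == 2 ** discs - 1
        print(discs, "discs:", len(moves), "moves")
    # The count for 64 discs is computed directly rather than by listing the moves.
    print("64 discs:", 2 ** 64 - 1, "moves")
```

Each move always takes the top disc of the source tower, so the pair of tower names is enough to describe it.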


We chose this game for three reasons: (1) we wanted a game that could be fun and challenging to solve, not a typical experimental task; (2) the rules of the game are fairly easy to learn, and the strategy is straightforward and does not change when the number of discs increases, only the number of steps in the solution path; and (3) it is easy to show the subjects a wooden model of the game so that they can learn how to solve it (this applies to both blind and sighted subjects).

4. SONIFICATION MODEL

In the experiment we studied two factors, game complexity and presentation mode. Game complexity varied at two levels, referred to as 3disc and 4disc. Presentation mode varied at three levels, referred to as serial, overlapping and parallel. See the next section for a discussion of the experimental design.

In the 3disc condition, three discs were moved between three towers. In the 4disc condition, four discs were moved between three towers. We also wanted to represent the height of any given disc on a tower. This requires the representation of up to four discs, three horizontal locations and up to four vertical locations. To accomplish this in sound, each disc is identified by associating it with a sound of a specific quality, and the positions of the discs are conveyed by spatialising the sounds in stereo, varying their amplitude envelopes and varying their length. We discuss these features of disc identity and disc location below.

4.1 Disc identity

Timbre and pitch variations are used to individuate the discs. The larger the disc, the lower the pitch. Let us call the largest disc 1 and the smallest disc 4. The sounds are mistuned with respect to each other and only rarely have partials of the same frequencies, which helps to maximise their discriminability. The fundamental frequencies of the sounds are 118 Hz (disc 1), 181 Hz (disc 2), 336 Hz (disc 3) and 456 Hz (disc 4). (In the 3disc condition, only discs 1, 2 and 3 were used.)

Disc 1 has a sparser harmonic spectrum than disc 2 and, similarly, disc 3 has a sparser spectrum than disc 4. Furthermore, discs 1 and 2 have fewer high-frequency partials than discs 3 and 4. Any combination of discs will differ from any other combination in terms of both pitch and timbre, and do so in a unique way. There is a large gap in frequency between discs 2 and 3 so as to stop the sounds from fusing together when three or more are heard from the same location in the stereo image (cf. Bregman, 1990).

The distinctions between the discs involve some redundancy or overcoding, being conveyed through simultaneous variations in more than one auditory dimension. This is necessary when a single auditory dimension is difficult to perceive in a complex auditory space (Kramer, 1994).
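To make the pitch spacing concrete, the following Python sketch lists the four fundamental frequencies and computes the musical interval between neighbouring discs. Only the frequencies come from the text; the names `DISC_FREQUENCY_HZ`, `interval_semitones` and `adjacent_intervals` are hypothetical, and measuring the gap in equal-tempered semitones is a choice made here for illustration.

```python
import math

# Fundamental frequencies in Hz as given in the text (disc 1 is the largest).
DISC_FREQUENCY_HZ = {1: 118.0, 2: 181.0, 3: 336.0, 4: 456.0}


def interval_semitones(low_hz, high_hz):
    """Size of the interval between two frequencies in equal-tempered semitones."""
    return 12.0 * math.log2(high_hz / low_hz)


def adjacent_intervals(frequencies=DISC_FREQUENCY_HZ):
    """Map each pair of neighbouring discs to the interval between them."""
    discs = sorted(frequencies)
    return {
        (a, b): interval_semitones(frequencies[a], frequencies[b])
        for a, b in zip(discs, discs[1:])
    }


if __name__ == "__main__":
    for pair, semitones in adjacent_intervals().items():
        print(pair, round(semitones, 2))
```

Under this measure the widest interval lies between discs 2 and 3, which matches the gap described above.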

4.2 Disc location

To represent the tower a disc is located on, stereo panning is varied: left, centre and right stereo locations are used. The spatial discriminability of the sounds is further enhanced by varying their amplitude envelope. Individual discs are presented by pulsing their sounds, and the character of the envelope of each pulse is varied to indicate which tower a disc is located on. If a disc is placed on the left or right tower, the percentage ratio between attack and decay is 0:100. If a disc is placed on the middle tower, the ratio is 50:50.

As with disc identity, the spatial location of the discs is represented redundantly by simultaneously varying panning and amplitude envelopes.

A disc's vertical position is represented by the length of the pulse: the higher the disc is placed, the shorter the sound. For example, if two discs are placed on the same tower, the lower one has a longer pulse than the one on top. The pulse lengths are 900, 600, 333 and 238 ms.
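The mapping from disc position to sound parameters can be sketched as a small lookup in Python. The attack:decay ratios and pulse lengths are those given in the text; the numeric pan scale, the assumption that the four pulse lengths are listed from the lowest to the highest position, and all names (`PulseShape`, `pulse_shape`, `describe`) are illustrative assumptions.

```python
from dataclasses import dataclass

# Pan scale is an assumption: -1 = hard left, 0 = centre, +1 = hard right.
PAN = {"left": -1.0, "middle": 0.0, "right": 1.0}
# Attack share of the attack:decay percentage ratio (0:100 at the sides, 50:50 in the middle).
ATTACK_PERCENT = {"left": 0, "middle": 50, "right": 0}
# Assumed order: index 0 is the lowest position on a tower, index 3 the highest.
PULSE_LENGTH_MS = [900, 600, 333, 238]


@dataclass(frozen=True)
class PulseShape:
    pan: float
    attack_percent: int
    decay_percent: int
    length_ms: int


def pulse_shape(tower, height):
    """Pulse parameters for a disc on `tower` at `height` (0 = bottom of the tower)."""
    attack = ATTACK_PERCENT[tower]
    return PulseShape(PAN[tower], attack, 100 - attack, PULSE_LENGTH_MS[height])


def describe(towers):
    """Map each disc to its pulse shape; `towers` lists discs bottom-to-top per tower."""
    return {
        disc: pulse_shape(tower, height)
        for tower, stack in towers.items()
        for height, disc in enumerate(stack)
    }


if __name__ == "__main__":
    start = {"left": [1, 2, 3, 4], "middle": [], "right": []}
    for disc, shape in describe(start).items():
        print(disc, shape)
```

In the starting position of the 4disc game, for instance, disc 1 at the bottom of the left tower gets the longest pulse and disc 4 on top gets the shortest.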

4.3 Presentation modes

To represent the overall configuration of the Towers of Hanoi at any given moment, three (in 3disc) or four (in 4disc) inter-related series of pulses are heard. The relative timings of these series, and the inter-pulse intervals within them, have been designed in three different ways.

The serial condition. The pulses for the discs are repeated in numerical order without any delay or overlap. As the pulses vary in length to represent the height on the tower, the inter-pulse interval in this condition will vary depending on the location of the discs.

The overlapping condition. The inter-pulse onset interval is set to a constant value of 300 ms. Accordingly, the pulse of a disc overlaps with that of the following disc unless the disc is placed on top of three others, in which case its pulse length is shorter than the inter-pulse onset interval. Discs 1, 2 and 3 (and then 4 in 4disc) are repeatedly pulsed in order.

The parallel condition. The discs are pulsed continually and simultaneously. For each disc, the onset of a new pulse occurs immediately after the release of the previous one.
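The three timing schemes can be compared with a short Python sketch that lists pulse onsets for a given set of pulse lengths. It is a sketch under stated assumptions, not the study's implementation: the function name `schedule` is hypothetical, all parallel streams are assumed to start at time zero, and in the overlapping mode the 300 ms onset interval is assumed to apply between the last disc of one cycle and the first disc of the next as well.

```python
OVERLAP_ONSET_INTERVAL_MS = 300  # constant onset interval given in the text


def schedule(mode, pulse_lengths_ms, cycles=2):
    """Return (onset_ms, disc, length_ms) tuples sorted by onset.

    `pulse_lengths_ms` maps each disc number to its current pulse length.
    """
    discs = sorted(pulse_lengths_ms)
    events = []
    if mode == "parallel":
        # Every disc repeats on its own: a new pulse starts when the previous one ends.
        for disc in discs:
            length = pulse_lengths_ms[disc]
            events += [(i * length, disc, length) for i in range(cycles)]
    elif mode in ("serial", "overlapping"):
        onset = 0
        for _ in range(cycles):
            for disc in discs:
                length = pulse_lengths_ms[disc]
                events.append((onset, disc, length))
                # Serial waits for the pulse to end; overlapping uses a fixed onset interval.
                onset += length if mode == "serial" else OVERLAP_ONSET_INTERVAL_MS
    else:
        raise ValueError(f"unknown presentation mode: {mode}")
    return sorted(events)


if __name__ == "__main__":
    # Three discs stacked on one tower, so the lengths are 900, 600 and 333 ms.
    stacked = {1: 900, 2: 600, 3: 333}
    for mode in ("serial", "overlapping", "parallel"):
        print(mode, schedule(mode, stacked, cycles=1))
```

With three stacked discs, the serial onsets depend on the pulse lengths, while the overlapping onsets fall on a fixed 300 ms grid regardless of where the discs are.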

4.4 Mouse location

We used the mouse as the input device. In order for the user to track the mouse cursor, the amplitude of the discs on the tower where the cursor is located is increased while the amplitudes of the other discs are decreased (the amplitude ratio is 1:3). Using this amplitude focus alone can cause problems when all discs are located on either the right or the left tower, since there are then no sounds to indicate the difference between the middle and the opposite side. To solve this problem we also use transition tones that sound when the cursor moves from one tower to another. When moving to the left or right from the middle, a short high tone (600 Hz for 500 ms) sounds from the left or right. When moving from the left or right to the middle, a short lower tone (400 Hz for 500 ms) sounds from both left and right.
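The cursor feedback can be sketched in Python as two small functions, one for the amplitude focus and one for the transition tones. The 1:3 ratio and the tone frequencies and durations come from the text; the absolute gain values, the function names and the handling of a direct jump between the left and right towers (which the text does not describe) are assumptions.

```python
FOCUS_GAIN = 1.0                      # assumed absolute level for the focused tower
BACKGROUND_GAIN = FOCUS_GAIN / 3.0    # 1:3 amplitude ratio from the text
TONE_LENGTH_MS = 500


def disc_gains(cursor_tower, towers):
    """Gain per disc: discs on the tower under the cursor are emphasised."""
    return {
        disc: FOCUS_GAIN if tower == cursor_tower else BACKGROUND_GAIN
        for tower, stack in towers.items()
        for disc in stack
    }


def transition_tone(old_tower, new_tower):
    """Return (frequency_hz, length_ms, channels) for a cursor move, or None."""
    if old_tower == new_tower:
        return None
    if old_tower == "middle":
        # Leaving the middle: high tone from the side being entered.
        return (600, TONE_LENGTH_MS, (new_tower,))
    if new_tower == "middle":
        # Entering the middle: lower tone from both sides.
        return (400, TONE_LENGTH_MS, ("left", "right"))
    # A direct left-to-right or right-to-left jump is not described in the text.
    return None


if __name__ == "__main__":
    towers = {"left": [1, 2], "middle": [3], "right": []}
    print(disc_gains("left", towers))
    print(transition_tone("middle", "right"))
    print(transition_tone("left", "middle"))
```

The transition tones carry the location information in exactly the case where the amplitude focus fails, namely when the towers on either side of the move hold no discs.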

5. THE STUDY

5.1 Hypotheses

It is possible to design an auditory interface that meets the requirements of direct manipulation as defined above.

The overlapping presentation mode will be the version that most subjects both prefer and get the best results from using. This will be further emphasised when increasing the complexity (using four instead of three discs).

When presenting the sounds using the parallel presentation mode, it will be easy to get a general overview of all objects, but separating them could be quite hard when there are many objects. Furthermore, continuous sounds, or continuous presentation of sounds, could be harder to separate (cf. Gaver, Smith, & O'Shea, 1991), and masking would be more likely.

When using the serial presentation mode there is no problem separating the objects, since no sounds ever overlap. On the other hand, the general overview is harder to obtain, since the user has to wait until all sounds have been played. If the auditory interface is supposed to support direct manipulation and the number of objects is large, this could hardly be called continuous presentation.

Since both of these presentation modes have drawbacks and advantages in comparison to one another, a combination of the two seems to be the most appropriate, namely the overlapping presentation mode.

5.2 Experimental design

The experiment was a two-factor within-subject design. The first factor is presentation mode, varied at three levels: serial, overlapping and parallel. The second factor is game complexity, varied at two levels: three and four discs. Each subject played the game once for every combination of the two factors, which means that every subject played the game six times. The sequence of the combinations was counterbalanced using a Latin square.

During the experiment two quantitative measurements were made: number of errors (or rather the number of extra steps in the solution path compared with the optimal path) and time to complete. After the session the subjects answered questions about which presentation mode they preferred and which they thought they performed best with.
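The counterbalancing and the error measure can be illustrated with a short Python sketch. The paper does not say which Latin square was used, so the cyclic construction below is only one possible choice (it balances position but not carry-over effects); the names `cyclic_latin_square`, `order_for_subject` and `extra_steps` are hypothetical.

```python
from itertools import product

MODES = ("serial", "overlapping", "parallel")
DISCS = (3, 4)
CONDITIONS = list(product(MODES, DISCS))  # the six mode/complexity combinations


def cyclic_latin_square(items):
    """Row r is `items` rotated by r, so each item appears once per row and column."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def order_for_subject(subject_index):
    """Condition order for a subject; rows are reused once all have been assigned."""
    square = cyclic_latin_square(CONDITIONS)
    return square[subject_index % len(square)]


def extra_steps(moves_made, discs):
    """Error measure: number of moves beyond the optimal 2**discs - 1."""
    return moves_made - (2 ** discs - 1)


if __name__ == "__main__":
    for subject in range(12):
        print(subject, order_for_subject(subject))
    print(extra_steps(9, 3))  # 9 moves with three discs is 2 more than the optimal 7
```

With twelve subjects and six conditions, each row of such a square would be used by two subjects.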

The quantitative data had to be analysed using nonparametric statistical methods, since the measurements could be classified as neither ratio nor interval, but rather as ordinal. Additionally, these methods are very insensitive to extreme values, which is important in an experiment where one might expect a learning effect that varies between subjects. The three-level factor (presentation mode) was analysed using the Friedman two-way analysis of variance by ranks. The two-level factor (game complexity) was analysed using the Wilcoxon signed ranks test.
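For readers unfamiliar with the Friedman test, the following Python sketch computes its statistic from first principles: each subject's measurements are ranked across the $k$ conditions, the ranks are summed per condition, and $\chi^2_r = \frac{12}{nk(k+1)} \sum_j R_j^2 - 3n(k+1)$ for $n$ subjects. This is a textbook form without the correction for ties, not the analysis code used in the study, and the numbers in the example are made up purely to show the call.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_chi_square(rows):
    """Friedman statistic for rows of k related measurements (no tie correction)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for column, rank in enumerate(average_ranks(row)):
            rank_sums[column] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


if __name__ == "__main__":
    # Made-up extra-step counts for four imaginary subjects (NOT data from the study);
    # columns stand for serial, overlapping and parallel.
    illustrative = [[4, 1, 6], [2, 0, 3], [5, 2, 2], [3, 1, 7]]
    print(friedman_chi_square(illustrative))
```

The statistic is compared with a chi-square distribution with $k - 1$ degrees of freedom, which for the three presentation modes means two degrees of freedom.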

The experimental set-up was very simple. The subject used a pair of earphones and a regular computer mouse. The computer screen was turned away from the subject and was used exclusively by the session leader to monitor what the subject was doing.

A session started with the subject being informed about what was going to happen during the experiment and about the purpose of the study. After this, the subject learnt to play the game using a wooden model. This continued until the subject knew how to solve the game for both three and four discs without making any errors. By doing this we tried to even out differences in prior knowledge of the game and to give all subjects a useful and similar model in mind when solving the auditory version of the game. Once the subject was accustomed to the game, the wooden model was taken away and the sonification model was presented. All aspects were described and demonstrated, and the subject was allowed to ask questions and hear every detail as many times as he or she wanted. The subject was also informed that this was the last chance to ask questions about the game or the sonification model. When the subject felt that he or she knew the sonification model, the experiment started. After all combinations of the game had been completed, the session ended with the subject answering a number of questions about preferences and perceived performance.

5.3 The first study

The results of the first study had more to do with the sonification model and the experimental design than with the question of continuous presentation (Winberg & Hellström, 2000). Of the twelve subjects, three could not complete all or some of the conditions. The two things that caused these dropouts, and caused problems for all subjects for that matter, were the mouse interaction and the instructions.

The sonification model in this first study differed from the one described above in how the mouse interaction was designed. The amplitude ratio was smaller (1:2) and there were no transition tones. Most subjects had great difficulty tracking the mouse cursor without transition tones. Many subjects found it very hard to find the middle tower, flipping the cursor from left to right without ever finding the middle. This introduced a large random component into their results, making the collected data very unreliable.

The instructions that the subjects received also caused problems. Many subjects simply did not understand the sonification model at all, invalidating the basic assumption that all subjects would have a model of the game and an understanding of the sonification model before the experiment started.

Despite the problems encountered during this first study, nine of the twelve subjects, as stated above, understood the task and performed well in the experiment. Because of these problems, however, we had no valid data on which to perform statistical calculations. The qualitative results nevertheless pointed towards a strong preference for the overlapping version, and the subjects also thought that they performed better using this presentation mode, even though this could not be deduced from the measured data.

5.4 The second study

When planning the second study, we introduced transition tones and increased the amplitude difference to improve the mouse interaction (see Mouse location above). We also changed the instructions and described the sonification model more thoroughly to the subjects. By making these two adjustments, we hoped that the problems encountered in the first study would be eliminated. The rest of the set-up and experimental design remained the same in the second study.

The outcome of the second study was very encouraging. None of the problems with the mouse interaction from the first study appeared: none of the subjects had any problems tracking the mouse cursor or finding the middle tower.

Again, as expected, the subjects preferred the overlapping version, but the statistical analysis gave very surprising results. The hypothesis that the overlapping version would be better could not be supported by the analysis. The only significant results were the following. When using the overlapping presentation mode, it took more time to complete the game with four discs than with three (Wilcoxon $z_{time} = -2.275$, $p < 0.05$), a result that is not interesting at all since the number of moves is greater. The differences in number of errors between presentation modes were also significant, but only when using three discs (Friedman $\chi^2_{errors} = 7.032$, $p < 0.05$), not when using four discs (Friedman $\chi^2_{errors} = 0.391$, $p > 0.05$; $\chi^2_{time} = 1.167$, $p > 0.05$), which is where the hypothesis predicted the difference would be most pronounced. We did not find any significant difference between the three presentation modes in general either, that is, without taking the number of discs into account (Friedman $\chi^2_{errors} = 2.909$, $p > 0.05$; $\chi^2_{time} = 3.583$, $p > 0.05$).

5.5 Subjects

We used twelve subjects in each study. Of these, just one was blind in the first study and none in the second. The reason for not using more blind people, who indeed are the focus of this work, is that at this early stage we wanted to fine-tune the experiment and the sonification model so that we would not "waste" the blind subjects by letting them participate in an experiment that might not give any interesting or valid results. Additionally, at this early stage we are interested in such low-level questions that the difference in experience of auditory interfaces between blind and sighted subjects is of minor importance. The continuation of this work will definitely involve blind users in defining, designing and evaluating the auditory direct manipulation interfaces.


6. DISCUSSION

The first question one might ask is whether a sonification of the game Towers of Hanoi is complex in a relevant and interesting way, and whether it really helps in testing the hypotheses. The sonification and the study of Towers of Hanoi as an auditory direct manipulation interface should not be seen in isolation. As will be pointed out below, this is just the first of a number of studies on auditory direct manipulation. However, this study has given us important pointers on where to go from here.

The second question is why we chose the sounds that we did, rather than real-world sounds like auditory icons (Gaver, 1994) with natural mappings. The reason for choosing abstract sounds is primarily the scalability of the model (cf. earcons in Blattner, Sumikawa, & Greenberg, 1989). Added to this is the fact that metaphorical or real-world sounds are hard to apply when displaying abstract concepts. There is even research reporting that many acoustical mappings of auditory variables are not necessarily perceived in the same way by all listeners, even though they would be considered intuitive or natural (Walker, Kramer, & Lane, 2000).

The sonification of "Towers of Hanoi" presented in this paper meets the requirements of a direct manipulation interface.

The objects are presented in a continuous manner (or rather in three different ways that could all be considered continuous enough). This means that, rather than scanning the whole interface for objects with the mouse and risking missing some of them, the user can hear all objects all the time without actually "looking" for them.

We have physical actions instead of complex syntax. Instead of choosing the appropriate action from a menu or typing it on the keyboard, the user picks up, moves and drops the object just as in a graphical user interface.

All actions are rapid, incremental and reversible with immediate feedback. There are no detours when performing an action; the only way of moving a disc is by the shortest path between the start and the goal. If the user decides that a move is wrong, it is as easy to move the disc back as it was to move it there.
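The pick-up, move and drop interaction and its reversibility can be sketched as a minimal game state in Python. The class name `HanoiBoard` and its methods are hypothetical, and returning a disc to its original tower after an illegal drop is an assumption made here; the paper does not describe how illegal drops were handled.

```python
class HanoiBoard:
    """Minimal game state: each tower lists its discs bottom-to-top (1 = largest)."""

    def __init__(self, discs):
        self.towers = {"left": list(range(1, discs + 1)), "middle": [], "right": []}
        self.held = None    # disc currently picked up, if any
        self.origin = None  # tower the held disc came from

    def pick_up(self, tower):
        """Grab the top disc of `tower` (mouse button pressed)."""
        if self.held is not None or not self.towers[tower]:
            return False
        self.held, self.origin = self.towers[tower].pop(), tower
        return True

    def drop(self, tower):
        """Release the held disc on `tower`; an illegal drop returns it to its origin."""
        if self.held is None:
            return False
        stack = self.towers[tower]
        # Smaller discs have larger numbers, so the held disc must outnumber the top disc.
        legal = not stack or stack[-1] < self.held
        self.towers[tower if legal else self.origin].append(self.held)
        self.held = self.origin = None
        return legal

    def solved(self):
        """True when every disc is on the right tower."""
        return not self.towers["left"] and not self.towers["middle"] and self.held is None


if __name__ == "__main__":
    board = HanoiBoard(3)
    board.pick_up("left")
    board.drop("middle")   # move the smallest disc
    board.pick_up("middle")
    board.drop("left")     # and move it straight back: the move is reversible
    print(board.towers)
```

Undoing a move takes the same two actions as making it, which is the sense in which the operations are reversible.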

The hypothesis that the overlapping presentation mode would be better could not be supported by this study, even though this is what we expected. We have three different explanations for this.

Since the game Towers of Hanoi in itself can be complex and hard to solve for some people, the complexity or number of auditory objects that it is possible to present is limited by the underlying model. During an early pilot study, we concluded that subjects could have problems solving the game if we used five or more discs. Therefore, we had to limit the number of objects to four. When increasing the complexity of the auditory space we might get different results. This calls for more studies of this kind of auditory interface, where the complexity is not limited by the underlying model but rather by the limits of the sonification model and the user's perception.

Using the mouse interaction with amplitude focus facilitates the interaction with a complex auditory space. With a limited set of auditory objects, the means of interaction is more important than the way continuous presentation is solved. The way we implemented the mouse interaction increases the number of objects and the complexity that it is possible to interact with.

The sonification model used is robust enough not to be influenced by presentation mode, at least in this specific context. Again, this calls for further studies using other contexts in order to assess the validity of this specific approach.

All three explanations call for further studies. The final stage of this study of Towers of Hanoi is to make an extensive case study with blind subjects, in which the qualitative aspects of this auditory interface are investigated. This third study will help us understand more about auditory direct manipulation, how it could be used and what type of applications would be interesting to implement using this paradigm.

We believe that auditory direct manipulation is indeed both possible and a promising way of interacting with an auditory interface. It will provide blind computer users with a completely new way of interacting with a computer, a way that so far has been inaccessible.

7. REFERENCES

Ball, W. W. R. (1939). Mathematical recreations & essays (11th ed.) (pp. 303-305). London: Macmillan & Co.

Blattner, M., Sumikawa, D., & Greenberg, R. (1989). Earcons and Icons: Their Structure and Common Design Principles. Human-Computer Interaction, 4(1), 11-44.

Bregman, A. S. (1990). Auditory scene analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press.


Gaver, W. W. (1994). Using and Creating Auditory Icons. In G. Kramer (Ed.), Auditory display: Sonification, audification, and auditory interfaces (pp. 417-446). Reading, USA: Addison-Wesley.

Gaver, W. W., Smith, R. B., & O'Shea, T. (1991). Effective sounds in complex systems: The ARKola simulation. In Proceedings of CHI'91 (pp. 85-90). New York: ACM.

GUIB Consortium. (1995). Final Report of the GUIB Project: Textual and Graphical Interfaces for Blind People. London: Royal National Institute for the Blind.

Hutchins, E. L., Hollan, J. D., & Norman, D. A. (1985). Direct manipulation interfaces. In D. A. Norman & S. W. Draper (Eds.), User centered system design (pp. 87-124). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Kramer, G. (1994). Some Organizing Principles for Representing Data with Sound. In G. Kramer (Ed.), Auditory display: Sonification, audification, and auditory interfaces (pp. 185-221). Reading, USA: Addison-Wesley.

Mynatt, E. D. (1997). Transforming graphical interfaces into auditory interfaces for blind users. Human-Computer Interaction, 12, 7-45.

Saue, S. (2000). A model for interaction in exploratory sonification displays. In Proceedings of ICAD 2000 [online proceedings]. URL http://www.icad.org/websiteV2.0/Conferences/ICAD2000/ICAD2000.html (visited 2000, July 11).

Walker, B. N., Kramer, G., & Lane, D. M. (2000). Psychophysical Scaling of Sonification Mappings. In Proceedings of ICAD 2000 [online proceedings]. URL http://www.icad.org/websiteV2.0/Conferences/ICAD2000/ICAD2000.html (visited 2000, July 11).

Winberg, F. & Hellström, S. O. (2000). Investigating Auditory Direct Manipulation: Sonifying the Towers of Hanoi. In CHI 2000 Extended Abstracts (pp. 281-282). New York: ACM.