Resources page


IR-DTW Python implementation

We make available here a full implementation of the Information Retrieval-based Dynamic Time Warping (IR-DTW) algorithm using a K-means-based indexing algorithm for the search database.
IR-DTW is a dynamic programming algorithm that can be used to search for very similar series of real-valued vectors (queries) inside a larger database. It can be seen as a hybrid between IR-based algorithms used to find exact repetitions of time series, which are usually used in image search and similar domains, and the well-known DTW algorithm, which allows for a certain amount of time warping when comparing two time series.
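To make the time-warping idea concrete, here is a minimal sketch of the classic DTW recursion between two sequences of real-valued vectors. This is plain textbook DTW, not IR-DTW and not the code distributed on this page; the function names `euclidean` and `dtw_distance` are chosen only for this illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(query, reference):
    """Accumulated cost of the best monotonic alignment (classic DTW)."""
    n, m = len(query), len(reference)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning query[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(query[i - 1], reference[j - 1])
            # Allowed moves: advance in the query, the reference, or both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The second series repeats one frame, so warping absorbs the difference
print(dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]]))  # 0.0
```

The table has one cell per pair of frames, so the cost grows with the product of the two lengths. That growth is the reason an exhaustive comparison against a large database does not scale.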
Although the algorithm was first used in speech, it is generic enough to be adapted to any domain. The tree-based implementation of the IR-DTW algorithm uses a hierarchical K-means tree structure to search for similar points in the database without performing an exhaustive comparison with all points. This allows for a scalable implementation of the system.
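The idea of a hierarchical K-means tree can be sketched as follows. This is a simplified, pure-Python illustration and not the released implementation: the names `build_tree` and `search`, the branching factor `k`, the `leaf_size` and the number of K-means iterations are all assumptions made for this example. The search descends to a single leaf, so it returns an approximate nearest neighbour rather than a guaranteed exact one.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def nearest(centroids, p):
    """Index of the centroid closest to p (first one wins on ties)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], p))


def kmeans(points, k, iters, rng):
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # Keep the old centroid when a cluster ends up empty
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def build_tree(points, ids, k=2, leaf_size=4, iters=5, rng=None):
    """Recursively split the points with K-means until the leaves are small."""
    rng = rng or random.Random(0)
    if len(points) <= leaf_size:
        return {"ids": ids}
    centroids = kmeans(points, k, iters, rng)
    # Final assignment with the final centroids, the same rule search() uses
    buckets = [[] for _ in range(k)]
    for p, i in zip(points, ids):
        buckets[nearest(centroids, p)].append((p, i))
    if sum(1 for b in buckets if b) < 2:
        return {"ids": ids}  # could not split (e.g. duplicated points)
    children = []
    for c, b in zip(centroids, buckets):
        if b:
            pts, idx = zip(*b)
            children.append((c, build_tree(list(pts), list(idx), k, leaf_size, iters, rng)))
    return {"children": children}


def search(tree, points, q):
    """Follow the closest centroid at each level, then scan only one leaf."""
    node = tree
    while "children" in node:
        node = min(node["children"], key=lambda ch: sq_dist(ch[0], q))[1]
    return min(node["ids"], key=lambda i: sq_dist(points[i], q))
```

With a tree built as `tree = build_tree(db, list(range(len(db))))`, a call to `search(tree, db, q)` compares `q` with a few centroids per level and with the points of one leaf, instead of with every point in `db`.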
The code made available here is a Python implementation of the algorithm. Its initial development was done by Gautam Mantena during a summer internship in 2012 at Telefonica Research Barcelona. The code is made available for research purposes and to demonstrate the capabilities of IR-DTW, but it is not optimized for speed. Please read the accompanying licence.txt file for detailed licensing information.
A description of the IR-DTW implementation can be found in:

Gautam Mantena and Xavier Anguera, "Speed Improvements to Information Retrieval-Based Dynamic Time Warping using Hierarchical K-means Clustering", In Proc. ICASSP 2013, Vancouver, Canada

Code versions for download:

Segmentation plugin for Wavesurfer

If you are doing research on Speaker Diarization you've either been using this tool for Wavesurfer or you've been suffering from the lack of simple (but good) tools to view RTTM results. I found this tool very useful, and once I changed computers and could not find it online I asked its author, Douglas Reynolds from MIT-Lincoln lab (Doug's website), for a copy. With his permission I also post the plugin here for all (and myself in the future) to download.
The plugin (downloadable here) needs to be installed in the plugins directory inside the .Wavesurfer/x.y/ directory in your home directory (where x.y is the version of Wavesurfer you have; I have 1.8). Just unzip it and copy all the files in there.
Once the files are copied, you will be able to right-click on top of any open waveform in Wavesurfer and select "create pane" and then "Segments". An empty pane will then open and the following menu will appear.

The fields are self-explanatory. Basically, your RTTM file goes in the top-left, a UEM file (if you have one) in the top-right, and "Base file name" takes the pattern that the plugin will use to fetch the segments for your particular waveform (useful in case you have all your system outputs in the same file). Once done, press OK and if everything is good you should see something like

MAMI database

The MAMI database was recorded at my company by several volunteers using mobile phone software, and has been released in the hope that it will be useful to other researchers. The full database can be downloaded here (180 MB). Below I copy part of the readme file in the package, describing the database:

==== MAMI Database of isolated spoken words through a mobile phone ====
-- Introduction --
The database accompanying this file is called MAMI, which stands for Multimodal Automatic Mobile Indexing, the name of the project that triggered its recording. It contains recordings from 23 people, each one repeating each of 47 words 5 times, spoken in isolation. All words are Spanish words, although the speakers might not be Spanish.
The words were chosen to reflect a broad range of common words people would like to use when tagging pictures using a mobile phone, and are described in the next section. All users recorded the database using an HTC Touch(TM) phone running Windows Mobile 6.0. For each word to be recorded (selected at random from the 235 utterances, 47 words x 5 repetitions, that a user had to record), the user had to touch the screen both to start and to stop recording. Although the device did not start recording until after the screen was pressed, there are sometimes click sounds at the start or the end of the recordings. Throughout the database there is a random amount of silence/noise before and after the spoken words.
-- Contents of the database --
The database contains recordings from a total of 23 people, each one in a different folder labelled spkr01 to spkr23. Each speaker recorded 47 different words a total of 5 times each by using a custom-made application on a touch-screen phone. Each user recorded all words in random order. All recordings were made inside a building, with different levels of background speech noise and reverberation (the background conditions are not guaranteed to remain constant across all recordings of a single speaker). The words being recorded were initially classified into the following 6 categories:

1. Nature: rio, playa, parque, nieve, montana (ascii form of montaña), lago, isla, cascada
2. Cities: Zaragoza, Sevilla, Paris, Madrid, Londres, Granada, Barcelona, Chicago
3. People: Zapatero, Raul, Nuria, Pablo, Bill, Carlos, Clinton, Alierta
4. Events: navidad, fiesta, cumpleanos (ascii form of cumpleaños), espectaculo (ascii form of espectáculo), boda, bautizo, barbacoa
5. Family: tia (ascii form of tía), primo, padre, madre, hermano, bebe, abuelo, amigo
6. Monuments: rambla, puente, plaza, piramide (ascii form of pirámide), fuente, acueducto, estatua, catedral

In each speaker folder each recorded word is stored in an individual WAV file, encoded with 16 bits/sample at a sample rate of 11025 Hz. The format of the file names is as follows: spkrXX-WORD_Y.wav, where XX indicates a speaker ID from 01 to 23, WORD indicates the spoken word from the list above, and Y indicates the repetition number, from 1 to 5.
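The naming scheme above can be parsed with a short regular expression. The following is only a sketch: the helper names `parse_mami_filename` and `expected_file_count` are made up for this example, and the exact capitalisation of WORD in the real file names is not assumed (the word is returned as found).

```python
import re

# spkrXX-WORD_Y.wav, with XX = 01..23 and Y = 1..5 as described in the readme
FILENAME_RE = re.compile(r"^spkr(\d{2})-(.+)_([1-5])\.wav$")


def parse_mami_filename(name):
    """Return (speaker_id, word, repetition) for a MAMI WAV file name."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError("not a MAMI file name: %r" % name)
    speaker = int(m.group(1))
    if not 1 <= speaker <= 23:
        raise ValueError("speaker ID out of range: %r" % name)
    return speaker, m.group(2), int(m.group(3))


def expected_file_count(speakers=23, words=47, repetitions=5):
    """Number of WAV files implied by the description of the database."""
    return speakers * words * repetitions


print(parse_mami_filename("spkr07-playa_3.wav"))  # (7, 'playa', 3)
print(expected_file_count())                      # 5405
```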
In addition, each audio file is accompanied by two support files: *.txt and *.wrd. The "txt" file contains a single line indicating the length (in samples) of the word being spoken plus all accompanying silence (i.e. the total duration of the file). The "wrd" file indicates the approximate location of the spoken word within the file (in samples). Note that the position of the spoken word was estimated using a simple energy-based algorithm and might be inaccurate for certain applications; it is given for completeness.
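As a quick sanity check, the length stored in a "txt" file can be compared with the number of samples in the corresponding WAV file, using only the Python standard library. This sketch assumes that the "txt" file shares the base name of the WAV file and that its single line starts with an integer; the layout of the "wrd" file is not detailed here, so it is left out. The helper names are hypothetical.

```python
import wave
from pathlib import Path

SAMPLE_RATE_HZ = 11025  # sample rate stated in the readme


def wav_num_samples(wav_path):
    """Number of sample frames stored in the WAV file."""
    with wave.open(str(wav_path), "rb") as w:
        return w.getnframes()


def declared_num_samples(txt_path):
    """Length in samples as written in the txt file (assumed to be an integer)."""
    return int(Path(txt_path).read_text().split()[0])


def length_matches(wav_path):
    """True if the txt file next to the WAV file agrees with the audio length."""
    wav_path = Path(wav_path)
    txt_path = wav_path.with_suffix(".txt")  # assumed naming of the support file
    return wav_num_samples(wav_path) == declared_num_samples(txt_path)


def samples_to_seconds(num_samples, rate=SAMPLE_RATE_HZ):
    return num_samples / rate
```

For example, `length_matches("spkr01/spkr01-playa_1.wav")` would check one file, assuming that path exists in your copy of the database, and `samples_to_seconds` converts any of the sample counts into seconds.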
-- References for citation --
If you want to add a reference in your paper regarding this database you can use the following:
"MAMI: Multimodal Annotations on a Camera Phone", X. Anguera and N. Oliver, in Proc. MobileHCI, Amsterdam, September 2008
Xavier Anguera (xanguera __--at--__, 2010