(Semantic) Similarity-Blog

Why ballpoint pens and pencils are similar?

Archive for August, 2006

Interesting Course: Categories and Concepts

The Categories and Concepts course, given in 2003 at the psychology department of the University of Texas at Austin by Bradley Love and Jody Hendrix, offers a list of selected core readings (with download links) for people interested in concepts and categories, and also in the role of similarity.

Categories and Concepts course at: http://love.psy.utexas.edu/~love/concepts/

SimPack - Toolkit

The Department of Informatics at the University of Zurich has developed a similarity measurement toolkit called SimPack. So far, it supports the following measurement approaches:

  • feature vectors
  • strings or sequences of strings
  • trees and graphs
  • information theory

The project is developed in Java and available (together with the Javadoc API) at: http://www.ifi.unizh.ch/ddis/simpack.html

Why Ballpoint Pens and Pencils are Similar?

Just in case you are wondering about the title of this blog: it is taken from an urban legend claiming that NASA spent millions of dollars on developing a space pen that writes in the absence of gravity, while the Russians simply used a pencil to solve the same problem (see http://en.wikipedia.org/wiki/Pencil, section ‘pencils in space’). The German business magazine Handelsblatt took up the story for a television ad clip stating ‘it’s substance that matters’. The funny thing about the clip is that Handelsblatt claims that substance decides, yet was not aware that the space pen story is a myth.
    Nevertheless, I took the WordNet definitions of ballpoint pen and pencil and compared them using the MDSM approach [37]; it turned out that they were not very similar at all. This was the inspiration for my MDSM+TR paper [39], where I tried to integrate Sowa’s Thematic Roles into MDSM to stress the importance of function for similarity assessments. Ballpoint pens and pencils are made of different parts and materials, but both share the role of being writing implements.

Download the ad clip (German language only) at: http://www.bbdo.de/de/home/news/20030/spot_space_pen.html

Sowa’s Thematic Roles: http://www.jfsowa.com/ontology/thematic.htm

[37] Rodríguez, A. M. and M.J. Egenhofer, Comparing Geospatial Entity Classes: An Asymmetric and Context-Dependent Similarity Measure. International Journal of Geographical Information Science, 2004. 18(3): p. 229-256.

[39] Janowicz, K. (2005). Extending Semantic Similarity Measurement by Thematic Roles. First International Conference on GeoSpatial Semantics, GeoS 2005, Mexico City, Mexico. Springer Verlag: Berlin. p. 137-152.

Role & Filler-Similarity for Description Logics

At least in my opinion there are two ways to handle similarity between role-filler pairs. The first (and maybe most straightforward) one is to define similarity as the product of the similarities derived by comparing roles and fillers (see equation 1). The second approach is a weighted sum of role and filler similarities (see equation 2).

  (1) sime(∃R.C, ∃S.D) = simr(R, S) * simc(C, D)
  (2) sime(∃R.C, ∃S.D) = ωr * simr(R, S) + ωc * simc(C, D)

As an example, both equations measure the overlap between existential quantifications (sime), where simr is the inter-role and simc the inter-filler (range-concept) similarity. Equation 1 returns 0 if the compared roles or fillers are dissimilar (sim = 0), which is an advantage in terms of computation time and, more importantly, avoids misleading results as discussed below. Nevertheless, treating role and filler similarity as equally important seems oversimplified. Moreover, unless simr or simc is 0 or 1, the resulting similarity sime is by definition (of multiplication) smaller than both simr and simc, which probably contradicts the way humans perceive similarity! The second approach, however, raises the question of how to semi-automatically derive the weightings that determine the relative importance of inter-role and inter-filler similarity for sime. In addition, high similarity ratings (for sime) already occur if one of the measured similarities is high, while the other may even be 0.
     Imagine a transportation device ontology, where R specifies an inside and S a disjoint relation, so that the inter-role similarity simr is 0. If both fillers C and D stand for waterways (simc = 1), equation 1 yields 0, while equation 2 results in ωc*1. Now one may argue that the weighting for inter-role similarity should be higher, but then you just need to switch the example (by defining dissimilar fillers) to run into the same difficulty again.
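
The two combination schemes and the waterway example can be sketched in a few lines of Python. This is only a minimal illustration, not the Sim-DL implementation: the function names are made up, the role and filler similarities are passed in as plain numbers in [0, 1], and the weights 0.6 and 0.4 are simply the values used for the first experiments mentioned further down in this post.

```python
def sim_product(sim_r: float, sim_c: float) -> float:
    """Equation 1: product of inter-role and inter-filler similarity."""
    return sim_r * sim_c


def sim_weighted_sum(sim_r: float, sim_c: float,
                     w_r: float = 0.6, w_c: float = 0.4) -> float:
    """Equation 2: weighted sum (default weights are illustrative assumptions)."""
    return w_r * sim_r + w_c * sim_c


# Example from the text: the roles 'inside' and 'disjoint' are assumed to be
# completely dissimilar (sim_r = 0), both fillers are waterways (sim_c = 1).
sim_r, sim_c = 0.0, 1.0
product_result = sim_product(sim_r, sim_c)        # role mismatch wipes out the result
additive_result = sim_weighted_sum(sim_r, sim_c)  # equals w_c * 1

print(product_result, additive_result)
```

With these inputs the product is 0, while the weighted sum still reports a noticeable similarity that comes from the fillers alone.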
    To overcome this shortcoming I have recently added the notion of thresholds from neural networks to the additive similarity approach: it defines a minimum similarity value that simr and simc need to exceed, otherwise sime is 0. The question of how to derive the weightings and the threshold is still open, but maybe it is possible to integrate the notions of commonality and variability used in MDSM [37] for this purpose. Although the theory presented in [73] so far uses the product similarity approach, its idea of context-awareness is comparable to MDSM, and therefore a combination seems promising. As a start I have used a threshold t = 0.3, ωr = 0.6 and ωc = 0.4 for some first experiments within a simplified accommodation ontology.
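
A possible reading of the thresholded additive approach, as a hedged Python sketch: the function name is hypothetical, the inputs are assumed to be similarities in [0, 1], and "exceed" is interpreted here as strictly greater than t, which the text does not specify.

```python
def sim_thresholded(sim_r: float, sim_c: float,
                    t: float = 0.3, w_r: float = 0.6, w_c: float = 0.4) -> float:
    """Weighted sum that is cut to 0 unless both inputs exceed the threshold t."""
    # Assumption: a value equal to t does not pass the threshold.
    if sim_r <= t or sim_c <= t:
        return 0.0
    return w_r * sim_r + w_c * sim_c


# Dissimilar roles with identical fillers: the threshold forces the result to 0.
print(sim_thresholded(0.0, 1.0))
# Both similarities above the threshold: the plain weighted sum is returned.
print(sim_thresholded(0.8, 0.5))
```

In this form the additive measure behaves like the product for clearly dissimilar roles or fillers, but keeps the weighted-sum behaviour everywhere else.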

[37] Rodríguez, A. M. and M.J. Egenhofer, Comparing Geospatial Entity Classes: An Asymmetric and Context-Dependent Similarity Measure. International Journal of Geographical Information Science, 2004. 18(3): p. 229-256.

[73] Janowicz, K. (2006). Sim-DL: Towards a Semantic Similarity Measurement Theory for the Description Logic ALCNR in Geographic Information Retrieval. R. Meersman, Z. Tari, P. Herrero et al. (Eds.): SeBGIS 2006, OTM Workshops 2006, LNCS 4278, pp. 1681 – 1692, 2006. 

Hybrid Approaches to Similarity?

I have added a new category called ‘Hybrid Approaches to Similarity’ to the literature section; however, I am not entirely satisfied with doing so. Some authors explicitly state that their approaches are hybrid, but in my opinion this is the case for most recent theories. For instance, MDSM [37] is an extended version of Tversky’s ratio model [4] and therefore a classical feature-based approach. Nevertheless, in equations 2 and 3 a network model (based on the distance to the least upper bound) is chosen to determine the weighting α and therefore the asymmetry. Should this be called hybrid?
    As a start I put some papers into this section that clearly combine several approaches. A good example may be Schwering’s hybrid model [51].
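
To make the feature-based part of this discussion concrete, here is a minimal Python sketch of Tversky's ratio model [4], using set cardinality as the salience function. The feature sets are invented for illustration and are not the WordNet-based descriptions; the value of α and the coupling β = 1 − α are assumptions made for this sketch, standing in for the hierarchy-derived weighting described above.

```python
def ratio_model(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky's ratio model with set cardinality as the salience function."""
    common = len(a & b)   # features shared by both
    only_a = len(a - b)   # distinctive features of a
    only_b = len(b - a)   # distinctive features of b
    denominator = common + alpha * only_a + beta * only_b
    # Convention chosen here: return 0 when there is nothing to compare.
    return common / denominator if denominator > 0 else 0.0


# Hypothetical feature sets (assumption, for illustration only).
pen = {"writing implement", "ink", "ball", "metal"}
pencil = {"writing implement", "graphite", "wood"}

alpha = 0.8  # assumed value, not derived from any hierarchy
pen_to_pencil = ratio_model(pen, pencil, alpha, 1 - alpha)
pencil_to_pen = ratio_model(pencil, pen, alpha, 1 - alpha)

# The two directions differ whenever alpha != 0.5 and the sets have
# different numbers of distinctive features.
print(pen_to_pencil, pencil_to_pen)
```

The asymmetry comes entirely from the weighting of the distinctive features; how α is obtained is exactly the point where a network model would enter.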

[37] Rodríguez, A. M. and M.J. Egenhofer, Comparing Geospatial Entity Classes: An Asymmetric and Context-Dependent Similarity Measure. International Journal of Geographical Information Science, 2004. 18(3): p. 229-256.

[4] Tversky, A. (1977) Features of Similarity. Psychological Review. 84(4): p.327-352.

[51] Schwering, A. (2005). Hybrid model for semantic similarity measurement. 4th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE05). Agia Napa, Cyprus. Springer.


The literature section is up-to-date again. Please let me know if something is missing or a link is broken: [Literature]