The Distributed Search Breakout Group

Bob, Luca, Ethan, Chris, Bryan, Mick

 

We decided the issue was about both searching and browsing data.

 

Key requirements are to support both free text search and searches via controlled vocabularies.

 

Recommendation: The ESP community needs to work out how to fund the long-term maintenance of vocabularies. Maintaining vocabularies needs both resources and a set of public procedures.

 

In the future we need to move to thesauri and ontologies.

 

Much of the distributed search problem is not specific to our discipline, and we should avoid discipline specific solutions to generic problems.

 

We should learn lessons: large distributed searches in real time do not perform well enough if you have to wait for the lowest common denominator, or if the data you want is behind a search interface which is down.

 

Agreed that a harvesting methodology was most appropriate.

 

Recommendation: Suppliers of search metadata should serve up one or more of Dublin Core and GCMD DIF to an OAI server/client structure.

(It should be noted that we thought the voting community for the maintenance of the DIF wasn’t representative of our community, an issue that eventually needs addressing.)

-         serve up Dublin Core

-         serve up DIF   (DIF: voting community a problem)

-         Can OAI allow users to retrieve either? Build services.
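On the question of retrieving either format: OAI-PMH names each metadata format with a `metadataPrefix` argument, and a repository may expose more than one (only `oai_dc` is required by the protocol). A minimal sketch of how a harvester might build its requests follows. The base URL is hypothetical, the `from` date is an arbitrary example, and the `dif` prefix is an assumption about how a repository might register DIF.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, for illustration only.
BASE_URL = "http://example.org/oai"


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # 'from' limits the harvest to records changed on or after this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Dublin Core is the format every OAI-PMH repository must offer.
dc_url = list_records_url(BASE_URL, "oai_dc")

# Assumed prefix 'dif'; the real prefix is whatever the repository advertises.
dif_url = list_records_url(BASE_URL, "dif", from_date="2004-01-01")

# ListMetadataFormats asks the repository which prefixes it supports.
formats_url = BASE_URL + "?" + urlencode({"verb": "ListMetadataFormats"})
```

A harvester would normally call `ListMetadataFormats` first and only request DIF from repositories that advertise it.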

 

Recommendation: A number of us should put up OAI servers and build a testbed for the next meeting.

 

One or more places should harvest the data and build search portals, but the actual structure of the search portal should be up to the local institutions. We should try a few things: Lucene-based searching, SQL-based searching, and making the best use of spatial and temporal bounding boxes.
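As an illustration of searching on spatial and temporal bounding boxes, here is a minimal sketch over harvested records held in memory. The `Record` fields and the sample records are made up for the example, and the overlap test assumes boxes that do not cross the dateline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    # Hypothetical harvested record: a title plus its bounding boxes.
    title: str
    west: float
    south: float
    east: float
    north: float
    start: date
    end: date


def boxes_overlap(rec, west, south, east, north):
    """True if the record's box intersects the query box (no dateline wrap)."""
    return (rec.west <= east and rec.east >= west
            and rec.south <= north and rec.north >= south)


def periods_overlap(rec, start, end):
    """True if the record's time range intersects the query period."""
    return rec.start <= end and rec.end >= start


def search(records, bbox, period):
    """Return records that overlap both the query box and the query period."""
    return [r for r in records
            if boxes_overlap(r, *bbox) and periods_overlap(r, *period)]


# Made-up sample data.
records = [
    Record("North Atlantic SST", -80, 0, 0, 65, date(1990, 1, 1), date(1999, 12, 31)),
    Record("Pacific winds", 120, -30, 180, 30, date(1995, 1, 1), date(2000, 12, 31)),
]
hits = search(records,
              bbox=(-30, 40, 10, 60),
              period=(date(1998, 1, 1), date(1998, 12, 31)))
```

The same overlap conditions translate directly into an SQL `WHERE` clause or into range queries in a text index.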

 

The records clearly need direct URI pointers to the data, but the type of pointer should itself be marked up (as it is in DIF).

 

Once we have the results of a search and wish to specify the data we want, we will need a standardised way of representing spatial and temporal constraints that goes beyond the level of discovery. We thought the ISO standards could be used for this, provided the user interface can be made simple enough and the documentation is available.

 

One optional (and desirable) pointer would be a catalogue document URI, returned as well to allow more complex searching:

-         need to preserve attribution of the data hoster (which is often obscured in current implementations)

-         access control needs to be built into metadata catalogues …

THREDDS is a suitable technology for doing this.

 

It is desirable to embed in the catalogue some sort of query capability description. An example of this could be the DQC option of THREDDS.

 

 

New Technologies and the future:

 

With a small list of returned URIs, one could offer the option of more specific distributed browsing, based on a distributed search, if the returned hosts share common search APIs. The key requirement here is that it should be possible to understand the query capabilities. In principle this is on the roadmap for OGSA-DAI, for example, where it should eventually be possible to discover the underlying schema and then generate common SQL queries and do distributed table joins. Even without that capability, a key first step is that it should be possible to discover the types of queries that a host supports in enough detail to compose simple specific queries against more sophisticated local schemas (for example the ESG schema or ISO 19115).

 

Recommendation: We need to make sure that we design a method of discovering the query capabilities of a portal.
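One way to picture capability discovery is as a small published description per portal that a client checks before composing a query. The sketch below assumes such descriptions have already been fetched. The portal names, the capability labels and the query layout are all hypothetical, not an existing standard.

```python
# Hypothetical capability descriptions, as each portal might publish them.
CAPABILITIES = {
    "portal_a": {"free_text", "bbox", "time_range"},
    "portal_b": {"free_text"},
}


def required_capabilities(query):
    """Return the query types this query actually uses (non-empty fields)."""
    return {name for name, value in query.items() if value is not None}


def portals_supporting(query, capabilities):
    """List the portals whose advertised capabilities cover the whole query."""
    needed = required_capabilities(query)
    return sorted(p for p, caps in capabilities.items() if needed <= caps)


# Made-up query: free text plus a bounding box, no time constraint.
query = {"free_text": "ozone", "bbox": (-10, 50, 5, 60), "time_range": None}
usable = portals_supporting(query, CAPABILITIES)
```

Here only `portal_a` can answer the query in full. A client could still send the free-text part to `portal_b` and filter the results locally.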

 

One risk of the distributed query approach is the time delay associated with queries. An important part of the search system should be to provide a basic estimate of computational cost, so that unexplained timeouts are avoided.
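A cost estimate need not be sophisticated to be useful. The sketch below uses an assumed linear model (fixed latency plus scan time) and made-up host figures. The point is that hosts expected to exceed the time budget are skipped with a stated reason instead of timing out silently.

```python
def estimate_seconds(record_count, records_per_second, latency_seconds):
    """Crude assumed cost model: fixed latency plus a linear scan time."""
    return latency_seconds + record_count / records_per_second


def plan_query(hosts, budget_seconds):
    """Split hosts into those to query and those skipped, with a reason."""
    run, skipped = [], []
    for host in hosts:
        cost = estimate_seconds(host["records"], host["rate"], host["latency"])
        if cost <= budget_seconds:
            run.append((host["name"], cost))
        else:
            reason = f"estimated {cost:.1f}s exceeds {budget_seconds}s budget"
            skipped.append((host["name"], reason))
    return run, skipped


# Made-up host statistics, for illustration only.
hosts = [
    {"name": "host_a", "records": 10_000, "rate": 5_000, "latency": 0.5},
    {"name": "host_b", "records": 2_000_000, "rate": 5_000, "latency": 0.5},
]
run, skipped = plan_query(hosts, budget_seconds=10)
```

The `skipped` list gives the user an explanation up front, which is the behaviour the notes ask for.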

 

We also need to address how we rank search results. A new algorithm may make use of one or more of the following: friends-and-neighbours links within metadata, and weighting by papers and reviews (which will require post-hoc metadata additions). Provenance may also be important for ranking, as will quality control information if present.
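To make the ranking idea concrete, here is a minimal sketch of a weighted score combining the signals mentioned above. The field names, the weights and the sample records are assumptions for illustration. The notes do not propose specific values.

```python
# Assumed weights; chosen only to illustrate combining several signals.
WEIGHTS = {
    "text_match": 1.0,          # relevance of the free-text match, 0..1
    "citations": 0.5,           # papers and reviews citing the dataset
    "links": 0.3,               # friends-and-neighbours links in the metadata
    "quality_controlled": 0.2,  # quality control information present
}


def score(record, weights=WEIGHTS):
    """Weighted sum of whichever ranking signals the record carries."""
    return sum(w * float(record.get(name, 0)) for name, w in weights.items())


def rank(records):
    """Return records ordered from highest to lowest score."""
    return sorted(records, key=score, reverse=True)


# Made-up records.
records = [
    {"id": "a", "text_match": 0.9, "citations": 0, "links": 1,
     "quality_controlled": False},
    {"id": "b", "text_match": 0.6, "citations": 2, "links": 0,
     "quality_controlled": True},
]
ranked_ids = [r["id"] for r in rank(records)]
```

With these assumed weights, record `b` outranks `a` despite a weaker text match, because citations and quality control contribute to its score.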