[xquery-talk] Regular Expression search

John Snelson jsnelson at sleepycat.com
Fri Dec 16 10:45:11 PST 2005


Martin Probst wrote:
>>>Are you sure? It's probably possible for simple cases (e.g. "Foo|Bar"),
>>>but for general regular expressions? How would you do that?
>>
>>Sleepycat's Berkeley DB XML has what it calls a "substring" index which 
>>is currently used to optimise fn:contains(), fn:starts-with() and 
>>fn:ends-with(). This works by splitting the content down into sequential 
>>three character segments, ie:
>>
>>"abccccb" is split into "abc", "bcc", "ccc", "ccb"
>>
>>This type of index could be used to optimise regular expression. If you 
>>define a regular expression to match the string above, it might look 
>>like this:
>>
>>"abc+b"
>>
>> From this regular expression, you can see that the keys you need to 
>>look up in the container are:
>>
>>"abc" & ("bcb" | ("bcc" & "ccb") | ("bcc" & "ccc" & "ccb"))
> 
> Interesting. Though of course the real general case for Regular
> Expressions is probably just out of reach.

Yes. This scheme is never going to be able to optimise regular 
expressions with less than three sequential non-wilcard characters in 
it. However I would suggest that quite often this sort of index 
optimisation will be useful to real world applications.

John

-- 
John Snelson, Berkeley DB XML Engineer
Sleepycat Software, Inc
http://www.sleepycat.com

Contracted to Sleepycat through Parthenon Computing Ltd
http://blog.parthcomp.com/dbxml


More information about the talk mailing list