Where are the shortcomings of the full-text search, used in modern DBMS's? You cannot specify how many words before and after of the found match should be extracted, which tags should surround found match, including in what part of string are they found, which and how many variants of permutations of found words should be extracted. You have desire to write arbitrary FTS-query, but have no possibility. Let's try to eliminate that limitation.
Let's have table 's' with fields 'pk' (primary key in it), 's1' and 's2', single record containing
1, 10, "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."
Imagine, that string is divided into words, and words are stored in nested table with fields
|
Nested column, as well as the result of function, at least one argument of which is nested column, behaves itself
Operations with nested column have the following features
Several samples can be found even in one string, so even one record can cause several records: we shall name this process as propagation, and records, created from one - as propagated group. So always fictitious integer field SYS_CLUE, which contain different meanings for records of one propagated group, is appended automatically to result set [4]. E.g. next query, asking words from particular set [5], and extracting them and one word on the left and on the right of them
SELECT s1, s2.@TOKEN FROM s WHERE s2.@SN in ( SELECT DISTINCT s2.@SN FROM s, ( SELECT s2.@SN as fn FROM s WHERE s2.@TOKEN in "comes next" ) WHERE abs(s2.@SN-fn) <= 1 );finds two samples and returns two records
|
If
It is guaranteed in all cases, that repeated full text search in the same record, or in results of other full text search will give propagated records with the same meanings of field SYS_CLUE.
To make different operation (to surround by different tags) with different words, it is enough to allow to give aliases to arguments of functions, in particular - to function of concatenation. Than, for example, surrounding of words from particular set by tags <b> and </b>, one word on the left and on the right of them by tags <em> and </em>, and returning of all other words between them without surrounding looks so
SELECT s1, ("<b>" ||s2.@TOKEN as f1 ||"</b>" ) || ("<em>"||s2.@TOKEN as f2 ||"</em>") || ( s2.@TOKEN as f3 ) FROM s WHERE f1 in "comes next" AND f2 IN ( SELECT DISTINCT ON(s2.@token, s2.@SN) s2.@token FROM s, ( SELECT s2.@SN as fn FROM s WHERE s2.@TOKEN in "comes next" ) WHERE abs(s2.@SN-fn)=1 ) AND f3 between SELECT MIN(s2.@SN) FROM s WHERE s2.@TOKEN in "comes next" AND SELECT MAX(s2.@SN) FROM s WHERE s2.@TOKEN in "comes next" AND NOT IN ( SELECT DISTINCT ON(s2.@token, s2.@SN) s2.@token FROM s, ( SELECT s2.@SN as fn FROM s WHERE s2.@TOKEN in "comes next" ) WHERE abs(s2.@SN-fn)=1 );And returns the following result
|
Sub-fields are appended as result of indexing
All grammatical forms of one word can be considered as one lexeme. So the following sub-field is appended
Directory of grammatical forms can be not loaded, or can not contain some words or their forms. Than indexed search by all words (or their forms) is impossible - only by indexed ones. So as soon as index for text field is built
|
|
CREATE INDEX i1 ON tokens( idtoken ); CREATE INDEX i2 ON tokens( token ); CREATE INDEX i3 ON tokens( idlexeme ); CREATE INDEX i4 ON items( idfield, pk, idtoken ); CREATE INDEX i5 ON items( idfield, pk, sn );All these indexes must be automatically removed at deleting of any table 'delimiters', 'tokens', 'items' (it is impossible to build second table, similar to 'items, on template for comparison without 'delimiters' and 'tokens' - on constant "come next" in our case).
That indexing would possible without directory of lexemes, let's enter command (separate from command to fill table 'items')
TOKENIZE s(s2) INTO tokens DELIMITING delimiters [, delimiters2];which leave field 'idlexeme' un-filled. And we will use command to fill a table from file to load directory of lexemes (field 'idtoken' will be filled from its own sequence)
COPY tokens( idlexeme, token ) FROM c:/lexeme.txtWe will make factorization of field 's2' of all records by command
ITEMIZE s(s2) INTO items DELIMITING delimiters [, delimiters2] TOKENIZING tokens;Operations '=', IN and others, working with text fields and with 's2' in particular, use indexes, built not for 's2', but for tables, specified in parameter NOMENCLARURE [10], and for ones, to which tables from NOMENCLARURE refer by foreign key, mentioned above
SET NOMENCLARURE items [, items2];
[1] I.e. there is no need to write 'ORDER BY s2.@SN'
[2] Extraction of field s2.@SN returns string, consisting of serial numbers of found words, instead of words themselves; s2.@BEGINNING - of offsets of first letter of words, s2.@END - of offsets of last letter of words
[3] Symbols, specified in OMITTED_FIRST and OMITTED_LAST, are appended to beginnning and/or end of found string, if it is necessary to throw initial/terminal words to obtain string
[4] "Always" mean even if all propagated groups consist of one record, and field SYS_CLUE does not mentioned in query. Field SYS_CLUE can contain identical meanings in different different groups. Meaning of field is necessary for client program to inform server, which particular sample from group have been chosen by user. If there is no primary key, it is impossible to distinguish groups
[5] It is possible to specify permutation of words of particular set (details about permutation '=~' are on p.183-186 of pdf-document)
WHERE s2.@TOKEN =~ "come next"including with restriction of quantity of permutations (results are always given, beginning from the least quantity of permutations, into direction of increasing of quantity)
WHERE s2.@TOKEN TO "come next" PERMUTATIONS <=2
[6] Field SYS_CLUE can contain identical meanings in Cartesian product of different pairs of groups
[7] We can use quantifier ALL before name of sub-field to force to non-index search by all words
SELECT s1, ALL s2.@TOKEN FROM s;
CREATE SEQUENCE delimiters_seq; CREATE TABLE delimiters ( iddelimiter integer DEFAULT nextval('delimiters_seq'), delimiter string );
[9] 'idfield' is unique system identifier of field 's2' itself. It is filled by command ITEMIZE, that it would possible to search by command 'SELECT ... FROM items' at once in much fields of much tables
[10] NOMENCLARURE is session parameter
P.S.
Article clirifies p.191-197 of pdf-document.
Dima Turin, dmitryturin@yandex.ru