List of articles   Choose language


Make full text search controllable and usable

Where are the shortcomings of the full-text search, used in modern DBMS's? You cannot specify how many words before and after of the found match should be extracted, which tags should surround found match, including in what part of string are they found, which and how many variants of permutations of found words should be extracted. You have desire to write arbitrary FTS-query, but have no possibility. Let's try to eliminate that limitation.

Controllability

Nested fields

Let's have table 's' with fields 'pk' (primary key in it), 's1' and 's2', single record containing

1, 10, "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."

Imagine, that string is divided into words, and words are stored in nested table with fields
Table, nested in text field
@TOKEN@SN@BEGINNING@END
In112
the246
morning3814
dog41719
comes52124
cat62729
comes73134
home83639
too94143
Continue104653
in115556
the125860
NEXT136265
issue146771

And syntactically we have access to these fields as to fields of table 's'(as to nested fields). Each text field of each record of each table has this (syntactical) representation.

Nested column, as well as the result of function, at least one argument of which is nested column, behaves itself

Presentation of text field in DBMS as nested table allows to formulate conditions for full text search flexibly, having mentioned nested columns under clause WHERE, and to see result of search as text string automatically at extraction into external world.

Operations with nested column have the following features

  1. insertion, updating, removal are so, that
  2. any function of nested column and non-nested field of any table, or of nested column and constant
  3. concatenation of two nested columns
First feature allows to change text fields by SQL without resort of bulk of string functions in DBMS (this topic remains outside current article); second and third - to surround words by tags-constants.

Solution of collisions

Several samples can be found even in one string, so even one record can cause several records: we shall name this process as propagation, and records, created from one - as propagated group. So always fictitious integer field SYS_CLUE, which contain different meanings for records of one propagated group, is appended automatically to result set [4]. E.g. next query, asking words from particular set [5], and extracting them and one word on the left and on the right of them

SELECT s1, s2.@TOKEN
FROM   s
WHERE  s2.@SN in (
  SELECT DISTINCT s2.@SN
  FROM   s, (
    SELECT s2.@SN as fn
    FROM   s
    WHERE  s2.@TOKEN in "comes next"
            )
  WHERE  abs(s2.@SN-fn) <= 1
                 );
finds two samples and returns two records

Search with surrounding
s1s2SYS_CLUE
10dog comes, cat ... the NEXT issue1
10cat comes home ... the NEXT issue2

If

It is guaranteed in all cases, that repeated full text search in the same record, or in results of other full text search will give propagated records with the same meanings of field SYS_CLUE.

Surrounding by tags

To make different operation (to surround by different tags) with different words, it is enough to allow to give aliases to arguments of functions, in particular - to function of concatenation. Than, for example, surrounding of words from particular set by tags <b> and </b>, one word on the left and on the right of them by tags <em> and </em>, and returning of all other words between them without surrounding looks so

SELECT s1, ("<b>" ||s2.@TOKEN as f1 ||"</b>" ) ||
           ("<em>"||s2.@TOKEN as f2 ||"</em>") ||
           (        s2.@TOKEN as f3          ) 
FROM   s
WHERE  f1 in "comes next"
  AND  f2 IN (
         SELECT DISTINCT ON(s2.@token, s2.@SN) s2.@token
         FROM   s, (
           SELECT s2.@SN as fn
           FROM   s
           WHERE  s2.@TOKEN in "comes next"
                   )
         WHERE  abs(s2.@SN-fn)=1
             )
  AND f3 between             
         SELECT MIN(s2.@SN)
         FROM   s
         WHERE  s2.@TOKEN in "comes next"
      AND
         SELECT MAX(s2.@SN)
         FROM   s
         WHERE  s2.@TOKEN in "comes next"
      AND NOT IN (
         SELECT DISTINCT ON(s2.@token, s2.@SN) s2.@token
         FROM   s, (
           SELECT s2.@SN as fn
           FROM   s
           WHERE  s2.@TOKEN in "comes next"
                   )
         WHERE  abs(s2.@SN-fn)=1
                   );
And returns the following result
Search with surrounding
s1s2SYS_CLUE
10<em>dog</em> <b>comes</b>, <em>cat</em> comes home too. Continue in <em>the</em> <b>NEXT</b> <em>issue</em>1
10<em>cat</em> <b>comes</b> <em>home</em> too. Continue in <em>the</em> <b>NEXT</b> <em>issue</em>2

Indexing

Sub-fields are appended as result of indexing

to which access are possible syntactically as to

Usa of lexeme indexing

All grammatical forms of one word can be considered as one lexeme. So the following sub-field is appended

to which access is possible syntactically as to

Usability

Basics of indexing searching

Directory of grammatical forms can be not loaded, or can not contain some words or their forms. Than indexed search by all words (or their forms) is impossible - only by indexed ones. So as soon as index for text field is built

There is a need for
tokens
idtokentokenidlexeme
1in1
2the2
3morning3
4dog4
5comes5
12come5
6cat6
7home7
8too8
9continue9
10next10
11issue11
items
idfieldpkidtokenown nameabbreviationsnbeginningend
50511yes 112
50511  115556
50512  246
50512  125860
50513  3814
50514  41719
50515  52124
50515  73134
50516  62729
50517  83639
50518  94143
50519yes 104653
505110 yes136265
505111  146771

Than indexing is building of five indexes
CREATE INDEX i1 ON tokens( idtoken  );
CREATE INDEX i2 ON tokens( token    );
CREATE INDEX i3 ON tokens( idlexeme );

CREATE INDEX i4 ON items( idfield, pk, idtoken );
CREATE INDEX i5 ON items( idfield, pk, sn      );
All these indexes must be automatically removed at deleting of any table 'delimiters', 'tokens', 'items' (it is impossible to build second table, similar to 'items, on template for comparison without 'delimiters' and 'tokens' - on constant "come next" in our case).

Building and appling of index

That indexing would possible without directory of lexemes, let's enter command (separate from command to fill table 'items')

TOKENIZE s(s2) INTO tokens DELIMITING delimiters [, delimiters2];
which leave field 'idlexeme' un-filled. And we will use command to fill a table from file to load directory of lexemes (field 'idtoken' will be filled from its own sequence)

COPY tokens( idlexeme, token ) FROM c:/lexeme.txt
We will make factorization of field 's2' of all records by command
ITEMIZE s(s2) INTO items DELIMITING delimiters [, delimiters2] TOKENIZING tokens;
Operations '=', IN and others, working with text fields and with 's2' in particular, use indexes, built not for 's2', but for tables, specified in parameter NOMENCLARURE [10], and for ones, to which tables from NOMENCLARURE refer by foreign key, mentioned above
SET NOMENCLARURE items [, items2];

[1] I.e. there is no need to write 'ORDER BY s2.@SN'

[2] Extraction of field s2.@SN returns string, consisting of serial numbers of found words, instead of words themselves; s2.@BEGINNING - of offsets of first letter of words, s2.@END - of offsets of last letter of words

[3] Symbols, specified in OMITTED_FIRST and OMITTED_LAST, are appended to beginnning and/or end of found string, if it is necessary to throw initial/terminal words to obtain string

[4] "Always" mean even if all propagated groups consist of one record, and field SYS_CLUE does not mentioned in query. Field SYS_CLUE can contain identical meanings in different different groups. Meaning of field is necessary for client program to inform server, which particular sample from group have been chosen by user. If there is no primary key, it is impossible to distinguish groups

[5] It is possible to specify permutation of words of particular set (details about permutation '=~' are on p.183-186 of pdf-document)

WHERE s2.@TOKEN =~ "come next"
including with restriction of quantity of permutations (results are always given, beginning from the least quantity of permutations, into direction of increasing of quantity)
WHERE s2.@TOKEN TO "come next" PERMUTATIONS <=2

[6] Field SYS_CLUE can contain identical meanings in Cartesian product of different pairs of groups

[7] We can use quantifier ALL before name of sub-field to force to non-index search by all words

SELECT s1, ALL s2.@TOKEN
FROM   s;

[8]

CREATE SEQUENCE delimiters_seq;
CREATE TABLE delimiters (
  iddelimiter  integer DEFAULT nextval('delimiters_seq'),
  delimiter    string
);

[9] 'idfield' is unique system identifier of field 's2' itself. It is filled by command ITEMIZE, that it would possible to search by command 'SELECT ... FROM items' at once in much fields of much tables

[10] NOMENCLARURE is session parameter



P.S.

Article clirifies p.191-197 of pdf-document.



Dima Turin, dmitryturin@yandex.ru



List of articles   Choose language


Используются технологии uCoz