Make full text search controllable and usable

Where does the full-text search of modern DBMSs fall short? You cannot specify how many words before and after a match should be extracted, which tags should surround the match, in which part of the string matches are looked for, or which and how many permutations of the found words should be extracted. You may want to write an arbitrary FTS query, but there is no way to do it. Let's try to remove that limitation.

Controllability

Nested fields

Let table 's' have the fields 'pk' (its primary key), 's1' and 's2', and a single record containing

1, 10, "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."

Imagine that the string is divided into words, and that the words are stored in a nested table with the fields

@TOKEN (word itself)
@SN (serial number of word in field)
@BEGINNING (offset of first letter of word)
@END (offset of last letter of word)

**Table, nested in text field**
@TOKEN	@SN	@BEGINNING	@END
In	1	1	2
the	2	4	6
morning	3	8	14
dog	4	17	19
comes	5	21	25
cat	6	28	30
comes	7	32	36
home	8	38	41
too	9	43	45
Continue	10	48	55
in	11	57	58
the	12	60	62
NEXT	13	64	67
issue	14	69	73

Syntactically, these fields are accessed as the fields

s2.@TOKEN
s2.@SN
s2.@BEGINNING
s2.@END

of table 's' (that is, as nested fields). Every text field of every record of every table has this (syntactical) representation.
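The nested-table view can be emulated outside the DBMS. The following is a minimal sketch, not the proposed implementation: it assumes that a word is any run of letters or digits (the regular expression `\w+`), and the names `Token` and `nested_table` are hypothetical.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    token: str       # @TOKEN, the word itself
    sn: int          # @SN, 1-based serial number of the word
    beginning: int   # @BEGINNING, 1-based offset of the first letter
    end: int         # @END, 1-based offset of the last letter

def nested_table(text: str) -> List[Token]:
    """Split a text field into words and record where each word sits."""
    rows = []
    for sn, m in enumerate(re.finditer(r"\w+", text), start=1):
        # m.start() is 0-based, m.end() is exclusive, hence the asymmetry.
        rows.append(Token(m.group(), sn, m.start() + 1, m.end()))
    return rows

s2 = "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."
for row in nested_table(s2):
    print(row.token, row.sn, row.beginning, row.end, sep="\t")
```

Each printed row corresponds to one row of the nested table: s2.@TOKEN, s2.@SN, s2.@BEGINNING and s2.@END.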

A nested column, as well as the result of a function that has at least one nested column among its arguments, behaves as follows:

if it is not itself positioned under the SELECT clause of the query (i.e. it is under the UPDATE, DELETE or WHERE clause, or anywhere in a sub-query, including the SELECT clause of the sub-query), it behaves as a nested column
if it is under the SELECT clause of the query, it behaves as a non-nested text field named 's2', created by a concatenation aggregate over the requested nested column. It is guaranteed that
- the order of words remains unchanged [1]
- all delimiters (punctuation marks) that stand between found words which were neighbors in the original string come into the query result [2]
- found words that were not neighbors are separated by the symbols specified by the command 'SET OMITTED_MEDIATE ...' [3]

Representing a text field in the DBMS as a nested table makes it possible to formulate full text search conditions flexibly, by mentioning nested columns under the WHERE clause, and to see the search result as a text string automatically when it is extracted to the outside world.
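The three guarantees of the concatenation aggregate can be sketched as follows. This is an illustration under assumptions, not the DBMS behavior itself: words are matched with `\w+`, the OMITTED_MEDIATE symbols are assumed to be " ... ", and the function names are hypothetical.

```python
import re

def tokenize(text):
    """Return (token, sn, begin, end) tuples; begin and end are 0-based slice bounds."""
    return [(m.group(), sn, m.start(), m.end())
            for sn, m in enumerate(re.finditer(r"\w+", text), start=1)]

def concatenate(text, selected_sns, omitted_mediate=" ... "):
    """Join the selected words into one string.

    Words keep their original order, neighbors keep the original
    delimiters between them, and a gap is replaced by omitted_mediate.
    """
    parts = []
    prev = None                      # (sn, end) of the previous selected word
    for token, sn, begin, end in tokenize(text):
        if sn not in selected_sns:
            continue
        if prev is not None:
            if sn == prev[0] + 1:
                parts.append(text[prev[1]:begin])   # original delimiters
            else:
                parts.append(omitted_mediate)       # words in between were skipped
        parts.append(token)
        prev = (sn, end)
    return "".join(parts)

s2 = "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."
print(concatenate(s2, {4, 5, 6, 12, 13, 14}))   # dog comes, cat ... the NEXT issue
```

No explicit sorting is needed because the words are visited in serial-number order, which is why 'ORDER BY s2.@SN' is unnecessary in the proposed syntax.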

Operations on a nested column have the following features

insertion, updating and removal work so that

when a record is inserted with a serial number that already exists for some record, 'INSERT INTO s (s2.@TOKEN, s2.@SN) VALUES ("new", 15)', the new record is placed before the old one ("insertion before")
updating and removal (for example, 'UPDATE s SET s2.@TOKEN = ""||s2.@TOKEN||"" ', 'DELETE FROM s WHERE s2.@BEGINNING >= 100') have no new features (for example, to replace three words with four other words, the three words must be removed with the DELETE command and the four words inserted with the INSERT command, instead of using the UPDATE command)
after insertion or removal, @SN, @BEGINNING and @END are changed automatically for all records located after the newly inserted record or the first deleted record

any function of a nested column and a non-nested field of any table, or of a nested column and a constant

gives a new nested table (with one column) of the same parent table
the new nested table may have no explicit column containing the serial numbers of words, but implicitly this column always exists
the order of rows in it repeats the order of rows in the original nested table
concatenation of the requested nested column into a non-nested field occurs in the order specified by this implicit column

concatenation of two nested columns

is a UNION of them, and gives a new nested table (with a column name that coincides with the name of the left operand of the concatenation) of the same parent table
similarly, the new nested table may have no explicit column containing the serial numbers of words, but implicitly this column always exists
the order of rows in it is specified by the serial numbers in both original columns (values from the two original nested columns can alternate in it); if two values have identical serial numbers, the value from the left operand of the concatenation goes first, then the value from the right operand
new actual serial numbers are assigned automatically in the new nested table after the concatenation

The first feature makes it possible to change text fields with SQL without resorting to the bulk of string functions in the DBMS (this topic is outside the scope of the current article); the second and third make it possible to surround words with constant tags.
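The features above can be sketched on a nested column modeled as a list of (serial number, word) pairs. This is only an illustration: the function names are hypothetical, and `concat_columns` merges all of its operands at once and renumbers only at the end, which is an assumption made to keep chained concatenations aligned.

```python
def insert_before(column, token, sn):
    """INSERT with an existing serial number: the new word goes before
    the old one, and all following words are renumbered."""
    words = [t for _, t in sorted(column)]
    words.insert(sn - 1, token)
    return list(enumerate(words, start=1))

def wrap(column, before, after):
    """Function of a nested column and constants: applied word by word,
    serial numbers and order stay unchanged."""
    return [(sn, before + t + after) for sn, t in column]

def concat_columns(*columns):
    """UNION of nested columns ordered by serial number; on equal serial
    numbers the value of the operand further to the left goes first."""
    tagged = [(sn, pos, t) for pos, col in enumerate(columns) for sn, t in col]
    tagged.sort(key=lambda row: (row[0], row[1]))
    # New actual serial numbers are assigned after the merge.
    return [(new_sn, t) for new_sn, (_, _, t) in enumerate(tagged, start=1)]

words = [(1, "dog"), (2, "comes"), (3, "cat")]
print(insert_before(words, "new", 2))

bold = wrap([(5, "comes"), (13, "NEXT")], "<b>", "</b>")
em = wrap([(4, "dog"), (6, "cat")], "<em>", "</em>")
print(concat_columns(bold, em))
```

In the second call the values of the two columns alternate in the result, because their serial numbers interleave.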

Resolving collisions

Several samples can be found even in one string, so even one record can give rise to several records. We call this process propagation, and the records created from one record a propagated group. A fictitious integer field SYS_CLUE, which contains different values for the records of one propagated group, is therefore always appended automatically to the result set [4]. For example, the following query, which asks for words from a particular set [5] and extracts them together with one word to the left and one word to the right of them

SELECT s1, s2.@TOKEN
FROM   s
WHERE  s2.@SN in (
  SELECT DISTINCT s2.@SN
  FROM   s, (
    SELECT s2.@SN as fn
    FROM   s
    WHERE  s2.@TOKEN in "comes next"
            )
  WHERE  abs(s2.@SN-fn) <= 1
                 );

finds two samples and returns two records

**Search with surrounding**
s1	s2	SYS_CLUE
10	dog comes, cat ... the NEXT issue	1
10	cat comes home ... the NEXT issue	2

If a full text search is performed again on the result of a previous search, then each record of a propagated group can in turn give rise to a new propagated group (a group of second order), but field SYS_CLUE still contains different values for the records of all groups of second, third and higher orders derived from one original record (i.e. a second fictitious field is not needed to distinguish the records of a second-order group).
If a query performs full text search in two (three and so on) fields of one table, then the samples of the different columns give a Cartesian product, but field SYS_CLUE still contains different values for the records of the Cartesian product of each two (three and so on) groups (i.e. second, third and further fictitious fields are not needed to distinguish the records of the Cartesian product) [6].
If a query makes a Cartesian product of different tables and performs full text search in fields that initially belonged to different tables, then field SYS_CLUE contains the same values as if the fields belonged to one table.

In all cases it is guaranteed that a repeated full text search in the same record, or in the results of another full text search, gives propagated records with the same values of field SYS_CLUE.
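The propagation described in this section can be sketched as follows. The sketch assumes that a sample is one combination of occurrences, one per requested word, that matching ignores case, and that OMITTED_MEDIATE is " ... "; these assumptions are chosen because they reproduce the two records of the result table above. The name `propagate` is hypothetical.

```python
import re
from itertools import product

def tokenize(text):
    """Return (token, sn, begin, end) tuples; begin and end are 0-based slice bounds."""
    return [(m.group(), sn, m.start(), m.end())
            for sn, m in enumerate(re.finditer(r"\w+", text), start=1)]

def render(text, rows, keep, gap=" ... "):
    """Concatenate the kept words, with original delimiters between neighbors."""
    out, prev = [], None
    for tok, sn, begin, end in rows:
        if sn not in keep:
            continue
        if prev is not None:
            out.append(text[prev[1]:begin] if sn == prev[0] + 1 else gap)
        out.append(tok)
        prev = (sn, end)
    return "".join(out)

def propagate(text, words, radius=1):
    """Return one (string, SYS_CLUE) record per sample found in text."""
    rows = tokenize(text)
    # Serial numbers of every occurrence of each requested word.
    occurrences = [[sn for tok, sn, _, _ in rows if tok.lower() == w.lower()]
                   for w in words]
    records = []
    for clue, sample in enumerate(product(*occurrences), start=1):
        # Keep each found word plus `radius` words on both sides of it.
        keep = {sn + d for sn in sample for d in range(-radius, radius + 1)}
        records.append((render(text, rows, keep), clue))
    return records

s2 = "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."
for record in propagate(s2, ["comes", "next"]):
    print(record)
```

Because the samples are enumerated in a fixed order, repeating the search on the same record gives the same SYS_CLUE values, which is the guarantee stated above.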

Surrounding by tags

To perform different operations on different words (to surround them with different tags), it is enough to allow aliases to be given to the arguments of functions, in particular to the arguments of concatenation. Then, for example, surrounding words from a particular set with the tags <b> and </b>, one word to the left and to the right of them with the tags <em> and </em>, and returning all other words between them without surrounding looks like this

SELECT s1, ("<b>" ||s2.@TOKEN as f1 ||"</b>" ) ||
           ("<em>"||s2.@TOKEN as f2 ||"</em>") ||
           (        s2.@TOKEN as f3          ) 
FROM   s
WHERE  f1 in "comes next"
  AND  f2 IN (
         SELECT DISTINCT ON(s2.@token, s2.@SN) s2.@token
         FROM   s, (
           SELECT s2.@SN as fn
           FROM   s
           WHERE  s2.@TOKEN in "comes next"
                   )
         WHERE  abs(s2.@SN-fn)=1
             )
  AND f3 BETWEEN (
         SELECT MIN(s2.@SN)
         FROM   s
         WHERE  s2.@TOKEN in "comes next"
                 )
      AND (
         SELECT MAX(s2.@SN)
         FROM   s
         WHERE  s2.@TOKEN in "comes next"
          )
  AND f3 NOT IN (
         SELECT DISTINCT ON(s2.@token, s2.@SN) s2.@token
         FROM   s, (
           SELECT s2.@SN as fn
           FROM   s
           WHERE  s2.@TOKEN in "comes next"
                   )
         WHERE  abs(s2.@SN-fn)=1
                );

It returns the following result

**Search with surrounding**
s1	s2	SYS_CLUE
10	<em>dog</em> <b>comes</b>, <em>cat</em> comes home too. Continue in <em>the</em> <b>NEXT</b> <em>issue</em>	1
10	<em>cat</em> <b>comes</b> <em>home</em> too. Continue in <em>the</em> <b>NEXT</b> <em>issue</em>	2
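The effect of the query above can be emulated in a few lines. This is a sketch under assumptions: a sample is one combination of occurrences of the requested words, matching ignores case, and the function name `highlight` is hypothetical. Found words get <b>, their immediate neighbors get <em>, and the words between them are returned unchanged.

```python
import re
from itertools import product

def tokenize(text):
    """Return (token, sn, begin, end) tuples; begin and end are 0-based slice bounds."""
    return [(m.group(), sn, m.start(), m.end())
            for sn, m in enumerate(re.finditer(r"\w+", text), start=1)]

def highlight(text, words):
    """Return one (tagged string, SYS_CLUE) record per sample."""
    rows = tokenize(text)
    occurrences = [[sn for tok, sn, _, _ in rows if tok.lower() == w.lower()]
                   for w in words]
    records = []
    for clue, sample in enumerate(product(*occurrences), start=1):
        if not sample:                      # no words were requested
            continue
        found = set(sample)                                      # plays the role of f1
        near = {sn + d for sn in found for d in (-1, 1)} - found  # plays the role of f2
        first, last = min(found | near), max(found | near)
        out, prev_end = [], None
        for tok, sn, begin, end in rows:
            if not first <= sn <= last:
                continue
            if prev_end is not None:
                out.append(text[prev_end:begin])   # original delimiters
            if sn in found:
                out.append(f"<b>{tok}</b>")
            elif sn in near:
                out.append(f"<em>{tok}</em>")
            else:
                out.append(tok)                    # plays the role of f3
            prev_end = end
        records.append(("".join(out), clue))
    return records

s2 = "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."
for text, clue in highlight(s2, ["comes", "next"]):
    print(clue, text)
```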

Indexing

As a result of indexing, the following sub-fields are appended

@IDTOKEN
@IDFIELD

which are accessed syntactically as

s2.@IDTOKEN
s2.@IDFIELD

Use of lexeme indexing

All grammatical forms of one word can be considered one lexeme. So the following sub-field is appended

@IDLEXEME

which is accessed syntactically as

s2.@IDLEXEME

Usability

Basics of indexed search

The directory of grammatical forms may not be loaded, or may lack some words or some of their forms. Then indexed search over all words (or their forms) is impossible; it works only over the indexed ones. So as soon as an index is built for a text field

not only is the speed of search increased
but the range of words over which the search is performed can also be narrowed [7]

The following are needed

a table of delimiters, 'delimiters', containing the space, tab, carriage return and new line characters and all punctuation marks [8]
factorization of the string into the tables 'tokens' and 'items' [9], bound by the foreign key 'ALTER TABLE items ADD FOREIGN KEY (idtoken) REFERENCES tokens (idtoken)'

**tokens**
idtoken	token	idlexeme
1	in	1
2	the	2
3	morning	3
4	dog	4
5	comes	5
12	come	5
6	cat	6
7	home	7
8	too	8
9	continue	9
10	next	10
11	issue	11

**items**
idfield	pk	idtoken	own name	abbreviation	sn	beginning	end
505	1	1	yes		1	1	2
505	1	1			11	57	58
505	1	2			2	4	6
505	1	2			12	60	62
505	1	3			3	8	14
505	1	4			4	17	19
505	1	5			5	21	25
505	1	5			7	32	36
505	1	6			6	28	30
505	1	7			8	38	41
505	1	8			9	43	45
505	1	9	yes		10	48	55
505	1	10		yes	13	64	67
505	1	11			14	69	73

Indexing then consists of building five indexes

CREATE INDEX i1 ON tokens( idtoken  );
CREATE INDEX i2 ON tokens( token    );
CREATE INDEX i3 ON tokens( idlexeme );

CREATE INDEX i4 ON items( idfield, pk, idtoken );
CREATE INDEX i5 ON items( idfield, pk, sn      );

All these indexes must be removed automatically when any of the tables 'delimiters', 'tokens', 'items' is deleted (without 'delimiters' and 'tokens' it is impossible to build the second table, similar to 'items', for the template of comparison, which is the constant "come next" in our case).
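The factorization into 'tokens' and 'items' and an index similar to i4 can be sketched as follows. The sketch is an illustration only: a regular expression stands in for the 'delimiters' table, the 'own name' and 'abbreviation' flags and the lexeme directory are left out, and the function name `factorize` is hypothetical.

```python
import re

def factorize(text, idfield, pk, tokens=None):
    """Fill 'tokens' (word -> idtoken) and return it with the 'items' rows
    for one value of a text field."""
    tokens = {} if tokens is None else tokens
    items = []
    for sn, m in enumerate(re.finditer(r"\w+", text), start=1):
        word = m.group().lower()
        # A word that is already known keeps its idtoken; a new one gets the next id.
        idtoken = tokens.setdefault(word, len(tokens) + 1)
        items.append({"idfield": idfield, "pk": pk, "idtoken": idtoken,
                      "sn": sn, "beginning": m.start() + 1, "end": m.end()})
    return tokens, items

s2 = "In the morning, dog comes, cat comes home too. Continue in the NEXT issue."
tokens, items = factorize(s2, idfield=505, pk=1)

# Counterpart of 'CREATE INDEX i4 ON items( idfield, pk, idtoken )'.
i4 = {}
for row in items:
    i4.setdefault((row["idfield"], row["pk"], row["idtoken"]), []).append(row["sn"])

print(i4[(505, 1, tokens["comes"])])   # serial numbers of the word "comes"
```

Passing the same `tokens` dictionary to further calls lets many fields of many tables share one vocabulary, with 'idfield' telling their items apart.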

Building and applying the index

So that indexing is possible without a directory of lexemes, let's introduce a command (separate from the command that fills table 'items')

TOKENIZE s(s2) INTO tokens DELIMITING delimiters [, delimiters2];

which leaves field 'idlexeme' unfilled. To load the directory of lexemes we use the command that fills a table from a file (field 'idtoken' is filled from its own sequence)

COPY tokens( idlexeme, token ) FROM 'c:/lexeme.txt';

We factorize field 's2' of all records with the command

ITEMIZE s(s2) INTO items DELIMITING delimiters [, delimiters2] TOKENIZING tokens;

Operations such as '=' and IN that work with text fields, and with 's2' in particular, use indexes built not for 's2' itself but for the tables specified in parameter NOMENCLARURE [10], and for the tables to which the tables from NOMENCLARURE refer through the foreign key mentioned above

SET NOMENCLARURE items [, items2];
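How a comparison might be resolved through 'tokens' and 'items' rather than through 's2' itself can be sketched as follows. The data are copied from the two tables above (with 'items' reduced to 'idtoken' and 'sn'); the lookup function `find_by_lexeme` is hypothetical and only illustrates the idea that all forms of one lexeme lead to the same occurrences.

```python
# token -> (idtoken, idlexeme), as in the 'tokens' table
tokens = {
    "in": (1, 1), "the": (2, 2), "morning": (3, 3), "dog": (4, 4),
    "comes": (5, 5), "come": (12, 5), "cat": (6, 6), "home": (7, 7),
    "too": (8, 8), "continue": (9, 9), "next": (10, 10), "issue": (11, 11),
}

# (idtoken, sn) pairs, as in the 'items' table
items = [(1, 1), (1, 11), (2, 2), (2, 12), (3, 3), (4, 4), (5, 5),
         (5, 7), (6, 6), (7, 8), (8, 9), (9, 10), (10, 13), (11, 14)]

def find_by_lexeme(word):
    """Return the serial numbers of every occurrence of any form of word."""
    entry = tokens.get(word.lower())
    if entry is None:
        return []                      # the word is not indexed
    _, idlexeme = entry
    # All grammatical forms that share the lexeme of the requested word.
    forms = {idtoken for idtoken, lex in tokens.values() if lex == idlexeme}
    return sorted(sn for idtoken, sn in items if idtoken in forms)

print(find_by_lexeme("come"))   # occurrences of "comes" are found through the lexeme
print(find_by_lexeme("bird"))   # a word outside the index gives nothing
```

A word missing from 'tokens' returns nothing, which illustrates how an index narrows the range of searchable words.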

[1] I.e. there is no need to write 'ORDER BY s2.@SN'

[2] Extracting field s2.@SN returns a string consisting of the serial numbers of the found words instead of the words themselves; s2.@BEGINNING returns the offsets of the first letters of the words, and s2.@END the offsets of the last letters

[3] The symbols specified in OMITTED_FIRST and OMITTED_LAST are appended to the beginning and/or the end of the found string if initial/terminal words had to be dropped to obtain the string

[4] "Always" means even if all propagated groups consist of one record and field SYS_CLUE is not mentioned in the query. Field SYS_CLUE can contain identical values in different groups. The client program needs the value of the field to tell the server which particular sample of a group the user has chosen. If there is no primary key, it is impossible to distinguish groups

[5] It is possible to specify a permutation of the words of a particular set (details about the permutation operator '=~' are on p.183-186 of the pdf-document)

WHERE s2.@TOKEN =~ "come next"

and also to restrict the number of permutations (results are always given starting from the smallest number of permutations, in order of increasing number)

WHERE s2.@TOKEN TO "come next" PERMUTATIONS <=2

[6] Field SYS_CLUE can contain identical values in the Cartesian products of different pairs of groups

[7] The quantifier ALL before the name of a sub-field forces a non-indexed search over all words

SELECT s1, ALL s2.@TOKEN
FROM   s;

[8]

CREATE SEQUENCE delimiters_seq;
CREATE TABLE delimiters (
  iddelimiter  integer DEFAULT nextval('delimiters_seq'),
  delimiter    string
);

[9] 'idfield' is the unique system identifier of field 's2' itself. It is filled by the command ITEMIZE, so that the command 'SELECT ... FROM items' can search in many fields of many tables at once

[10] NOMENCLARURE is a session parameter

P.S.

The article clarifies p.191-197 of the pdf-document.

Dima Turin, dmitryturin@yandex.ru
