Postgres Full-Text Search (FTS) is a powerful tool for searching through textual data. However, when this data includes a significant number of integers like product IDs, serial numbers, or document codes, default FTS behavior can sometimes lead to inefficient indexes and slower search performance.
The dict_int
extension is designed to address this issue by providing a specialized dictionary template that optimizes how integers are tokenized and indexed. This can lead to more compact indexes and faster, more relevant searches.
Imagine searching a vast product catalog for "ID 1234567890". Without dict_int
, FTS might break this number down in various ways, or index the entire long string, potentially creating many unique terms that aren't always useful for searching and can bloat the index. dict_int
allows you to define rules for how these numbers are processed, ensuring they are handled efficiently.
Try it on Neon!
Neon is Serverless Postgres built for the cloud. Explore Postgres features and functions in our user-friendly SQL editor. Sign up for a free account to get started.
dict_int
extension
Enable the You can enable the extension by running the following CREATE EXTENSION
statement in the Neon SQL Editor or from a client such as psql that is connected to your Neon database.
CREATE EXTENSION IF NOT EXISTS dict_int;
Version availability:
Please refer to the list of all extensions available in Neon for up-to-date extension version information.
dict_int
Understanding The dict_int
dictionary template offers precise control over how integer strings are tokenized for full-text search. This is achieved through three key parameters that you can tailor when creating a dictionary to optimize index size and ensure that searches for numbers are both efficient and relevant to your needs. These settings dictate how numbers are processed before they even make it into the search index.
maxlen
This parameter sets the maximum number of digits an integer token is allowed to have. Its default value when creating a dictionary is 6
.
- When an integer token processed by the dictionary has more digits than
maxlen
, it will either be shortened or discarded entirely, depending on therejectlong
setting. - The primary purpose of
maxlen
is to help prevent extremely long, and often less significant, numerical sequences from consuming valuable index space and potentially slowing down searches.
For example, if maxlen
is 5
in your custom dictionary and it encounters the number 1234567
(and rejectlong
is false
), it will be processed as 12345
.
rejectlong
This parameter determines what happens to an integer token that exceeds the number of digits specified by maxlen
. The default for rejectlong
is false
.
- If
rejectlong
isfalse
, integers longer thanmaxlen
are truncated. This means only their initialmaxlen
digits are kept and indexed. - If
rejectlong
is set totrue
, these overlength integers are instead treated as "stop words." This effectively means they are entirely discarded by the dictionary, will not be indexed, and therefore cannot be found via full-text search.
Choosing rejectlong = true
can be beneficial if very long numbers in your dataset are generally considered noise or are irrelevant to your typical search use cases, as it helps keep the index leaner. Conversely, if the leading portion of a long number is still important for searching, keeping rejectlong = false
is the appropriate choice.
absval
The absval
parameter controls the handling of leading positive (+
) or negative (-
) signs in integer tokens. By default, absval
is false
.
- When
absval
isfalse
, any leading signs are typically preserved as part of the token. - If
absval
is set totrue
, any leading+
or-
signs are stripped from the integer before themaxlen
logic is applied. For example, ifabsval
istrue
in your custom dictionary, both-12345
and+12345
would be normalized to12345
.
This feature is very useful when the sign of a number isn't relevant for your search criteria, allowing, for instance, a search for ID 789
to successfully match entries like ID: -789
or REF: +789
without needing to account for the sign explicitly in the search query.
dict_int
Using The dict_int
extension provides a template for creating custom integer dictionaries. This allows you to define how integers are processed during full-text search indexing and querying. A default dictionary named intdict
is provided, which has default parameters set to maxlen = 6
, rejectlong = false
, and absval = false
.
Modifying the default intdict dictionary on Neon
The default intdict
dictionary is owned by superuser. On Neon, you do not have permissions to directly ALTER
this default dictionary, which can result in an "ERROR: must be owner of text search dictionary intdict".
The recommended approach is to create your own custom dictionary from the intdict_template
. This gives you full control over its parameters
Creating and configuring a custom integer dictionary
You can create new dictionaries from the intdict_template
and specify your desired parameters.
Example:
Create dictionary named my_custom_intdict
with maxlen
set to 4
, rejectlong
to true
, and absval
to true
:
CREATE TEXT SEARCH DICTIONARY my_custom_intdict (
TEMPLATE = intdict_template,
MAXLEN = 4,
REJECTLONG = true,
ABSVAL = true
);
If you need to change its parameters later, you can ALTER
the dictionary:
ALTER TEXT SEARCH DICTIONARY my_custom_intdict (
MAXLEN = 3,
REJECTLONG = false,
ABSVAL = false
);
ts_lexize
Utilizing with The ts_lexize
function is used for testing how a dictionary processes input tokens. It shows what lexemes (if any) are produced.
To test the behavior of custom dictionary, use ts_lexize
with the dictionary name and an integer string.
CREATE TEXT SEARCH DICTIONARY intdict_for_testing (
TEMPLATE = intdict_template,
MAXLEN = 3,
REJECTLONG = false,
ABSVAL = true
);
Now, test this dictionary with various integer inputs:
SELECT ts_lexize('intdict_for_testing', '123'); -- Result: {123}
SELECT ts_lexize('intdict_for_testing', '12345'); -- Result: {123} (truncated)
SELECT ts_lexize('intdict_for_testing', '-98765'); -- Result: {987} (absval applied)
SELECT ts_lexize('intdict_for_testing', '+12'); -- Result: {12} (absval applied)
Test with rejectlong
set to true
:
ALTER TEXT SEARCH DICTIONARY intdict_for_testing (
REJECTLONG = true
);
SELECT ts_lexize('intdict_for_testing', '1234567'); -- Result: {} (empty, rejected)
SELECT ts_lexize('intdict_for_testing', '987'); -- Result: {987} (within limit)
dict_int
into a text search configuration
Integrating For a custom integer dictionary to be used during indexing and searching, it must be associated with specific token types in a text search configuration.
Modifying default text search configurations on Neon
Altering default text search configurations (like english
) requires superuser privileges on Neon. If you encounter an "ERROR: must be owner of text search configuration english", you will need to first create a copy of an existing configuration (e.g., english
) and then modify your own copy.
Here's the recommended approach:
-
Create your custom integer dictionary (if you haven't already):
CREATE TEXT SEARCH DICTIONARY my_custom_intdict ( TEMPLATE = intdict_template, MAXLEN = 8, -- Example: Max 8 digits REJECTLONG = false, -- Example: Truncate long numbers ABSVAL = true -- Example: Ignore signs );
-
Create a copy of an existing text search configuration (e.g.,
english
):CREATE TEXT SEARCH CONFIGURATION public.my_app_search_config (COPY = pg_catalog.english);
The above sql creates a new configuration named
my_app_search_config
that inherits the settings of theenglish
configuration. -
Alter the copied configuration to use custom dictionary for integer token types (
int
anduint
):ALTER TEXT SEARCH CONFIGURATION public.my_app_search_config ALTER MAPPING FOR int, uint WITH my_custom_intdict;
Now, public.my_app_search_config
is set up to use my_custom_intdict
for processing integers. public.my_app_search_config
can now be used in to_tsvector
and to_tsquery
functions to process integer tokens according to the rules defined in my_custom_intdict
.
Example
Let's consider a scenario where we have a documents
table with a version_code
field stored as text. These codes can be like "v1", "V2.0", "Rev 003", or purely numeric like "1001", "005". We want to full-text search these, focusing on the numeric parts using a custom integer dictionary and a custom text search configuration.
Sample table and data
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT,
version_code TEXT
);
INSERT INTO documents (title, content, version_code) VALUES
('Intro Guide', 'Content of version 1...', '1'),
('Advanced Manual', 'More content...', '0042'),
('Internal Spec', 'Spec details...', '7654321'),
('Internal Spec v2', 'Updated spec...', '+7654321'),
('Draft Notes', 'Preliminary ideas...', 'ver003');
Create custom dictionary and text search configuration
-
Create a custom integer dictionary,
doc_version_intdict
, withmaxlen
set to 4,rejectlong
totrue
, andabsval
totrue
.CREATE TEXT SEARCH DICTIONARY doc_version_intdict ( TEMPLATE = intdict_template, MAXLEN = 4, REJECTLONG = true, ABSVAL = true );
-
Create a copy of the
english
text search configuration, naming itdoc_search_config
.CREATE TEXT SEARCH CONFIGURATION public.doc_search_config (COPY = pg_catalog.english);
-
Alter
doc_search_config
to usedoc_version_intdict
for integers.ALTER TEXT SEARCH CONFIGURATION public.doc_search_config ALTER MAPPING FOR int, uint WITH doc_version_intdict;
tsvector
column and index data
Add Index the version_code
column using the custom configuration. First, add a tsvector
column to the documents
table:
ALTER TABLE documents ADD COLUMN version_tsv TSVECTOR;
UPDATE documents
SET version_tsv = to_tsvector('public.doc_search_config', version_code); -- Use custom config
Examine the indexed tokens
To see how the version_code
values are indexed, you can query the documents
table:
SELECT id, title, version_code, version_tsv FROM documents;
id | title | version_code | version_tsv
----+------------------+--------------+-------------
1 | Intro Guide | 1 | '1':1
2 | Advanced Manual | 0042 | '0042':1
3 | Internal Spec | 7654321 |
4 | Internal Spec v2 | +7654321 |
5 | Draft Notes | ver003 | 'ver003':1
(5 rows)
In this example:
- The
version_code
"1" is indexed as'1':1
. - The
version_code
"0042" is indexed as'0042':1
. - The long version code "7654321" and "+7654321" are not indexed at all due to
maxlen
andrejectlong
. - The version code "ver003" is indexed as
'ver003':1
because it doesn't exceed themaxlen
and is not purely numeric.
Searching
Using the custom configuration, you can now search for specific version codes:
-- Find documents with version code '0042'
SELECT title, version_code FROM documents
WHERE version_tsv @@ to_tsquery('public.doc_search_config', '0042');
-- (Advanced Manual, '0042')
-- Try to find the long version code
SELECT title, version_code FROM documents
WHERE version_tsv @@ to_tsquery('public.doc_search_config', '7654321');
-- null (Expected: No results, as it was rejected)
Limitations
- Integer-specific:
dict_int
is designed for whole numbers (integers). It does not process floating-point numbers (e.g.,3.14159
). Standard FTS tokenizers will handle floating-point numbers, butdict_int
's logic won't apply to them. - Text representation: It operates on the textual representation of numbers as tokenized by the FTS parser. If your column is of type
INTEGER
and you cast it toTEXT
forto_tsvector
,dict_int
will then process that text.
Conclusion
The dict_int
dictionary template is a valuable tool in Postgres for fine-tuning how integer values are handled in Full-Text Search. By customizing the way integers are indexed, you can achieve several benefits:
- Reduced index size: Custom integer dictionaries help prevent the proliferation of unique numeric lexemes by truncating or rejecting overly long numbers and normalizing signed ones. This keeps your FTS indexes smaller and more manageable.
- Improved search performance: As a general rule, smaller, more optimized indexes lead to faster search query execution.
- More relevant search results: By tailoring how numbers are processed, you can ensure that searches for numeric data are more aligned with user expectations and less susceptible to noise from irrelevant number formats.
Resources
- PostgreSQL
dict_int
documentation - Dictionary Testing with
ts_lexize
- PostgreSQL Full Text Search
- Full Text Search using tsvector with Neon Postgres
- Postgres tsvector data type
Need help?
Join our Discord Server to ask questions or see what others are doing with Neon. Users on paid plans can open a support ticket from the console. For more details, see Getting Support.