I have a table with millions of text documents in a Postgres 14.7 DB. When I try to generate a GIN index I get this error:
create index idx__txt_document__doc_text__gin on txt_document using gin(to_tsvector('simple', doc_text));
ERROR: string is too long for tsvector (4040510 bytes, max 1048575 bytes)
Time: 6221.251 ms (00:06.221)
Is there a way to find out the offending text id?
I get the same error when I create a temp table containing the text id and the length of each document's tsvector:
create temp table tmp_foo as
select id, length(to_tsvector('simple', doc_text))
from txt_document;
ERROR: string is too long for tsvector (4040510 bytes, max 1048575 bytes)
CONTEXT: parallel worker
Time: 1912.090 ms (00:01.912)
Any idea how to get the text that creates the error?
I found the largest document (about 7 MB), but creating the tsvector for it works just fine:
select length(to_tsvector('simple', doc_text)) from txt_document where id = ID_LARGEST_TEXT;
2 Answers
You can loop through your table, catch the exception, and RAISE a notice (or whatever), using a DO command to execute a PL/pgSQL code block.
I got the error code by provoking the same error. Postgres error messages by default add a line with the SQLSTATE error code; your client seems to suppress it, or you did not include it in the question.
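A sketch of such a loop, using your table and column names; it assumes the error raises SQLSTATE 54000 (the condition program_limit_exceeded) — catch OTHERS instead if unsure:

```sql
DO
$do$
DECLARE
   _id int;  -- assuming an integer PK; adjust to your id type
BEGIN
   FOR _id IN
      SELECT id FROM txt_document
   LOOP
      BEGIN
         -- try to build the tsvector for this one row
         PERFORM to_tsvector('simple', doc_text)
         FROM   txt_document
         WHERE  id = _id;
      EXCEPTION WHEN program_limit_exceeded THEN  -- SQLSTATE 54000 (assumption)
         RAISE NOTICE 'tsvector too long for id %', _id;
      END;
   END LOOP;
END
$do$;
```

Processing one row at a time is slow for millions of rows, but the per-row BEGIN ... EXCEPTION block is what pinpoints the offending id.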
See the manual about Postgres error codes, and about the EXCEPTION clause.
Note that duplicate lexemes in input strings are stored "compressed" in a tsvector: each lexeme is stored once, with a list of its positions. Hence, the longest string does not necessarily produce the longest tsvector. Consider this demo: fiddle
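A demo along these lines (not necessarily the linked fiddle): a long but repetitive string collapses to a single lexeme, while a shorter string of distinct words produces a much bigger tsvector:

```sql
-- Repeated words are stored as one lexeme (position lists are capped),
-- while each distinct word needs its own entry:
SELECT pg_column_size(to_tsvector('simple', repeat('aaa ', 100000)))   AS size_repetitive
     , pg_column_size(to_tsvector('simple',
          (SELECT string_agg(g::text, ' ')
           FROM   generate_series(1, 100000) g)))                      AS size_distinct;
```

Expect size_repetitive to be tiny compared to size_distinct, even though the repetitive input string is not much shorter. That's why your 7 MB document can be harmless while a smaller one breaks the 1 MB tsvector limit.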
It is a bit annoying, but you could create a wrapper function which catches the error and converts it to NULL (or some other sentinel value), then select the rows where the wrapper comes up NULL.
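A sketch of such a wrapper (the name try_to_tsvector is made up, and it again assumes the condition program_limit_exceeded / SQLSTATE 54000):

```sql
CREATE OR REPLACE FUNCTION try_to_tsvector(_config regconfig, _doc text)
  RETURNS tsvector
  LANGUAGE plpgsql IMMUTABLE AS
$func$
BEGIN
   RETURN to_tsvector(_config, _doc);
EXCEPTION WHEN program_limit_exceeded THEN  -- SQLSTATE 54000 (assumption)
   RETURN NULL;  -- sentinel for "too long for tsvector"
END
$func$;
```

Then:

```sql
SELECT id
FROM   txt_document
WHERE  try_to_tsvector('simple', doc_text) IS NULL
AND    doc_text IS NOT NULL;  -- a NULL document would also yield NULL
```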