I downloaded the titles from:
https://dumps.wikimedia.org/enwiki/20240420/enwiki-20240420-all-titles-in-ns0.gz
And extracted they are a newline delimited text file.
In postgresql I created a table to store the data:
create table titles (
title varchar(255) primary key
);
And I tried to import it using text or csv format:
thisdb=# COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (format 'text');
ERROR: duplicate key value violates unique constraint "titles_pkey"
DETAIL: Key (title)=(//Xekwi_language) already exists.
CONTEXT: COPY titles, line 23830
But when I did a less search of the file only that one line had that record, which is odd. Maybe this was an escaping issue?
Somehow it is seeing the lines:
//Xekwi_language
//Xekwi_language
As the same entry.
When I insert them individually there is no collision, but when I do the format text copy it sees it as a collision… why?
or csv:
thisdb=# COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (format 'csv');
ERROR: extra data after last expected column
CONTEXT: COPY titles, line 45: "!Alfaro_Vive,_Carajo!"
Does anyone know how to do this?
2
Answers
Titles are 255 characters max so the field size is not an issue. Somehow this ingestion issue is being caused by escaping that postgres is running on the inputs. Looking at the definition of titles, we see that # < and > are unused characters in titles.
I was able to get this working with csv ingestion and using the unused characters as delimiters, escaping, and quote characters:
I got it working using a different delimiter (tab) and an escape for the backslash :