How can I import English Wikipedia page titles into a PostgreSQL table?

spacether
April 24, 2024

I downloaded the titles from:
https://dumps.wikimedia.org/enwiki/20240420/enwiki-20240420-all-titles-in-ns0.gz

Once extracted, the file is newline-delimited text with one title per line.
In PostgreSQL I created a table to store the data:

```
create table titles (
  title varchar(255) primary key
);
```

And I tried to import it using text or csv format:

```
thisdb=# COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (format 'text');
ERROR:  duplicate key value violates unique constraint "titles_pkey"
DETAIL:  Key (title)=(//Xekwi_language) already exists.
CONTEXT:  COPY titles, line 23830
```

But when I searched the file with less, only one line contained that title, which is odd. Maybe this is an escaping issue?
Somehow it is seeing the lines:

//Xekwi_language
//Xekwi_language

as the same entry.
When I insert them individually there is no collision, but the text-format COPY reports one. Why?

With the csv format I get a different error:

```
thisdb=# COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (format 'csv');
ERROR:  extra data after last expected column
CONTEXT:  COPY titles, line 45: "!Alfaro_Vive,_Carajo!"
```

Does anyone know how to do this?

Tags: postgresql wikipedia

Answers

Chosen as BEST ANSWER
- spacether
- April 24, 2024 at 12:03 am
Titles are at most 255 characters, so the field size is not the issue. The ingestion problem appears to be caused by the escaping that Postgres applies to the input. According to the definition of Wikipedia page titles, the characters #, < and > are never used in titles.

I was able to get this working with csv ingestion by using those unused characters as the delimiter, escape, and quote characters:
```
COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (
    FORMAT 'csv',
    DELIMITER '<',
    ESCAPE '>',
    QUOTE '#'
);
```
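
To illustrate the escaping at play: in COPY's text format a backslash starts an escape sequence, and a backslash followed by an ordinary character is read as just that character. Two lines that differ only by a backslash can therefore load as the same key. The Python sketch below is a simplified model of that one rule (it does not handle tab, newline, octal, or hex sequences), plus a hypothetical helper that doubles backslashes so a pre-processed file could be loaded with `format 'text'`. The function names and sample titles are made up for illustration and are not taken from the dump.

```python
def copy_text_unescape(line: str) -> str:
    """Simplified model of COPY text-format de-escaping.

    Only models the rule that a backslash followed by an ordinary
    character yields that character; special sequences (tab, newline,
    octal, hex) are deliberately not handled here.
    """
    out = []
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            out.append(line[i + 1])  # drop the backslash, keep the next char
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)


def escape_for_copy_text(title: str) -> str:
    """Double every backslash so COPY (format 'text') reads it literally."""
    return title.replace("\\", "\\\\")


# Two distinct, hypothetical titles that differ only by a backslash.
raw_a = "A/B"
raw_b = "A\\/B"  # the characters: A \ / B

# Loaded as-is, both collapse to the same value, hence a duplicate key.
print(copy_text_unescape(raw_a) == copy_text_unescape(raw_b))

# After escaping, the title round-trips to itself and stays distinct.
print(copy_text_unescape(escape_for_copy_text(raw_b)) == raw_b)
```

Titles contain no tabs or newlines, so under this assumption the backslash is the only character that would need escaping before a text-format load.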


- FrankHeikens
- April 24, 2024 at 6:29 am
I got it working using a different delimiter (tab) and the backslash as the escape character:
```
COPY titles(title) FROM '/Users/frank/enwiki-20240420-all-titles-in-ns0' 
WITH (
  FORMAT csv
, DELIMITER E'\t'
, ESCAPE '\'
, HEADER TRUE
, ENCODING 'utf8');
```
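
As a rough illustration of why changing the delimiter matters, the sketch below feeds the title from the csv error message in the question to Python's `csv` module. This module only stands in for a generic CSV parser; it is not Postgres' implementation, and its quoting rules differ in detail. It shows that a comma delimiter splits the title into two fields (the "extra data after last expected column" situation), while a tab delimiter leaves it as one.

```python
import csv
import io

# Title taken from the csv error message in the question.
sample = "!Alfaro_Vive,_Carajo!\n"

# Default comma delimiter: the single title is split into two fields.
comma_fields = next(csv.reader(io.StringIO(sample)))
print(len(comma_fields), comma_fields)

# Tab delimiter: no tab occurs in the title, so it stays one field.
tab_fields = next(csv.reader(io.StringIO(sample), delimiter="\t"))
print(len(tab_fields), tab_fields)
```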