I am new into databases and PostgreSQL and today I wanted to import this dataset: https://www.kaggle.com/datasets/wilmerarltstrmberg/recipe-dataset-over-2m/data into a PostgreSQL so I can make a recipe finder app.
I started by creating a new database, then creating a table and then creating the exact columns from the dataset:
create table recipes_table (
title text,
ingredients text,
directions text,
link text,
source text,
NER text,
site text
);
Then I proceeded to import all the data from my .csv (by rightclicking on the table and selecting import). I left everything on default and started importing but after 10 seconds I always get errors.
Unfortunately I lost track of every modification I did to my code and table (there were a lot), but I’ll leave the results I had from the processes tab:
--command " "\copy public.recipes_table (title, ingredients, directions, link, source, "NER", site) FROM 'C:/Users/silvi/Desktop/WEBPRO~1/archive/RECIPE~1.CSV' DELIMITER ',' CSV HEADER QUOTE '"' ESCAPE '''';""
ERROR: unquoted newline found in data HINT: Use quoted CSV field to represent newline. CONTEXT: COPY recipes_table, line 1274879
– first iteration;
I opened my terminal in VS Code and did this to solve it: sed -i ‘s/r//g’ recipes_data.csv
And then, with this command:
--command " "\copy public.recipes_table (title, ingredients, directions, link, source, "NER", site) FROM 'C:/Users/silvi/Desktop/WEBPRO~1/archive/RECIPE~1.CSV' DELIMITER ',' CSV HEADER QUOTE '"' ESCAPE '''';""
I got this error
ERROR: extra data after last expected column CONTEXT: COPY recipes_table, line 1274879: "Spaghetti Sauce,"[""1-2 lb. hamburger"", ""1-2 lb. Italian Sausage"", ""2 clove garlic"", ""salt"", ..."
Or this similar one:
ERROR: extra data after last expected column CONTEXT: COPY recipes_table, line 2: "No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. evaporated milk"", ""1/2 tsp. van..."
Now I’m as confused as I can get. How can I import this table in my pgAdmin app so I can visualize it and use it in my project?
And even more, I don’t need every column from this .csv so how can I get for example only the title, ingredients, directions and NER?
Thank you all for your help!
2
Answers
Looking at your error code it shows
ESCAPE ''''
and I suspect that is causing the issue (including the one you band-aided with thesed
command).The first two lines of this file are:
The
ingredients
column is encapsulated in double quotes"
characters, HOWEVER there are double quote characters also inside the data, so they have escaped those double quotes with a second double quote.A simple example of what this looks like: Imagine your data is
Hi, they call me "Joe"
, in this data that would be encapsulated with double quotes, and the already existing double quotes would be doubled to indicate that they are part of the data, so:Therefore your
ESCAPE
character is"
not'
.What I would do is download the file once again so you have a fresh copy that has been tampered with by
sed
and the set everything up the same, but change theESCAPE
character to"
and rerun the import. This will change that command line run to look likeESCAPE '"'
You may find it helpful to read through the postgres
COPY
command documentation to understand all of the options. Even if you are using a UI to set this up, thatCOPY
command is being ran under the hood: https://www.postgresql.org/docs/current/sql-copy.htmlLastly, the
ingredient
,directions
andNER
columns look to be JSON arrays, so you may want to think about how that affects your schema and app and treat them appropriately (not as text).Then:
PS: I downloaded and tested the same:
2,231,142 rows affected in 26 s 752 ms