I am trying to store a large number of news articles (eventually maybe more than several thousand) that I am using in a Python script. For convenience, I would like to store them in a 2D array within a text file, like [[ID, Title, Article], [1, "Bill's Burgers", "The owner, Bill, makes good burgers."]]. However, the solutions I find online require some character such as a comma, space, newline, etc. to delimit the entries. As these commonly appear in news articles, I can't use them to delimit elements.
I tried to format my 2D array using json, but found this didn't do anything to my array. When printing/opening the txt file, it appears exactly as when I declare it: [["ID", "URL", "Title", "Date", "Article"], ["1", "2", "3", "4", "5"]]. My code is as follows:
import json

scraped_articles_array_headings = [["ID", "URL", "Title", "Date", "Article"], ["1", "2", "3", "4", "5"]]
headings_encoded = json.dumps(scraped_articles_array_headings)
print(headings_encoded)
f = open("articles_encoded2.txt", "w", encoding="utf-8")
f.write(headings_encoded)
f.close()
How am I misusing json here? I would welcome any suggestions regarding a suitable approach to storing this data. Ideally, I just want a system that allows for easy searching of the contents of each parameter (ID, Title, etc.), and I appreciate the above approach might not be following a sensible path to achieve this.
2 Answers
I think you may want to look at an example of a JSON-formatted document.
Keep in mind that this is different from the CSV format, which is line-based and uses a delimiter, as you've mentioned.
To store those values in JSON you may want to do something like this:
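A minimal sketch of that idea: one object per article, keyed by field name, using the fields from the question (the URL and date values are placeholders for illustration):

import json

articles = [
    {
        "ID": "1",
        "URL": "https://example.com/bills-burgers",
        "Title": "Bill's Burgers",
        "Date": "2024-01-01",
        "Article": "The owner, Bill, makes good burgers.",
    },
]

# json.dump writes the whole list as one JSON document; quoting and
# escaping are handled for you, so commas and newlines in the article
# text are not a problem.
with open("articles.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False, indent=2)

# json.load gives the same list of dicts back, so you can search a
# single field, e.g. [a for a in data if "burgers" in a["Article"]].
with open("articles.json", encoding="utf-8") as f:
    data = json.load(f)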
For storing large amounts of data you may want to consider a database, which improves performance and also allows for querying (e.g. finding all articles from a certain URL).
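For example, here is a minimal sketch using Python's built-in sqlite3 module (the database file name, table name and column layout are assumptions based on the fields in the question):

import sqlite3

conn = sqlite3.connect("articles.db")
# One row per article, one column per field.
conn.execute(
    "CREATE TABLE IF NOT EXISTS articles "
    "(id INTEGER PRIMARY KEY, url TEXT, title TEXT, date TEXT, article TEXT)"
)
conn.execute(
    "INSERT INTO articles (url, title, date, article) VALUES (?, ?, ?, ?)",
    ("https://example.com/bills-burgers", "Bill's Burgers", "2024-01-01",
     "The owner, Bill, makes good burgers."),
)
conn.commit()

# Querying a single field, e.g. all articles from a certain URL.
for row in conn.execute("SELECT id, title FROM articles WHERE url = ?",
                        ("https://example.com/bills-burgers",)):
    print(row)
conn.close()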
You can have articles that contain "comma, space, newline, etc" and still use common formats that use them for delimiters. An example using the Python csv module:
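This is a minimal sketch reusing the sample data from the question (the URL and date values are placeholders for illustration); the resulting file and the read-back output shown below follow from it:

import csv

rows = [
    ["ID", "URL", "Title", "Date", "Article"],
    ["1", "https://example.com/bills-burgers", "Bill's Burgers",
     "2024-01-01", "The owner, Bill, makes good burgers."],
]

# The csv module quotes any field that contains the delimiter, so the
# commas inside the article text are preserved.
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read the data back to show that the fields survive intact.
with open("articles.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)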
Resulting articles.csv:
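ID,URL,Title,Date,Article
1,https://example.com/bills-burgers,Bill's Burgers,2024-01-01,"The owner, Bill, makes good burgers."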
Output after reading data back:
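['ID', 'URL', 'Title', 'Date', 'Article']
['1', 'https://example.com/bills-burgers', "Bill's Burgers", '2024-01-01', 'The owner, Bill, makes good burgers.']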