skip to Main Content

I am trying to store a large number of news articles (eventually maybe more than several thousand) that I am using in a Python script. For convenience, I would like to store them in a 2D array within a text file, like [[ID, Title, Article],[1, ‘Bill’s Burgers’, ‘The owner, Bill, makes good burgers."]]. However, the solutions I find online require some character such as comma, space, newline, etc to delimited the entries. As these commonly appear in news articles, I can’t use them to delimited elements.

I tried to format my 2D array using json, but found this didn’t do anything to my array. When printing/opening the txt file, it appears exactly as when I declare it – "[[[["ID", "URL", "Title", "Date", "Article"],["1","2","3","4","5"]]". My code is as follows:

scraped_articles_array_headings = [["ID", "URL", "Title", "Date", "Article"],["1","2","3","4","5"]]
headings_encoded = json.dumps(scraped_articles_array_headings)
print(headings_encoded)
f = open("articles_encoded2.txt", "w", encoding="utf-8")
f.write(headings_encoded)
f.close()

How am I mis-using json here? I would welcome any suggestions regarding a suitable approach to storing this data – Ideally I just want a system to allow for easy searching of the contents of each parameter (ID, Title, etc) and I appreciate the above approach might not be following a sensible path to achieve this.

2

Answers


  1. I think you may want to look into an example JSON formatted document.
    Keep in mind that this is different from a CSV format which is line based and uses a delimiter as you’ve mentioned.

    To store those values in JSON you may want to do something like this:

    [
        {
            "id": 1,
            "url": "https://example.com",
            "title": "Example Title",
            "date": "21-03-2003",
            "article": "Some article content..."
        },
        {
            "id": 2,
            "url": "https://example.com",
            "title": "Example Title",
            "date": "21-03-2003",
            "article": "Some article content..."
        }
    ]
    

    For storing large amounts of data you may want to consider a database to improve performance and also allows for querying (eg. finding all articles from a certain URL).

    Login or Signup to reply.
  2. You can have articles that contain "comma, space, newline, etc" and still use common formats that use them for delimiters. An example using the Python csv module:

    import csv
    
    scraped_articles = [['ID', 'URL', 'Title', 'Date', 'Article'],
                        ['1','2','3','4','article1 withn "quotes" and commas(,) and newlines'],
                        ['1','2','3','4','article2 withn "quotes" and commas(,) and newlines']]
    
    with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(scraped_articles)
    
    with open('articles.csv', 'r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for line in reader:
            print('-'*80)
            print(line['Article'])
        print('-'*80)
    

    Resulting articles.csv:

    ID,URL,Title,Date,Article
    1,2,3,4,"article1 with
     ""quotes"" and commas(,) and newlines"
    1,2,3,4,"article2 with
     ""quotes"" and commas(,) and newlines"
    

    Output after reading data back:

    --------------------------------------------------------------------------------
    article1 with
     "quotes" and commas(,) and newlines
    --------------------------------------------------------------------------------
    article2 with
     "quotes" and commas(,) and newlines
    --------------------------------------------------------------------------------
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search