I’m using app-store-scraper (https://github.com/facundoolano/app-store-scraper/) to pull reviews, which I then read using Python. The problem is that the output of the scraper isn’t quite valid JSON, so the data needs some fixing.
Here is the Node.js code that produces the output:
var store = require('app-store-scraper');
store.reviews({
    id: 300238550,
    sort: store.sort.HELPFUL,
    page: 1
})
    .then(console.log)
    .catch(console.log);
Here’s a sample record from the output:
{
    id: '123456789',
    userName: 'user123',
    userUrl: 'https://itunes.apple.com/us/reviews/id1234567',
    version: '150.71.0',
    score: 1,
    title: 'Difficulties and errors signing up',
    text: 'The site asks you to put all your information and to link all your bank accounts and financial accounts accounts using your passwords. \n' +
        'Multiple times reports errors. "Sorry, our site is not working at this time". For me, a waist of my time.',
    url: 'https://itunes.apple.com/us/review?id=11111111111'
},
Note that the "text" value is printed across two separate lines in the output file.
I’m using pandas.read_json, which raises a lot of errors when trying to read the files as-is. I think the data should be in the following format:
{
    "id": "123456789",
    "userName": "user123",
    "userUrl": "https://itunes.apple.com/us/reviews/id1234567",
    "version": "150.71.0",
    "score": 1,
    "title": "Difficulties and errors signing up",
    "text": "The site asks you to put all your information and to link all your bank accounts and financial accounts accounts using your passwords. Multiple times reports errors. 'Sorry, our site is not working at this time'. For me, a waist of my time.",
    "url": "https://itunes.apple.com/us/review?id=11111111111"
},
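As a quick sanity check of that target format, here is a minimal sketch using Python's standard json module. The shortened record is illustrative only. It suggests the format parses once the records are wrapped in an array, while a stray trailing comma after a record is still rejected:

```python
import json

# Shortened, illustrative version of the target record, wrapped in an array.
target = """[
  {
    "id": "123456789",
    "score": 1,
    "text": "Multiple times reports errors. 'Sorry, our site is not working at this time'."
  }
]"""
records = json.loads(target)
print(records[0]["score"])

# A record followed by a dangling comma is not valid JSON on its own.
try:
    json.loads('{"id": "1"},')
    trailing_comma_ok = True
except json.JSONDecodeError as err:
    trailing_comma_ok = False
    print("rejected:", err.msg)
```

So besides the quoting, the final file needs its records enclosed in [ ] with commas only between records.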
The problems are:
1. Quotation marks are missing around the keys.
2. Strings are not wrapped in double quotation marks. A challenge here is that some strings contain apostrophes and quotation marks as well.
2a. Some strings have multiple occurrences of \n' + which split the string into multiple pieces on separate lines.
I’m working on macOS Sonoma 14.0, so I am using Bash. Here’s my solution that fixes problem 1:
find . -name '*.json' | xargs awk 'BEGIN {FS=OFS=":"}
{
    if (NF > 1) {
        gsub(/^ *| *$/, "", $1);  # Trim leading and trailing spaces
        $1 = "\"" $1 "\"";        # Wrap the key in double quotes
        print $0;
        next;
    }
    print;
}' >> output.txt
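One caveat with splitting on ":" is that any line containing a colon is treated as a key line, including a continuation line of review text that happens to contain one. For comparison, here is a rough Python sketch of the same key-quoting step that only matches an identifier at the start of a line. The name quote_keys and the sample text are made up for illustration:

```python
import re

# A key is an identifier at the start of a line, followed by a colon.
KEY = re.compile(r"^(\s*)([A-Za-z_$][\w$]*)\s*:", re.M)


def quote_keys(text):
    """Wrap each leading bare key in double quotes, keeping indentation."""
    return KEY.sub(r'\1"\2":', text)


# Illustrative record: the review text itself contains colons.
sample = r"""{
  id: '123456789',
  text: 'Note: first part \n' +
    'second part: done',
  url: 'https://itunes.apple.com/us/review?id=11111111111'
},"""

quoted = quote_keys(sample)
print(quoted)
```

Colons inside the URL and inside the review text are left alone because they never sit directly after a line-leading identifier. This still only addresses problem 1.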
Problem 2 is a little more complicated, so I would greatly appreciate any help devising a fix for it. I’m not concerned with quotations within the overall text (like the ‘Sorry …time’ portion in the example), so if it’s easier to remove all apostrophes and quotations that are within the string, that works for me as long as the text ends up in one continuous string wrapped in quotations.
Thank you.
2 Answers
Try this:
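One way to handle problems 1, 2 and 2a together in Python, as a minimal sketch rather than a definitive implementation. It assumes the saved file holds the complete console.log output (an array of flat records, with no truncated or nested placeholders), and the names inspect_to_json and unescape are invented here. The idea is to match whole string literals first, so that colons and quotes inside them are never mistaken for syntax, join pieces linked by +, and re-emit everything through json.dumps:

```python
import json
import re

# One string literal as console.log prints it: '...', "..." or `...`.
STRING = r"""'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*"|`(?:\\.|[^`\\])*`"""
# Either a run of literals joined by "+", or a bare key followed by ":".
TOKEN = re.compile(
    rf"(?P<strs>(?:{STRING})(?:\s*\+\s*(?:{STRING}))*)"
    rf"|(?P<key>[A-Za-z_$][\w$]*)(?=\s*:)",
    re.S,
)
SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "b": "\b", "f": "\f", "v": "\v", "0": "\0"}


def unescape(body):
    """Decode the backslash escapes inside one printed literal."""
    def repl(m):
        esc = m.group(1)
        if len(esc) > 1:                 # \xNN or \uNNNN
            return chr(int(esc[1:], 16))
        return SIMPLE.get(esc, esc)      # \' \" \\ map to the bare character
    return re.sub(r"\\(x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|.)", repl, body, flags=re.S)


def inspect_to_json(raw):
    """Quote bare keys and merge/re-quote string literals as JSON strings."""
    def repl(m):
        if m.group("key"):
            return json.dumps(m.group("key"))
        pieces = re.findall(STRING, m.group("strs"), flags=re.S)
        return json.dumps("".join(unescape(p[1:-1]) for p in pieces))
    return TOKEN.sub(repl, raw)


# Shortened, illustrative sample in the shape shown in the question.
sample = r"""[
  {
    id: '123456789',
    score: 1,
    text: 'Asks for passwords. \n' +
      'Reports "errors". A waste.',
    url: 'https://itunes.apple.com/us/review?id=11111111111'
  }
]"""

records = json.loads(inspect_to_json(sample))
print(records[0]["text"])
```

The resulting list of dicts can be passed to pandas.DataFrame(records). If the Node script itself can be changed, printing JSON.stringify of the result instead of passing console.log directly should produce valid JSON at the source and make this post-processing unnecessary.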
Post-processing solution in TXR:
Colorized by Vim: