
Dataset

Please advise on the best way to read this type of data into a data frame in R.

Using read.table("Software.txt") only gives the error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
line 1 did not have 6 elements.

Furthermore, this data (the Amazon dataset) is not in the traditional rows-and-columns format, so I would appreciate any help with that as well.

2 Answers


  1. Here is a solution based on readLines.

    r1 <- readLines('~/Downloads/Software.txt')  ## read raw text
    r2 <- r1[r1 != '']  ## drop blank lines; the field names then repeat every 10 lines
    r3 <- strsplit(r2, ': ')  ## split at `: `
    ## remove part before `: ` and make matrix with 10 rows
    r4 <- matrix(sapply(r3, `[`, 2), 10, dimnames=list(sapply(r3[1:10], `[`, 1), NULL))  
    r5 <- as.data.frame(t(r4))  ## transpose and coerce to df
    r6 <- setNames(r5, make.names(names(r5)))  ## names
    r6[r6 == 'unknown'] <- NA  ## generate NA's 
    r7 <- type.convert(r6, as.is=TRUE)  ## convert proper classes
    

    You can, of course, streamline this a little. I just wanted to show you the individual steps.

    Result

    str(r7)  
    # 'data.frame': 95084 obs. of  10 variables:
    # $ product.productId : chr  "B000068VBQ" "B000068VBQ" "B000068VBQ" "B000068VBQ" ...
    # $ product.title     : chr  "Fisher-Price Rescue Heroes" "Fisher-Price Rescue Heroes"  ...
    # $ product.price     : num  8.88 8.88 8.88 8.88 8.88 8.88 8.88 NA NA NA ...
    # $ review.userId     : chr  NA NA "A10P44U29RNOT6" NA ...
    # $ review.profileName: chr  NA NA "D. Jones" NA ...
    # $ review.helpfulness: chr  "11/11" "9/10" "6/6" "4/4" ...
    # $ review.score      : num  2 2 1 1 4 5 1 4 5 4 ...
    # $ review.time       : int  1042070400 1041552000 1126742400 1042416000 1045008000  ...
    # $ review.summary    : chr  "Requires too much coordination" "You can't pick which  ...
    # $ review.text       : chr  "I bought this software for my 5 year old. He has a couple ... 
    
  2. Your data appears to be in the same "Debian control file" (DCF) format that’s used to store package metadata. The correct import function for such data is

    read.dcf("Software.txt")
    

    Check out the ?read.dcf help page for more info.
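
To make the read.dcf suggestion concrete, here is a minimal sketch on a small stand-in file. The temporary file and its four fields are assumptions made for illustration (the real file has ten fields per record), and the second record's values are made up. The cleanup steps after read.dcf are borrowed from the first answer, not required by read.dcf itself.

```r
## Hypothetical sample: two records separated by a blank line,
## each line in "tag: value" form like the Amazon file.
sample_lines <- c(
  "product/productId: B000068VBQ",
  "product/price: 8.88",
  "review/score: 2.0",
  "review/summary: Requires too much coordination",
  "",
  "product/productId: B000068VBQ",
  "product/price: unknown",
  "review/score: 5.0",
  "review/summary: Note: works fine"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

m <- read.dcf(tmp)  ## character matrix, one row per record
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- make.names(names(df))      ## "product/price" -> "product.price"
df[df == "unknown"] <- NA               ## treat 'unknown' as missing
df <- type.convert(df, as.is = TRUE)    ## numeric columns become num
str(df)
```

Because read.dcf splits each line at the tag's colon only, a value that itself contains `: ` (as in the made-up summary above) should come through whole.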

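One caveat for the readLines approach in the first answer: strsplit splits at every `: `, so taking the second piece would cut off a value that contains `: ` itself. A possible workaround, sketched here on two made-up lines, is to split at the first occurrence only with regexpr and regmatches; the variable names are illustrative.

```r
lines <- c(
  "review/summary: Note: works fine",  ## made-up value containing ': '
  "product/price: 8.88"
)

## strsplit() splits at every ': ', so the first value is truncated
truncated <- sapply(strsplit(lines, ": "), `[`, 2)

## regexpr() locates only the first match in each string;
## invert = TRUE returns the text before and after that match
parts  <- regmatches(lines, regexpr(": ", lines, fixed = TRUE), invert = TRUE)
keys   <- vapply(parts, `[`, character(1), 1)
values <- vapply(parts, `[`, character(1), 2)
```

Here `truncated[1]` is only "Note", while `values[1]` keeps the full "Note: works fine".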