
I am trying to take the data frame that I generate from Azure Blob Storage and feed it into the next step (which extracts data in a certain way).

I have tested both parts separately (generating the data frame from Azure Blob Storage, and extracting data with a regex), and each works on its own. My challenge now is putting the two pieces of code together.

Here is first part (getting data frame from Azure Blob Storage):

import re 
from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient


blob = BlobClient(account_url="https://test.blob.core.windows.net",
                  container_name="xxxx",
                  blob_name="Text.csv",
                  credential="xxxx")

data = blob.download_blob()
df = pd.read_csv(data)

Here is second part (extracting only some parts from a csv file):

def read_block(names, igidx=True):
    with open("Test.csv") as f:   ###<<<This is where I would like to modify<<<###              
        pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
        return pd.concat([
            pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                    pat, f.read(), flags=re.M|re.S) if m.group(1) in names # optional
        ], keys=names, ignore_index=igidx)

df2 = read_block(names=["Admissions", "Readmissions"],igidx=False).droplevel(1).reset_index(names="Admission")   

So, what I am trying to do is use df from the first code block as the input to the second one, in place of the line that says with open("Test.csv") as f:.

How do I modify the second part of this code to take the data result from first part?
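One way to see that the regex step does not need a file at all: re.finditer works on any string, so the pattern can be tested on an in-memory sample first. A minimal sketch with made-up data shaped like the report csv (the block name, columns, and values here are all illustrative):

```python
import re

# made-up sample shaped like the report csv (illustrative only)
sample = (
    "Admissions,,,,,\n"
    "Not Started: 2,Sent: 1,,,,\n"
    ",,,,,\n"
    "Division,Community,Name,,,\n"
    ",Alpha,Jane,,,\n"
    ",,,,,\n"
)

# same pattern idea as read_block: capture the block name and its table body
pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
blocks = {m.group(1): m.group(2)
          for m in re.finditer(pat, sample, flags=re.M | re.S)}
```

Once this works on a string, the same call works on whatever string the blob download produces.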


Or, if that does not work, is there a way to use the downloader object (data) returned by Azure directly, like below?

<azure.storage.blob._download.StorageStreamDownloader object at 0x00000xxxxxxx>

Update:

I modified the code as below, and now I am getting a concat error:

I am not sure whether it is because there is no longer any looping context (I deleted the with open("Test.csv") as f: line when modifying the code).

...

data = blob.download_blob()
df = pd.read_csv(data)
df1 = df.to_csv(index=False, header=False)

def read_block(names, igidx=True):    
    pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
    return pd.concat([
        pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
            .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                pat, df1, flags=re.M|re.S) if m.group(1) in names 
    ], keys=names, ignore_index=igidx)

df2 = read_block(names=["Admissions", "Readmissions"], igidx=False).droplevel(1).reset_index(names="Admission")   
print(df2) 
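As a side note, header=False in df.to_csv(index=False, header=False) drops the first row of the buffer — and in this report, that first row is the block-name line the regex anchors on. A toy illustration (made-up values) of the difference:

```python
import pandas as pd
from io import StringIO

# toy frame whose "header" is actually the block-name line of the report
df = pd.read_csv(StringIO("Admissions,,\nNot Started: 1,Sent: 0,\n"))

with_header = df.to_csv(index=False)                   # keeps the "Admissions" line
without_header = df.to_csv(index=False, header=False)  # drops it
```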



This is df1:

Not Started: 12,Sent: 3,Completed: 3,,,
,,,,,
Division,Community,Resident Name,Date,Document Status,Last Update
,Test Station,Jane Doe ,9/12/2023,Sent,9/12/2023
,Test Station 2,John Doe,9/12/2023,Not Started,
,Alibaba Fizgerald,Super Man,9/12/2023,Not Started,
,Iceland Kingdom,Super Woman,9/12/2023,Not Started,
,,,,,
,,,,,
Readmissions,,,,,
Not Started: 1,Sent: 0,Completed: 1,,,
,,,,,
Division,Community,Resident Name,Date,Document Status,Last Update
,Station Kingdom,Pretty Woman ,9/12/2023,Not Started,
,My Goodness,Ugly Man,7/21/2023,Completed,7/26/2023
,,,,,
,,,,,
Discharge,,,,,
,,,,,
Division,Community,Resident Name,Date,,
,Station Kingdom1 ,Pretty Woman2 ,8/22/2023,,
,My Goodness1 ,Ugly Man1,4/8/2023,,
,Landmark2,Nice Guys,9/12/2023,,
,Iceland Kingdom2,Mr. Heroshi2,7/14/2023,,
,More Kingdom 2,King Kong ,8/31/2023,,

This is the csv file that df1 is generated from (screenshot not available).

This is the latest error message (screenshot not available).

This is my latest code (11/13/2023-1):

import re 
from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient
blob = BlobClient(account_url="https://xxxx.blob.core.windows.net",
                  container_name="xxxx",
                  blob_name="SampleSafe.csv",
                  credential="xxxx")

data = blob.download_blob()
df = pd.read_csv(data)
df1 = df.to_csv(index=False)

def read_block(names, igidx=True):    
    pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
    return pd.concat([
        pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
            .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                pat, data.readall(), flags=re.M|re.S)
               if m.group(1) in names], keys=names, ignore_index=igidx)

df2 = read_block(names=["Admissions", "Readmissions"], igidx=False).droplevel(1).reset_index(names="block")
print(df2) 
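One likely problem in this version (an assumption on my part, based on how file-like streams generally behave): pd.read_csv(data) already consumes the download stream, so the later data.readall() may have nothing left to return — and without encoding= on download_blob it would return bytes rather than str in any case. The consumed-stream effect can be seen with a plain StringIO:

```python
from io import StringIO

buf = StringIO("Admissions,,\nNot Started: 1,,\n")
first = buf.read()   # consumes the whole stream
second = buf.read()  # stream is exhausted now
```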

This is the detailed error message (updated 11/13/2023-1; screenshot not available).

2 Answers


  1. I modified the code … and now I am getting concat error !

    IIUC, that’s because the regex fails to match the required blocks. You need to remove header=False when making the buffer: df1 = df.to_csv(index=False). Or simply readall the downloaded blob into a string, to avoid reading the input csv as a DataFrame first:

    data = blob.download_blob(max_concurrency=1, encoding="UTF-8")
    
    def read_block(names, igidx=True):    
        pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
        return pd.concat([
            pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                    pat, data.readall(), flags=re.M|re.S)
                   if m.group(1) in names], keys=names, ignore_index=igidx)
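As a runnable sanity check of this approach, here is the same idea with the Azure download replaced by an inline string (made-up sample data; read_block is changed to take the text as a parameter):

```python
import re
from io import StringIO
import pandas as pd

# inline stand-in for data.readall() (made-up sample data)
csv_text = (
    "Admissions,,,,,\n"
    "Not Started: 1,Sent: 0,,,,\n"
    ",,,,,\n"
    "Division,Community,Resident Name,Date,,\n"
    ",Test Station,Jane Doe,9/12/2023,,\n"
    ",,,,,\n"
    ",,,,,\n"
)

def read_block(text, names, igidx=True):
    pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
    return pd.concat(
        [pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
           .iloc[:, 1:].dropna(how="all")
         for m in re.finditer(pat, text, flags=re.M | re.S)
         if m.group(1) in names],
        keys=names, ignore_index=igidx)

df2 = read_block(csv_text, names=["Admissions"])
```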
    
  2. I think you’re making life too hard by trying to regex match the strings. Better to start with your original df and extract rows and columns from it:

    # This code assumes that `df` is what you get from `pd.read_csv(df1)`
    # with the missing `Admissions` row reinstated
    
    # find the sections
    sections = df.loc[(df[0].ne('')) & (df[0].ne('Division')) & (~df[0].str.startswith('Not Started')), 0].to_list()
    
    # split the df into sections
    # find section headings
    section = df[0].isin(sections).cumsum()
    
    # find division headings
    division = (df[0] == 'Division').cumsum()
    
    # now iterate the sections and extract the appropriate pieces
    # into a dict of dataframes, removing blank lines and setting column
    # names as we go
    dfs = {}
    for i, sec in enumerate(sections):
        # extract all the lines from Division until the next section
        tmp = df.loc[(section == division) & (section == i+1), 1:]
        # rename the columns from the first row and drop it
        tmp = tmp.rename(columns=tmp.iloc[0]).iloc[1:]
        # remove any empty rows and columns
        mask = tmp.astype(bool)
        tmp = tmp.loc[mask.any(axis=1), mask.any(axis=0)]
        dfs[sec] = tmp.copy().reset_index(drop=True)
    
    print(dfs)
    

    Output:

    {
    'Admissions':
               Community Resident Name       Date Document Status Last Update
    0       Test Station      Jane Doe  9/12/2023            Sent   9/12/2023
    1     Test Station 2      John Doe  9/12/2023     Not Started
    2  Alibaba Fizgerald     Super Man  9/12/2023     Not Started
    3    Iceland Kingdom   Super Woman  9/12/2023     Not Started            
    ,
    'Readmissions': 
             Community  Resident Name       Date  Document Status Last Update
    0  Station Kingdom   Pretty Woman   9/12/2023     Not Started
    1      My Goodness       Ugly Man   7/21/2023       Completed   7/26/2023
    ,
    'Discharge': 
               Community   Resident Name       Date
    0  Station Kingdom1    Pretty Woman2  8/22/2023
    1      My Goodness1        Ugly Man1   4/8/2023
    2          Landmark2       Nice Guys  9/12/2023
    3   Iceland Kingdom2    Mr. Heroshi2  7/14/2023
    4     More Kingdom 2       King Kong  8/31/2023
    }
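The core trick in this answer is isin(...).cumsum(), which numbers every row by the section heading above it. On a toy column (made-up values):

```python
import pandas as pd

# each True at a section heading bumps the running count for all rows below it
col = pd.Series(["Admissions", "Division", "row", "Readmissions", "Division", "row"])
section = col.isin(["Admissions", "Readmissions"]).cumsum()
# rows 0-2 get label 1 (Admissions), rows 3-5 get label 2 (Readmissions)
```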
    