
I am trying to take the data frame that I generate from Azure Blob Storage and feed it into the next step (which extracts data in a certain way).

I have tested both parts separately (generating the data frame from Azure Blob Storage, and extracting data with a regex), and each works on its own. My challenge now is putting the two pieces of code together.

Here is first part (getting data frame from Azure Blob Storage):

import re 
from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient


blob = BlobClient(account_url="https://test.blob.core.windows.net",
                  container_name="xxxx",
                  blob_name="Text.csv",
                  credential="xxxx")

data = blob.download_blob()
df = pd.read_csv(data)

Here is second part (extracting only some parts from a csv file):

def read_block(names, igidx=True):
    with open("Test.csv") as f:   ###<<<This is where I would like to modify<<<###              
        pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
        return pd.concat([
            pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                    pat, f.read(), flags=re.M|re.S) if m.group(1) in names # optional
        ], keys=names, ignore_index=igidx)

df2 = read_block(names=["Admissions", "Readmissions"],igidx=False).droplevel(1).reset_index(names="Admission")   

So, what I am trying to do is use df from the first code block as the input to the second one, in place of the line that says with open("Test.csv") as f:.

How do I modify the second part of this code to take the data result from first part?
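One way to see that the regex step does not need a file at all: re.finditer works on any string, so the pattern can be tested on an in-memory sample first. A minimal sketch with made-up data shaped like the report csv (the block name, columns, and values here are all illustrative):

```python
import re

# made-up sample shaped like the report csv (illustrative only)
sample = (
    "Admissions,,,,,\n"
    "Not Started: 2,Sent: 1,,,,\n"
    ",,,,,\n"
    "Division,Community,Name,,,\n"
    ",Alpha,Jane,,,\n"
    ",,,,,\n"
)

# same pattern idea as read_block: capture the block name and its table body
pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
blocks = {m.group(1): m.group(2)
          for m in re.finditer(pat, sample, flags=re.M | re.S)}
```

Once this works on a string, the same call works on whatever string the blob download produces.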


Or, if that does not work, is there a way to use the downloader object (data) returned by Azure directly, like below?

<azure.storage.blob._download.StorageStreamDownloader object at 0x00000xxxxxxx>

Update:

I modified the code as below, and now I am getting a concat error:

I am not sure whether it is because there is no longer any looping context (I deleted the with open("Test.csv") as f: line when modifying the code).

...

data = blob.download_blob()
df = pd.read_csv(data)
df1 = df.to_csv(index=False, header=False)

def read_block(names, igidx=True):    
    pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
    return pd.concat([
        pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
            .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                pat, df1, flags=re.M|re.S) if m.group(1) in names 
    ], keys=names, ignore_index=igidx)

df2 = read_block(names=["Admissions", "Readmissions"], igidx=False).droplevel(1).reset_index(names="Admission")   
print(df2) 
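As a side note, header=False in df.to_csv(index=False, header=False) drops the first row of the buffer — and in this report, that first row is the block-name line the regex anchors on. A toy illustration (made-up values) of the difference:

```python
import pandas as pd
from io import StringIO

# toy frame whose "header" is actually the block-name line of the report
df = pd.read_csv(StringIO("Admissions,,\nNot Started: 1,Sent: 0,\n"))

with_header = df.to_csv(index=False)                   # keeps the "Admissions" line
without_header = df.to_csv(index=False, header=False)  # drops it
```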



This is df1:

Not Started: 12,Sent: 3,Completed: 3,,,
,,,,,
Division,Community,Resident Name,Date,Document Status,Last Update
,Test Station,Jane Doe ,9/12/2023,Sent,9/12/2023
,Test Station 2,John Doe,9/12/2023,Not Started,
,Alibaba Fizgerald,Super Man,9/12/2023,Not Started,
,Iceland Kingdom,Super Woman,9/12/2023,Not Started,
,,,,,
,,,,,
Readmissions,,,,,
Not Started: 1,Sent: 0,Completed: 1,,,
,,,,,
Division,Community,Resident Name,Date,Document Status,Last Update
,Station Kingdom,Pretty Woman ,9/12/2023,Not Started,
,My Goodness,Ugly Man,7/21/2023,Completed,7/26/2023
,,,,,
,,,,,
Discharge,,,,,
,,,,,
Division,Community,Resident Name,Date,,
,Station Kingdom1 ,Pretty Woman2 ,8/22/2023,,
,My Goodness1 ,Ugly Man1,4/8/2023,,
,Landmark2,Nice Guys,9/12/2023,,
,Iceland Kingdom2,Mr. Heroshi2,7/14/2023,,
,More Kingdom 2,King Kong ,8/31/2023,,

This is the csv file that df1 is generated from (screenshot not available).

This is the latest error message (screenshot not available).

This is my latest code (11/13/2023-1):

import re 
from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient
blob = BlobClient(account_url="https://xxxx.blob.core.windows.net",
                  container_name="xxxx",
                  blob_name="SampleSafe.csv",
                  credential="xxxx")

data = blob.download_blob()
df = pd.read_csv(data)
df1 = df.to_csv(index=False)

def read_block(names, igidx=True):    
    pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
    return pd.concat([
        pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
            .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                pat, data.readall(), flags=re.M|re.S)
               if m.group(1) in names], keys=names, ignore_index=igidx)

df2 = read_block(names=["Admissions", "Readmissions"], igidx=False).droplevel(1).reset_index(names="block")
print(df2) 
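One likely problem in this version (an assumption on my part, based on how file-like streams generally behave): pd.read_csv(data) already consumes the download stream, so the later data.readall() may have nothing left to return — and without encoding= on download_blob it would return bytes rather than str in any case. The consumed-stream effect can be seen with a plain StringIO:

```python
from io import StringIO

buf = StringIO("Admissions,,\nNot Started: 1,,\n")
first = buf.read()   # consumes the whole stream
second = buf.read()  # stream is exhausted now
```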

This is the detailed error message (updated 11/13/2023-1; screenshot not available).

2 Answers


  1. I modified the code … and now I am getting concat error !

    IIUC, that’s because the regex fails to match the required blocks. You need to remove header=False when making the buffer: df1 = df.to_csv(index=False). Or simply readall the downloaded blob into a string, to avoid reading the input csv as a DataFrame first:

    data = blob.download_blob(max_concurrency=1, encoding="UTF-8")
    
    def read_block(names, igidx=True):    
        pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
        return pd.concat([
            pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                    pat, data.readall(), flags=re.M|re.S)
                   if m.group(1) in names], keys=names, ignore_index=igidx)
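As a runnable sanity check of this approach, here is the same idea with the Azure download replaced by an inline string (made-up sample data; read_block is changed to take the text as a parameter):

```python
import re
from io import StringIO
import pandas as pd

# inline stand-in for data.readall() (made-up sample data)
csv_text = (
    "Admissions,,,,,\n"
    "Not Started: 1,Sent: 0,,,,\n"
    ",,,,,\n"
    "Division,Community,Resident Name,Date,,\n"
    ",Test Station,Jane Doe,9/12/2023,,\n"
    ",,,,,\n"
    ",,,,,\n"
)

def read_block(text, names, igidx=True):
    pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})"
    return pd.concat(
        [pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
           .iloc[:, 1:].dropna(how="all")
         for m in re.finditer(pat, text, flags=re.M | re.S)
         if m.group(1) in names],
        keys=names, ignore_index=igidx)

df2 = read_block(csv_text, names=["Admissions"])
```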
    
  2. I think you’re making life too hard by trying to regex match the strings. Better to start with your original df and extract rows and columns from it:

    # This code assumes that `df` is what you get from `pd.read_csv(df1)`
    # with the missing `Admissions` row reinstated
    
    # find the sections
    sections = df.loc[(df[0].ne('')) & (df[0].ne('Division')) & (~df[0].str.startswith('Not Started')), 0].to_list()
    
    # split the df into sections
    # find section headings
    section = df[0].isin(sections).cumsum()
    
    # find division headings
    division = (df[0] == 'Division').cumsum()
    
    # now iterate the sections and extract the appropriate pieces
    # into a dict of dataframes, removing blank lines and setting column
    # names as we go
    dfs = {}
    for i, sec in enumerate(sections):
        # extract all the lines from Division until the next section
        tmp = df.loc[(section == division) & (section == i+1), 1:]
        # rename the columns from the first row and drop it
        tmp = tmp.rename(columns=tmp.iloc[0]).iloc[1:]
        # remove any empty rows and columns
        mask = tmp.astype(bool)
        tmp = tmp.loc[mask.any(axis=1), mask.any(axis=0)]
        dfs[sec] = tmp.copy().reset_index(drop=True)
    
    print(dfs)
    

    Output:

    {
    'Admissions':
               Community Resident Name       Date Document Status Last Update
    0       Test Station      Jane Doe  9/12/2023            Sent   9/12/2023
    1     Test Station 2      John Doe  9/12/2023     Not Started
    2  Alibaba Fizgerald     Super Man  9/12/2023     Not Started
    3    Iceland Kingdom   Super Woman  9/12/2023     Not Started            
    ,
    'Readmissions': 
             Community  Resident Name       Date  Document Status Last Update
    0  Station Kingdom   Pretty Woman   9/12/2023     Not Started
    1      My Goodness       Ugly Man   7/21/2023       Completed   7/26/2023
    ,
    'Discharge': 
               Community   Resident Name       Date
    0  Station Kingdom1    Pretty Woman2  8/22/2023
    1      My Goodness1        Ugly Man1   4/8/2023
    2          Landmark2       Nice Guys  9/12/2023
    3   Iceland Kingdom2    Mr. Heroshi2  7/14/2023
    4     More Kingdom 2       King Kong  8/31/2023
    }
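The core trick in this answer is isin(...).cumsum(), which numbers every row by the section heading above it. On a toy column (made-up values):

```python
import pandas as pd

# each True at a section heading bumps the running count for all rows below it
col = pd.Series(["Admissions", "Division", "row", "Readmissions", "Division", "row"])
section = col.isin(["Admissions", "Readmissions"]).cumsum()
# rows 0-2 get label 1 (Admissions), rows 3-5 get label 2 (Readmissions)
```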
    