I am an experienced Azure user when it comes to data infrastructure and pipelines. However, I am new to processing XML data and my particular case seems somewhat complex. After exhausting my resources I am turning to the community for help.
Please note all data shown is demo data. There is no PII being leaked
The XML data is present in my blob (ADLS2). I am using Data Factory V2 copy activity to copy the data into my Azure SQL database. All permissions are appropriately configured.
Notice in the XML inside the ‘Applicant’ section you can see ‘Type’ of ‘primary’ which is always present. IF there is a coBorrower then there will be a second ‘Applicant’ portion with Type ‘coBorrower’.
Portion of XML under question:
<LoanExport>
<Application>
<GeneratedDate>06/15/2023</GeneratedDate>
<BusinessApplication/>
<ApplicationNumber>132</ApplicationNumber>
<ApplicationType>Joint</ApplicationType>
<Business>false</Business>
<Applicants>
<Applicant>
<Type>primary</Type>
<FirstName>Robert</FirstName>
<LastName>Ice</LastName>
<MiddleName/>
<NameSuffix/>
<DateOfBirth>02/22/1992</DateOfBirth>
<CreditScore/>
<SocialSecurityNumber>666423221</SocialSecurityNumber>
<IsUSCitizen>true</IsUSCitizen>
<PhoneNumber>5555555555</PhoneNumber>
<WorkPhoneNumber/>
<CellularPhoneNumber>3334445555</CellularPhoneNumber>
<EmailAddress>[email protected]</EmailAddress>
<DriversLicense/>
<DriversLicenseState/>
<DriversLicenseExpirationDate/>
<DriversLicenseIssueDate/>
<MaritalStatus>false</MaritalStatus>
<Addresses>
<Address>
<Type>Primary</Type>
<Address>
<Number>1</Number>
<StreetPrefix/>
<StreetName>Hawk Drive</StreetName>
<StreetType/>
<StreetSuffix/>
<ApartmentOrSuiteNumber/>
<PostalCode>60750</PostalCode>
<City>FANTASY ISLAND</City>
<State>MD</State>
</Address>
<TimeAtResidenceYears>5</TimeAtResidenceYears>
<TimeAtResidenceMonths>7</TimeAtResidenceMonths>
<Ownership>Own</Ownership>
<MonthlyPayment currency="USD">550.00</MonthlyPayment>
</Address>
</Addresses>
<Employments>
<Employment>
<isCurrentEmployment>true</isCurrentEmployment>
<Status>Employed</Status>
<EmployerName>TCI</EmployerName>
<ContactName/>
<ContactPhone>1111111111</ContactPhone>
<JobPosition>Business Analyst</JobPosition>
<Address>
<StreetName/>
<PostalCode/>
<City/>
<State/>
</Address>
<TimeAtEmployerYears>10</TimeAtEmployerYears>
<TimeAtEmployerMonths>6</TimeAtEmployerMonths>
<GrossMonthlyIncome currency="USD">0.00</GrossMonthlyIncome>
</Employment>
</Employments>
<DenialReasons>
<DenialReason>
<Reason>Applicant has no SS number</Reason>
</DenialReason>
</DenialReasons>
</Applicant>
<Applicant>
<Type>coBorrower</Type>
<FirstName>asd</FirstName>
<LastName>asd</LastName>
<MiddleName>asd</MiddleName>
<NameSuffix>IV</NameSuffix>
<DateOfBirth>07/17/1973</DateOfBirth>
<CreditScore/>
<SocialSecurityNumber>333333333</SocialSecurityNumber>
<IsUSCitizen>true</IsUSCitizen>
<PhoneNumber>1231231231</PhoneNumber>
<WorkPhoneNumber/>
<CellularPhoneNumber>1231231231</CellularPhoneNumber>
<EmailAddress>[email protected]</EmailAddress>
<DriversLicense/>
<DriversLicenseState/>
<DriversLicenseExpirationDate/>
<DriversLicenseIssueDate/>
<MaritalStatus>false</MaritalStatus>
<Addresses>
<Address>
<Type>Primary</Type>
<Address>
<Number/>
<StreetPrefix/>
<StreetName>asdasd</StreetName>
<StreetType/>
<StreetSuffix/>
<ApartmentOrSuiteNumber/>
<PostalCode>12345</PostalCode>
<City>Schenectady</City>
<State>NY</State>
</Address>
<TimeAtResidenceYears>17</TimeAtResidenceYears>
<TimeAtResidenceMonths>11</TimeAtResidenceMonths>
<Ownership>Living with relatives</Ownership>
<MonthlyPayment currency="USD">2222.00</MonthlyPayment>
</Address>
</Addresses>
<Employments>
<Employment>
<isCurrentEmployment>true</isCurrentEmployment>
<Status>Retired</Status>
<EmployerName/>
<ContactName/>
<ContactPhone/>
<JobPosition/>
<Address>
<StreetName/>
<PostalCode/>
<City/>
<State/>
</Address>
<TimeAtEmployerYears>0</TimeAtEmployerYears>
<TimeAtEmployerMonths>0</TimeAtEmployerMonths>
<GrossMonthlyIncome currency="USD">0.00</GrossMonthlyIncome>
</Employment>
</Employments>
<DenialReasons>
<DenialReason>
<Reason>Applicant has no SS number</Reason>
</DenialReason>
</DenialReasons>
</Applicant>
</Applicants>
I am using a small SQL table for testing being able to pull both the primary and coborrower data correctly.
CREATE TABLE [bronze].[loan_origination_application](
[application_number] [nvarchar](max) NULL,
[generated_date] [nvarchar](max) NULL,
[applicant_primary_first_name] [nvarchar](max) NULL,
[applicant_coborrower_first_name] [nvarchar](max) NULL
)
Here is the portion of my Azure Data Factory pipeline JSON:
},
"sink": {
"type": "AzureSqlSink",
"writeBehavior": "insert",
"sqlWriterUseTableLock": false,
"disableMetricsCollection": false
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"path": "['Application']['GeneratedDate']"
},
"sink": {
"name": "generated_date",
"type": "String"
}
},
{
"source": {
"path": "['Application']['ApplicationNumber']"
},
"sink": {
"name": "application_number",
"type": "String"
}
},
{
"source": {
"path": "['Application']['Applicants']['Applicant']['FirstName']"
},
"sink": {
"name": "applicant_primary_first_name",
"type": "String"
}
},
{
"source": {
"path": "['Application']['Applicants']['Applicant'][1]['FirstName']"
},
"sink": {
"name": "applicant_coborrower_first_name",
"type": "String"
}
}
],
"collectionReference": "$['applications']['LoanExport']",
"mapComplexValuesToString": true
}
Note in all my databse output the 4 columns are respectively:
‘application_number’, ‘generated_date’, ‘applicant_primary_first_name’, ‘applicant_coborrower_first_name’
This currently results in the following write in the databse:
132 06/15/2023 NULL asd
134 06/15/2023 Robert NULL
135 06/15/2023 Robert NULL
136 06/15/2023 Robert NULL
Note that in this case if there is a coborrower applicant (application 132) it will pull the correct name but will ignore the primary applicant. When there is no coborrower applicant then the primary applicant is pulled in accurately.
I have tried the following as well as other similar approaches:
Data Factory JSON:
{
"source": {
"path": "['Application']['Applicants']['Applicant'][0]['FirstName']"
},
"sink": {
"name": "applicant_primary_first_name",
"type": "String"
}
},
{
"source": {
"path": "['Application']['Applicants']['Applicant'][1]['FirstName']"
},
"sink": {
"name": "applicant_coborrower_first_name",
"type": "String"
}
}
Database output:
132 06/15/2023 Robert asd
134 06/15/2023 NULL NULL
135 06/15/2023 NULL NULL
136 06/15/2023 NULL NULL
In this case you can see the primary and coborrower applicant types are both pulled correctly but only when there is a coborrower not when there is only a primary applicant.
I also tried some Chat GPT suggestions involving filtering which did not help although I feel it is on the right track.
{
"source": {
"path": "['Application']['Applicants']['Applicant'][?(@['Type']=='primary')]['FirstName']"
},
"sink": {
"name": "applicant_primary_first_name",
"type": "String"
}
},
{
"source": {
"path": "['Application']['Applicants']['Applicant'][?(@['Type']=='coBorrower')]['FirstName']"
},
"sink": {
"name": "applicant_coborrower_first_name",
"type": "String"
}
}
Database output has the same effect as above. Applicant of type primary is only pulled when applicant of type coborrower is also present.
132 06/15/2023 Robert asd
134 06/15/2023 NULL NULL
135 06/15/2023 NULL NULL
136 06/15/2023 NULL NULL
For clarity I will provide the XML for an application with only the primary type applicant.
<LoanExport>
<Application>
<GeneratedDate>06/15/2023</GeneratedDate>
<BusinessApplication/>
<ApplicationNumber>134</ApplicationNumber>
<ApplicationType>Individual</ApplicationType>
<Business>false</Business>
<Applicants>
<Applicant>
<Type>primary</Type>
<FirstName>Robert</FirstName>
<LastName>Ice</LastName>
<MiddleName/>
<NameSuffix/>
<DateOfBirth>02/22/1992</DateOfBirth>
<CreditScore>0</CreditScore>
<SocialSecurityNumber>666423221</SocialSecurityNumber>
<IsUSCitizen>true</IsUSCitizen>
<PhoneNumber>5555555555</PhoneNumber>
<WorkPhoneNumber/>
<CellularPhoneNumber>3334445555</CellularPhoneNumber>
<EmailAddress>[email protected]</EmailAddress>
<DriversLicense/>
<DriversLicenseState/>
<DriversLicenseExpirationDate/>
<DriversLicenseIssueDate/>
<MaritalStatus>false</MaritalStatus>
<Addresses>
<Address>
<Type>Primary</Type>
<Address>
<Number>1</Number>
<StreetPrefix/>
<StreetName>Hawk Drive</StreetName>
<StreetType/>
<StreetSuffix/>
<ApartmentOrSuiteNumber/>
<PostalCode>60750</PostalCode>
<City>FANTASY ISLAND</City>
<State>MD</State>
</Address>
<TimeAtResidenceYears>5</TimeAtResidenceYears>
<TimeAtResidenceMonths>7</TimeAtResidenceMonths>
<Ownership>Own</Ownership>
<MonthlyPayment currency="USD">550.00</MonthlyPayment>
</Address>
</Addresses>
<Employments>
<Employment>
<isCurrentEmployment>true</isCurrentEmployment>
<Status>Employed</Status>
<EmployerName>TCI</EmployerName>
<ContactName/>
<ContactPhone>1111111111</ContactPhone>
<JobPosition>Business Analyst</JobPosition>
<Address>
<StreetName/>
<PostalCode/>
<City/>
<State/>
</Address>
<TimeAtEmployerYears>10</TimeAtEmployerYears>
<TimeAtEmployerMonths>6</TimeAtEmployerMonths>
<GrossMonthlyIncome currency="USD">0.00</GrossMonthlyIncome>
</Employment>
</Employments>
<Stipulations>
<Stipulation>
<Description>VOI Applicant</Description>
<Status>Active</Status>
</Stipulation>
</Stipulations>
</Applicant>
</Applicants>
2
Answers
Azure SQL DB can read directly from blob storage using
OPENROWSET
. See the start of the technique here.For your particular piece of XML, manipulate it with
CROSS APPLY
to drill deeper into it: