I receive a CSV from a third party in this format:
Job Requisition Id,Brand,Is Deleted,Internal Status,Job Code,Job Title-JobRequisitionLocale,Posting Start Date-JobRequisitionPosting,Posting End Date-JobRequisitionPosting,Job Posting Status-JobRequisitionPosting,Minimum Salary,Maximum Salary,Job Board Id-JobRequisitionPosting,Postal Code,Job Description-JobRequisitionLocale
1,TEST,TEST,TEST,TEST,TEST,2024-07-16T11:41:50Z,2024-07-31T00:59:59Z,TEST,00.00,00,00,TEST,TEST,"<p class="MsoNoSpacing"><span style="font-size:11pt"><span style="font-family:Calibri,sans-serif"><b><span lang="EN-US" style="font-size:12.0pt">Role: </span></b></p"
I have shortened the HTML markup; it's much longer. I'm trying to read ALL of the HTML markup at the end of each row and store it in the Job Description column (the last column) as plain text in a DB.
But the parser keeps ending the row at the first comma inside the quoted HTML markup, right after "Calibri": <span style="font-family:Calibri,
This is my controller code:
var jobPositions = new List<JobAdvertModel>();

await foreach (var blobItem in containerClient.GetBlobsAsync())
{
    // download each CSV blob from the container
    var blobClient = containerClient.GetBlobClient(blobItem.Name);
    var blobDownloadInfo = await blobClient.DownloadAsync();

    using (var streamReader = new StreamReader(blobDownloadInfo.Value.Content))
    using (var csvReader = new CsvReader(streamReader, new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Delimiter = ",",
        BadDataFound = context =>
        {
            // Handle or log bad data
            Console.WriteLine($"Bad data found: {context.RawRecord}");
        }
    }))
    {
        csvReader.Context.RegisterClassMap<JobAdvertMap>();
        var records = csvReader.GetRecords<JobAdvertModel>().ToList();
        jobPositions.AddRange(records);
    }
}

return Ok(jobPositions);
(I removed the Azure bits above, but it's reading from an Azure blob that is ingested weekly.)
I’m trying to get all the HTML markup into 1 column as plain text to store in a DB.
2 Answers
As already pointed out in the comments, your CSV is malformed, since quotes inside the quoted field are not properly escaped by doubling them ("" instead of "). The HTML also seems to be malformed, as the closing </p tag is not properly terminated (it should be </p>). Having said that, the best option would certainly be to correct this on the producer side and actually get proper data.

Assuming that's not an option and you always get the same kind of malformed data, you could correct the CSV by properly escaping it and then continue to use the existing CSV parser. You could also correct the HTML, if required, which is not done here.
Please note: I had to make a simple adjustment to your data, i.e. adding a 0 as the line index for the first row, so that I could disable HasHeaderRecord in order to use my simple Test record; otherwise the CSV library would throw an error since no properties matching the header names are found. You can make the whole thing a lot more performant by using Span<char>s if required (sketched below). Alternatively, you could of course also write your own CSV parser that can deal with/expects malformed data.
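For illustration only (again my own sketch, not the answer's code), a Span<char>-based slice that pulls out the description column without allocating intermediate strings could look like this:

// Illustrative only: walk the row once and slice out the description
// column as a span, avoiding intermediate string allocations.
static ReadOnlySpan<char> DescriptionSpan(ReadOnlySpan<char> row, int precedingColumns = 13)
{
    int commas = 0;
    for (int i = 0; i < row.Length; i++)
    {
        if (row[i] == ',' && ++commas == precedingColumns)
            return row.Slice(i + 1).Trim('"'); // remainder of the row, outer quotes stripped
    }
    return ReadOnlySpan<char>.Empty; // row has fewer columns than expected
}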
If you know that the HTML is always in the last column (as you mentioned in your question), then why not split your data based on that information?
Count character by character and keep track of the commas you encounter; once you have encountered 13 commas (the header defines 14 columns, so the last one starts after the 13th comma), you know the rest of the row is HTML.
Split the HTML out of the row, then process the remainder as a normal CSV row.
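A minimal sketch of this approach (the identifier names are mine, and the naive tag stripping is an assumption about what "plain text" should mean here):

using System;
using System.Text.RegularExpressions;

// Split a data row at the comma that precedes the description column,
// then handle the two halves separately.
static (string[] Fields, string PlainText) SplitRow(string row, int precedingColumns = 13)
{
    int splitAt = -1, commas = 0;
    for (int i = 0; i < row.Length; i++)
    {
        if (row[i] == ',' && ++commas == precedingColumns) { splitAt = i; break; }
    }
    if (splitAt < 0)
        throw new FormatException("Row has fewer columns than expected.");

    // the leading columns contain no commas or quotes in the sample data,
    // so a plain Split is enough for them
    var fields = row.Substring(0, splitAt).Split(',');

    // everything after the split point is the HTML description
    var html = row.Substring(splitAt + 1).Trim('"');

    // naive tag stripping; swap in a real HTML parser if the markup matters
    var plainText = Regex.Replace(html, "<[^>]+>", string.Empty).Trim();

    return (fields, plainText);
}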