Need help parsing HTML markup with CSVHelper through my .NET API

wiizimov
August 13, 2024
217 views
0 votes
2 Answers

I receive a CSV from an third party in this format:

Job Requisition Id,Brand,Is Deleted,Internal Status,Job Code,Job Title-JobRequisitionLocale,Posting Start Date-JobRequisitionPosting,Posting End Date-JobRequisitionPosting,Job Posting Status-JobRequisitionPosting,Minimum Salary,Maximum Salary,Job Board Id-JobRequisitionPosting,Postal Code,Job Description-JobRequisitionLocale
1,TEST,TEST,TEST,TEST,TEST,2024-07-16T11:41:50Z,2024-07-31T00:59:59Z,TEST,00.00,00,00,TEST,TEST,"<p class="MsoNoSpacing"><span style="font-size:11pt"><span style="font-family:Calibri,sans-serif"><b><span lang="EN-US" style="font-size:12.0pt">Role: </span></b></p"

I have shortened the HTML markup, it’s much longer. However, I’m trying to read ALL the HTML markup that’s at the end of each row, to store in the Job Description column (the last column) as just plain text to store in a DB.

But it seems to keep ending the row at the first comma in the quoted HTML markup after "Calibri": <span style="font-family:Calibri,

This is my controller code:

var jobPositions = new List<JobAdvertModel>();

await foreach (var blobItem in containerClient.GetBlobsAsync())
{
    var blobClient = containerClient.GetBlobClient(blobItem.Name);
    var blobDownloadInfo = await blobClient.DownloadAsync();

    using (var streamReader = new StreamReader(blobDownloadInfo.Value.Content))
    using (var csvReader = new CsvReader(streamReader, new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Delimiter = ",",
        BadDataFound = context =>
        {
            // Handle or log bad data
            Console.WriteLine($"Bad data found: {context.RawRecord}");
        }
    }))
    {
        csvReader.Context.RegisterClassMap<JobAdvertMap>();
        var records = csvReader.GetRecords<JobAdvertModel>().ToList();
        jobPositions.AddRange(records);
    }
}

return Ok(jobPositions);

(Removed the Azure bits above, but it’s reading from an azure blob that is ingested weekly)

I’m trying to get all the HTML markup into 1 column as plain text to store in a DB.

Tags: .net c#csv csvhelper html

Answers

As already pointed out in the comments, your CSV is malformed since quotes are not properly escaped with "". Also the HTML seems to be malformed as well as the </p tag is not properly closed (should be </p>). Having said that, the best option would certainly be to correct this on the producer side and therefore actually get proper data.

Assuming that’s not an option and you always get the same malformed data you could just correct the CSV by properly escaping it and then continue to use the existing CSV parser. You could also correct the HTML, if required, which is not done here.

using System.Globalization;
using System.Text.Json;
using CsvHelper;
using CsvHelper.Configuration;

namespace MyProgram;

public class Program
{
    record Test(int LineIndex, string A, string B, string C, string D, string E, string F, string G, string H, string I, string J, string K, string L, string M,
                string Html);
    public static async Task Main()
    {
        var toBeParsed = """
        0,Job Requisition Id,Brand,Is Deleted,Internal Status,Job Code,Job Title-JobRequisitionLocale,Posting Start Date-JobRequisitionPosting,Posting End Date-JobRequisitionPosting,Job Posting Status-JobRequisitionPosting,Minimum Salary,Maximum Salary,Job Board Id-JobRequisitionPosting,Postal Code,Job Description-JobRequisitionLocale
        1,TEST,TEST,TEST,TEST,TEST,2024-07-16T11:41:50Z,2024-07-31T00:59:59Z,TEST,00.00,00,00,TEST,TEST,"<p class="MsoNoSpacing"><span style="font-size:11pt"><span style="font-family:Calibri,sans-serif"><b><span lang="EN-US" style="font-size:12.0pt">Role: </span></b></p"
        """;
        var correctCsv = CorrectCsv(toBeParsed);
        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = false, // just to make it work with the provided "Test" record
            NewLine = Environment.NewLine,
        };
        using var reader = new StringReader(correctCsv);
        using var csvReader = new CsvReader(reader, config);
        var parsedCsv = csvReader.GetRecords<Test>().ToArray();
        Console.WriteLine(JsonSerializer.Serialize(parsedCsv, new JsonSerializerOptions { WriteIndented = true }));
        var lastLineLastRow = parsedCsv[^1].Html;
        Console.WriteLine(lastLineLastRow);
    }

    private static string CorrectCsv(string toBeParsed)
    {
        var lastSeparatorIndex = toBeParsed.LastIndexOf(","<p"); // search from back for last occurence of ",<p" within the string
        var html = toBeParsed[(lastSeparatorIndex + 2)..].TrimEnd('"'); // extract HTML from string
        var escapedHtml = html.Replace(""", """"); // escape "
        return toBeParsed[..lastSeparatorIndex] + ","" + escapedHtml + """; // concatenate new correctly escaped CSV
    }
}

Expected output:

[
  {
    "LineIndex": 0,
    "A": "Job Requisition Id",
    "B": "Brand",
    "C": "Is Deleted",
    "D": "Internal Status",
    "E": "Job Code",
    "F": "Job Title-JobRequisitionLocale",
    "G": "Posting Start Date-JobRequisitionPosting",
    "H": "Posting End Date-JobRequisitionPosting",
    "I": "Job Posting Status-JobRequisitionPosting",
    "J": "Minimum Salary",
    "K": "Maximum Salary",
    "L": "Job Board Id-JobRequisitionPosting",
    "M": "Postal Code",
    "Html": "Job Description-JobRequisitionLocale"
  },
  {
    "LineIndex": 1,
    "A": "TEST",
    "B": "TEST",
    "C": "TEST",
    "D": "TEST",
    "E": "TEST",
    "F": "2024-07-16T11:41:50Z",
    "G": "2024-07-31T00:59:59Z",
    "H": "TEST",
    "I": "00.00",
    "J": "00",
    "K": "00",
    "L": "TEST",
    "M": "TEST",
    "Html": "u003Cp class=u0022MsoNoSpacingu0022u003Eu003Cspan style=u0022font-size:11ptu0022u003Eu003Cspan style=u0022font-family:Calibri,sans-serifu0022u003Eu003Cbu003Eu003Cspan lang=u0022EN-USu0022 style=u0022font-size:12.0ptu0022u003ERole: u003C/spanu003Eu003C/bu003Eu003C/p"
  }
]
<p class="MsoNoSpacing"><span style="font-size:11pt"><span style="font-family:Calibri,sans-serif"><b><span lang="EN-US" style="font-size:12.0pt">Role: </span></b></p

Please note: I had to make a simple adjustment in your data, i.e., adding a 0 as the line index for the first row and so that I could disable HasHeaderRecord in order to use my simple Test record, otherwise the CSV library would throw an error since no matching properties to the headers names are found.

You can make the whole thing a whole lot more performant by using Span<char>s if required.

Alternatively, you could of course also write your own CSV parser that can deal with/expects malformed data.

- sunil
- August 13, 2024 at 6:05 pm
- 0 votes
0
If you know that the HTML is always the last column (as you mentioned in your question), then why not split your data based on this information?

Count char by char and keep track of comma you encounter, once you have encountered 13 commas, you already know rest is HTML.
```
public static void Main()
{
    var line = "1,TEST,TEST,TEST,TEST,TEST,2024-07-16T11:41:50Z,2024-07-31T00:59:59Z,TEST,00.00,00,00,TEST,TEST,HTMLSTARTSHEHRE";
    var commaCounter = 0;
    var charCounter = 0;
    var htmlStartPos = 0;
    foreach(char c in line.ToCharArray())
    {
        charCounter++;
        if(c == ',')
            commaCounter++;
        
        if(commaCounter == 13)
            htmlStartPos = charCounter;
    }
    var htmlText = line.Substring(htmlStartPos+1);
    Console.WriteLine("HTML Starts at:" + htmlStartPos+1);
    Console.WriteLine("HTMLTEXT: " + htmlText);
}
```
Split out the HTML from the row, then process the row as a normal CSV row.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.