I’m downloading a file from a 3rd party server, like so:
Try
    req = DirectCast(HttpWebRequest.Create("https://www.example.com/my.xml"), HttpWebRequest)
    req.Timeout = 100000 ' 100 seconds
    Resp = DirectCast(req.GetResponse(), HttpWebResponse)
    reader = New StreamReader(Resp.GetResponseStream)
    responseString = reader.ReadToEnd()
Catch ex As Exception
End Try
The file my.xml is 1.2GB and I’m getting the error "Exception of type ‘System.OutOfMemoryException’ was thrown."
When I open Windows Task Manager I see memory usage at just 70% of total available memory, and the IIS worker process is not growing to use the full system memory.
Then I found this: https://learn.microsoft.com/en-us/archive/blogs/tom/chat-question-memory-limits-for-32-bit-and-64-bit-processes, so failing at around 70% of system memory sounds about right.
So now I'm considering splitting the file into smaller, more manageable chunks. However, how can I do this without creating separate files? Is there a way to load, for example, 100 MB into memory at a time (respecting XML node boundaries), or perhaps to read X number of XML nodes at a time?
When I Google "Read large XML file from webserver without splitting in smaller chunks", I get nothing but file-splitting tools.
UPDATE 1
Based on Lex Li's suggestion, I searched and found this tutorial: https://learn.microsoft.com/en-us/dotnet/standard/linq/perform-streaming-transform-large-xml-documents
I translated the code to VB.NET, and it works as described in the tutorial:
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()
        While reader.Read()
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Customer" Then
                While reader.Read()
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Name" Then
                        name = TryCast(XElement.ReadFrom(reader), XElement)
                        Exit While
                    End If
                End While
                While reader.Read()
                    If reader.NodeType = XmlNodeType.EndElement Then Exit While
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Item" Then
                        item = TryCast(XElement.ReadFrom(reader), XElement)
                        If item IsNot Nothing Then
                            Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                            tempRoot.Add(item)
                            Yield item
                        End If
                    End If
                End While
            End If
        End While
    End Using
End Function
Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement) =
        From el In StreamCustomerItem("https://www.example.com/source.xml")
        Select New XElement("Item",
                            New XElement("Customer", CStr(el.Parent.Element("Name"))),
                            New XElement(el.Element("Key")))
    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True
    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files") + "Output.xml", xws)
        xw.WriteStartElement("Root")
        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next
        xw.WriteEndElement()
    End Using
End Sub
The example above transforms source.xml into an output.xml, but all I want is to read the product nodes exactly as they are (no transformation needed), one node at a time, so I can process large XML files.
I tried to rewrite it to extract values from my XML following its structure. First, I tried simply getting something read from my XML file, like so:
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()
        While reader.Read()
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Id" Then
                name = TryCast(XElement.ReadFrom(reader), XElement)
                item = TryCast(XElement.ReadFrom(reader), XElement)
                If item IsNot Nothing Then
                    Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                    tempRoot.Add(item)
                    Yield item
                End If
                Exit While
            End If
        End While
    End Using
End Function
Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement)
    srcTree = From el In StreamCustomerItem("https://www.example.com/mysource.xml")
              Select New XElement("product", New XElement("product", CStr(el.Parent.Element("Id"))))
    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True
    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files") + "Output.xml", xws)
        xw.WriteStartElement("Root")
        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next
        xw.WriteEndElement()
    End Using
End Sub
That just writes <Root /> to my output.xml, though.
mysource.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<products>
  <product>
    <Id>
      <![CDATA[122854]]>
    </Id>
    <Type>
      <![CDATA[restaurant]]>
    </Type>
    <features>
      <wifi>
        <![CDATA[included]]>
      </wifi>
    </features>
  </product>
</products>
So to summarize my question: how can I read individual product nodes as-is from "mysource.xml" without loading the full file into memory?
UPDATE 2
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()
        While Not reader.EOF
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                Dim el As XElement = TryCast(XElement.ReadFrom(reader), XElement)
                If el IsNot Nothing Then Yield el
            Else
                reader.Read()
            End If
        End While
    End Using
End Function
Private Shared Sub Main()
    ' StreamCustomerItem yields XElement (LINQ to XML), not XmlElement,
    ' and the query result is what the For Each iterates over.
    Dim elements As IEnumerable(Of XElement) = From el In StreamCustomerItem("source.xml") Select el
    For Each product As XElement In elements
        ' Here, process each `product` element.
        Console.WriteLine(product)
    Next
End Sub
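Once a product element is yielded this way, the plan is to read its values with plain LINQ to XML calls, along these lines (a sketch using the element names from mysource.xml above, assuming every product has Id, Type and features/wifi):
For Each product As XElement In StreamCustomerItem("https://www.example.com/mysource.xml")
    ' Only one <product> is in memory at a time; Trim() strips the whitespace around the CDATA values.
    Dim id As String = product.Element("Id").Value.Trim()
    Dim productType As String = product.Element("Type").Value.Trim()
    Dim wifi As String = product.Element("features").Element("wifi").Value.Trim()
    Console.WriteLine("{0} ({1}) wifi: {2}", id, productType, wifi)
Next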
My full test file is available via OnionShare (use Tor Browser to download):
http://jkntfybog2s5cc754sn7mujvyaawdqxd4q5imss66x3hsos34rrbjrid.onion
Key: YLTDQSDHTBWGDGQ6FIADTN2K7GFOFT5R7SFKWKTDER3WETD7EMKA
3 Answers
Did you check out this documentation from Microsoft yet? https://learn.microsoft.com/en-us/dotnet/standard/linq/stream-xml-fragments-xmlreader
I had a similar issue, but with reading a large JSON file. What I did there was read the token representing the start of a product and iterate over those tokens; that way you never load the entire file into memory. I believe the same approach can be used for XML as well.
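A minimal VB.NET sketch of that element-by-element idea, assuming the <product> layout of mysource.xml shown above (the function name and usage are illustrative, not tested against the real feed):
' Requires Imports System.Xml and Imports System.Xml.Linq.
Private Shared Iterator Function StreamProducts(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        ' Jump from one <product> start tag to the next; only the current
        ' product is ever materialized as an XElement.
        While reader.ReadToFollowing("product")
            Using productReader As XmlReader = reader.ReadSubtree()
                Yield XElement.Load(productReader)
            End Using
        End While
    End Using
End Function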
Hope it helps.
This is a bit of an old-school approach, but I usually keep track of the XPath address of where I am inside the XML file, then use that XPath to work out what to do with the value.
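A rough sketch of that path-tracking idea, assuming the mysource.xml layout above (names and output format are illustrative):
' Requires Imports System.Xml, Imports System.Linq and Imports System.Collections.Generic.
Private Shared Sub DumpValuesWithPath(ByVal uri As String)
    Dim path As New Stack(Of String)()
    Using reader As XmlReader = XmlReader.Create(uri)
        While reader.Read()
            Select Case reader.NodeType
                Case XmlNodeType.Element
                    path.Push(reader.Name)
                    ' Empty elements (e.g. <wifi/>) never produce an EndElement node.
                    If reader.IsEmptyElement Then path.Pop()
                Case XmlNodeType.EndElement
                    path.Pop()
                Case XmlNodeType.Text, XmlNodeType.CDATA
                    ' Prints e.g. "/products/product/features/wifi = included".
                    Dim currentPath As String = "/" & String.Join("/", path.Reverse())
                    Console.WriteLine("{0} = {1}", currentPath, reader.Value.Trim())
            End Select
        End While
    End Using
End Sub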
The important thing is to make sure you never load the whole file, but "stream" everything (in the general sense: bytes, characters, XML nodes, etc.) from end to end (i.e., server to client here).
For network bytes, it means you must use a raw Stream object. For XML nodes, it means you can use an XmlReader (not an XmlDocument, which loads a full document object model from a stream). In this case, you can use an XmlTextReader, which "Represents a reader that provides fast, non-cached, forward-only access to XML data".
Here is the idea in code (originally C#, but easily translated to VB.NET): stream the file, but still build a small intermediary XML document for each product in the multi-gigabyte file, using the XmlReader methods ReadInnerXml and/or ReadOuterXml.
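A minimal VB.NET sketch along those lines, assuming the <product> layout of mysource.xml above and the example URL from the question:
' Requires Imports System.IO, Imports System.Net and Imports System.Xml.
Private Shared Sub StreamProductsFromServer()
    Dim req As HttpWebRequest = DirectCast(WebRequest.Create("https://www.example.com/mysource.xml"), HttpWebRequest)
    Using resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
        ' Hand the raw response stream to the reader, so bytes are consumed as
        ' they arrive instead of being buffered into one huge string.
        Using stream As Stream = resp.GetResponseStream()
            Using reader As XmlReader = XmlReader.Create(stream)
                reader.MoveToContent()
                While Not reader.EOF
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                        ' ReadOuterXml returns one <product>...</product> fragment as a
                        ' small string and advances the reader past it.
                        Dim fragment As String = reader.ReadOuterXml()
                        ' Process the fragment here (parse it, store it, etc.).
                        Console.WriteLine(fragment)
                    Else
                        reader.Read()
                    End If
                End While
            End Using
        End Using
    End Using
End Sub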
PS: Ideally, you could use async/await code with the Async counterpart methods ReadInnerXmlAsync / ReadOuterXmlAsync, but that is another story and is easy to set up.