I’m downloading a file from a 3rd party server, like so:
Try
    req = DirectCast(HttpWebRequest.Create("https://www.example.com/my.xml"), HttpWebRequest)
    req.Timeout = 100000 ' 100 seconds
    Resp = DirectCast(req.GetResponse(), HttpWebResponse)
    reader = New StreamReader(Resp.GetResponseStream)
    responseString = reader.ReadToEnd()
Catch ex As Exception
End Try
The file my.xml is 1.2GB and I’m getting the error "Exception of type ‘System.OutOfMemoryException’ was thrown."
When I open Windows Task Manager I see memory usage at just 70% of total available memory, and the IIS worker process is not growing to use the full system memory.
Then I found this: https://learn.microsoft.com/en-us/archive/blogs/tom/chat-question-memory-limits-for-32-bit-and-64-bit-processes, so failing at around 70% of system memory sounds about right.
So now I'm considering splitting the file into smaller, more manageable chunks. However, how can I do this without creating separate files? Is there a way to load, for example, 100 MB into memory at a time (respecting XML node boundaries), or perhaps to read X number of XML nodes at a time?
When I Google "Read large XML file from webserver without splitting in smaller chunks", I get nothing but file-splitting tools.
UPDATE 1
Based on Lex Li's suggestion, I searched and found this tutorial: https://learn.microsoft.com/en-us/dotnet/standard/linq/perform-streaming-transform-large-xml-documents
I translated the code to VB.NET, and it works as described in the tutorial:
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()
        While reader.Read()
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Customer" Then
                While reader.Read()
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Name" Then
                        name = TryCast(XElement.ReadFrom(reader), XElement)
                        Exit While
                    End If
                End While
                While reader.Read()
                    If reader.NodeType = XmlNodeType.EndElement Then Exit While
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Item" Then
                        item = TryCast(XElement.ReadFrom(reader), XElement)
                        If item IsNot Nothing Then
                            Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                            tempRoot.Add(item)
                            Yield item
                        End If
                    End If
                End While
            End If
        End While
    End Using
End Function
Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement) =
        From el In StreamCustomerItem("https://www.example.com/source.xml")
        Select New XElement("Item",
                            New XElement("Customer", CStr(el.Parent.Element("Name"))),
                            New XElement(el.Element("Key")))
    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True
    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files") + "Output.xml", xws)
        xw.WriteStartElement("Root")
        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next
        xw.WriteEndElement()
    End Using
End Sub
The example above transforms source.xml into an output.xml, but all I want is to read the product nodes exactly as they are (no transformation needed), one node at a time, so I can process large XML files.
I tried to rewrite it to extract values from my XML following its structure. First, I tried simply getting something read from my XML file, like so:
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()
        While reader.Read()
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Id" Then
                name = TryCast(XElement.ReadFrom(reader), XElement)
                item = TryCast(XElement.ReadFrom(reader), XElement)
                If item IsNot Nothing Then
                    Dim tempRoot As XElement = New XElement("Root", New XElement(name))
                    tempRoot.Add(item)
                    Yield item
                End If
                Exit While
            End If
        End While
    End Using
End Function
Private Shared Sub Main()
    Dim srcTree As IEnumerable(Of XElement)
    srcTree = From el In StreamCustomerItem("https://www.example.com/mysource.xml")
              Select New XElement("product", New XElement("product", CStr(el.Parent.Element("Id"))))
    Dim xws As XmlWriterSettings = New XmlWriterSettings()
    xws.OmitXmlDeclaration = True
    xws.Indent = True
    Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files") + "Output.xml", xws)
        xw.WriteStartElement("Root")
        For Each el As XElement In srcTree
            el.WriteTo(xw)
        Next
        xw.WriteEndElement()
    End Using
End Sub
That just writes <Root /> to my output.xml, though.
mysource.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<products>
  <product>
    <Id>
      <![CDATA[122854]]>
    </Id>
    <Type>
      <![CDATA[restaurant]]>
    </Type>
    <features>
      <wifi>
        <![CDATA[included]]>
      </wifi>
    </features>
  </product>
</products>
So to summarize my question: how can I read individual product nodes as-is from "mysource.xml" without loading the full file into memory?
UPDATE 2
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        Dim name As XElement = Nothing
        Dim item As XElement = Nothing
        reader.MoveToContent()
        While Not reader.EOF
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                Dim el As XElement = TryCast(XElement.ReadFrom(reader), XElement)
                If el IsNot Nothing Then Yield el
            Else
                reader.Read()
            End If
        End While
    End Using
End Function
Private Shared Sub Main()
    ' StreamCustomerItem yields XElement (LINQ to XML), not XmlElement,
    ' and the query result is what the For Each iterates over.
    Dim elements As IEnumerable(Of XElement) = From el In StreamCustomerItem("source.xml") Select el
    For Each product As XElement In elements
        ' Here, process each `product` element.
        Console.WriteLine(product)
    Next
End Sub
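Once a product element is yielded this way, the plan is to read its values with plain LINQ to XML calls, along these lines (a sketch using the element names from mysource.xml above, assuming every product has Id, Type and features/wifi):
For Each product As XElement In StreamCustomerItem("https://www.example.com/mysource.xml")
    ' Only one <product> is in memory at a time; Trim() strips the whitespace around the CDATA values.
    Dim id As String = product.Element("Id").Value.Trim()
    Dim productType As String = product.Element("Type").Value.Trim()
    Dim wifi As String = product.Element("features").Element("wifi").Value.Trim()
    Console.WriteLine("{0} ({1}) wifi: {2}", id, productType, wifi)
Next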
My full test file is available via OnionShare (use Tor Browser to download):
http://jkntfybog2s5cc754sn7mujvyaawdqxd4q5imss66x3hsos34rrbjrid.onion
Key: YLTDQSDHTBWGDGQ6FIADTN2K7GFOFT5R7SFKWKTDER3WETD7EMKA
3 Answers
Did you check out this documentation from Microsoft yet? https://learn.microsoft.com/en-us/dotnet/standard/linq/stream-xml-fragments-xmlreader
I had a similar issue, but with reading a large JSON file. What I did there was read the token representing the start of a product and iterate over those tokens; that way you never load the entire file into memory. I believe the same approach can be used for XML as well.
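A minimal VB.NET sketch of that element-by-element idea, assuming the <product> layout of mysource.xml shown above (the function name and usage are illustrative, not tested against the real feed):
' Requires Imports System.Xml and Imports System.Xml.Linq.
Private Shared Iterator Function StreamProducts(ByVal uri As String) As IEnumerable(Of XElement)
    Using reader As XmlReader = XmlReader.Create(uri)
        ' Jump from one <product> start tag to the next; only the current
        ' product is ever materialized as an XElement.
        While reader.ReadToFollowing("product")
            Using productReader As XmlReader = reader.ReadSubtree()
                Yield XElement.Load(productReader)
            End Using
        End While
    End Using
End Function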
Hope it helps.
This is a bit of an old-school approach, but I usually keep track of the XPath address of where I am inside the XML file, then use that XPath to work out what to do with the value.
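A rough sketch of that path-tracking idea, assuming the mysource.xml layout above (names and output format are illustrative):
' Requires Imports System.Xml, Imports System.Linq and Imports System.Collections.Generic.
Private Shared Sub DumpValuesWithPath(ByVal uri As String)
    Dim path As New Stack(Of String)()
    Using reader As XmlReader = XmlReader.Create(uri)
        While reader.Read()
            Select Case reader.NodeType
                Case XmlNodeType.Element
                    path.Push(reader.Name)
                    ' Empty elements (e.g. <wifi/>) never produce an EndElement node.
                    If reader.IsEmptyElement Then path.Pop()
                Case XmlNodeType.EndElement
                    path.Pop()
                Case XmlNodeType.Text, XmlNodeType.CDATA
                    ' Prints e.g. "/products/product/features/wifi = included".
                    Dim currentPath As String = "/" & String.Join("/", path.Reverse())
                    Console.WriteLine("{0} = {1}", currentPath, reader.Value.Trim())
            End Select
        End While
    End Using
End Sub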
The important thing is to make sure you never load the whole file, but "stream" everything (in the general sense: bytes, characters, XML nodes, etc.) from end to end (i.e., server to client here).
For network bytes, it means you must use a raw Stream object. For XML nodes, it means you can use an XmlReader (not an XmlDocument, which loads a full document object model from a stream). In this case, you can use an XmlTextReader, which "Represents a reader that provides fast, non-cached, forward-only access to XML data".
Here is the idea in code (originally C#, but easily translated to VB.NET): stream the file, but still build a small intermediary XML document for each product in the multi-gigabyte file, using the XmlReader methods ReadInnerXml and/or ReadOuterXml.
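A minimal VB.NET sketch along those lines, assuming the <product> layout of mysource.xml above and the example URL from the question:
' Requires Imports System.IO, Imports System.Net and Imports System.Xml.
Private Shared Sub StreamProductsFromServer()
    Dim req As HttpWebRequest = DirectCast(WebRequest.Create("https://www.example.com/mysource.xml"), HttpWebRequest)
    Using resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
        ' Hand the raw response stream to the reader, so bytes are consumed as
        ' they arrive instead of being buffered into one huge string.
        Using stream As Stream = resp.GetResponseStream()
            Using reader As XmlReader = XmlReader.Create(stream)
                reader.MoveToContent()
                While Not reader.EOF
                    If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                        ' ReadOuterXml returns one <product>...</product> fragment as a
                        ' small string and advances the reader past it.
                        Dim fragment As String = reader.ReadOuterXml()
                        ' Process the fragment here (parse it, store it, etc.).
                        Console.WriteLine(fragment)
                    Else
                        reader.Read()
                    End If
                End While
            End Using
        End Using
    End Using
End Sub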
PS: Ideally, you could use async/await code with the Async counterpart methods ReadInnerXmlAsync / ReadOuterXmlAsync, but that is another story and is easy to set up.