I need help in working with big Excel Tables.
Description
I have an export of Data from our ERP System that has 400K Rows at least.
In this report the format is quite messed up and I want to write a script that will clean up all the data.
I started to write little sub just to delete empty rows and such that have a special behavior.
please see below:
Sub Main()
OptimizeVBA (True)
DeleteLastRows
OptimizeVBA (False)
End Sub
Sub DeleteLastRows()
'Achtung, diese Funktion dauert sehr lange
Dim total
total = ActiveSheet.UsedRange.Rows.Count
Dim Tim1 As Single
Tim1 = Timer
For i = total To total - 100 Step -1
If ThatSpecialLine("0", i, 1, 9) Then
'DeleteRow (i)
Rows(i).EntireRow.Delete
ElseIf EmptyRow(i, 1, 13) Then
'DeleteRow (i)
Rows(i).EntireRow.Delete
End If
Next
Tim1 = Timer - Tim1
MsgBox ("Anzahl der Zeilen nach der Bearbeitung: " & ActiveSheet.UsedRange.Rows.Count & vbNewLine & "Dafür wurde: " & Tim1 & " gebraucht")
End Sub
Function EmptyRow(ByVal Row As Long, ByVal startc As Integer, ByVal EndC As Integer) As Boolean
EmptyRow = True
Dim temp As String
For i = startc To EndC
temp = Cells(Row, i).Value
temp = Trim(temp)
If temp <> "" Then
EmptyRow = False
Exit Function
End If
Next
End Function
Function ThatSpecialLine(ByVal val As String, ByVal Row As Long, ByVal startc As Integer, ByVal EndC As Integer) As Boolean
ThatSpecialLine = False
If EmptyRow(Row, startc, EndC) Then
If Cells(Row, EndC + 1).Value = val Then
ThatSpecialLine = True
End If
End If
End Function
Sub OptimizeVBA(isOn As Boolean)
Application.Calculation = IIf(isOn, xlCalculationManual, xlCalculationAutomatic)
Application.EnableEvents = Not (isOn)
Application.ScreenUpdating = Not (isOn)
ActiveSheet.DisplayPageBreaks = Not (isOn)
End Sub
This code needs about 14 seconds to execute for just 100 lines.
I am wondering why the performance is so bad.
I have no experience in making an application performance optimizing so please be kind if my question is very stupid :).
Questions
- Would it be better / faster to export this .xlsx file to .txt file and process with a programm i write in Visual studio with vb.net or C#? this would be my next idea.
- How to improve my vba code?
Would it be better / faster to export this .xlsx file to .txt file and process with a programm i write in Visual studio with vb.net or C#? this would be my next idea.
Thanks in advance
2
Answers
The Solution was to make a quick program in VB.Net with just reading the needed Lines. Also i made some improvements to the code.
The following code needs just 1 sec to read the File in the List(of string()) and to write it back again to .csv
I don't think i will use vba again for big data. Feel free to change my mind.
There are 2 things in your code that makes the execution slow.
The first thing has something to do with Excel vs VBA. Every time your VBA code needs something from Excel, it has to call an internal interface and that is rather slow. You can’t measure this when you have a sheet with a few rows/columns, but in a sheet with 400k rows and (at least) 13 columns of data, you have 5 millions cells, and your code reads most of them 2 times. This can be speed up by reading large chunks of data into an array. This is only one read and for that 5 million cells it’s a matter of maybe a second.
The second thing is pure Excel: Deleting a row of data from a worksheet is painfully slow, even if you switch off recalculation and screen update. That means, you should decrease the number of deletes by "collecting" rows to be deleted into a Range variable and then delete them all at once. However, the number of rows collected shouldn’t bee too high. I experimented a little bit and 1000 seemed to be reasonable.
I adapted your helper routine so that they work on the array of data which is passed as argument:
That code took more or less 1s for 1000 rows that where to be deleted – my example sheet had approx 30% of such rows. That would lead to a runtime in the range of few minutes.
But there is a much faster attempt, assuming that you are only interested in the data, not in formatting. Instead of deleting rows in the Excel sheet, copy the data you want to keep in a second array. When done, delete all data of your sheet and write the copied data back to Excel. This took maybe 2 or 3 seconds in my example sheet with > 800k rows: