skip to Main Content

I have a directory containing tens of thousands of csv files (that used to be HTML files but I wrote a different vba script that changed the extension to csv), that I want placed into one Main xls file (called Main Import File VBA). Given the huge number of csv files that I need to loop over, I want to do it as efficiently as possible (lowest amount of time necessary. If I can get my code a few percentages faster, that would already make a huge difference.

The csv files are all similar. Basically HTML files that now have a csv extension. All the data is in one column (although length of rows can differ). So I need to perform a similar task in that I only need to copy column A from every csv file.

I have written a code that opens each csv file and copies the content of column A and then (transpose) pastes it into the main import file (myMasterFile). It then closes the csv file and loops to the next file.

Sub PutInMasterFile()
    
    Dim wb          As Workbook
    Dim myMasterFile As String
    Dim masterWB    As Workbook
    Dim rowNum      As Integer
    Dim copyRange   As Range
    Dim pasteRange  As Range
    Dim myPath      As String
    Dim myFile      As String
    Dim FirstAddress As String
    Dim x           As Variant
    Dim C           As Variant
    
    Application.ScreenUpdating = FALSE
    Application.EnableEvents = FALSE
    Application.Calculation = xlCalculationManual
    Application.DisplayAlerts = FALSE
    
    x = 2
    myMasterFile = "C:S_AVBA TestMain Import File VBA.xlsm"
    'Workbooks("Main Import File VBA").Activate
    
    Set pasteRange = ActiveWorkbook.Sheets(1).Range("A" & x)
    
    myPath = "C:S_AVBA TestOutput"
    myFile = Dir(myPath & "*.csv")
    
    Start = Timer
    Do While myFile <> vbNullString
        
        'csv file opened here
        Workbooks.Open Filename:=myPath & myFile
        
        'selection made for copying the data
        With Workbooks(myFile).Sheets(1)
            
            Range("A1").Select
            Range(Selection, Selection.End(xlDown)).Select
            Selection.Copy
            pasteRange.PasteSpecial Paste:=xlPasteAll, Operation:=xlNone, SkipBlanks:= _
                                    False, Transpose:=True
            
            'Close the csv file
            Workbooks(myFile).Close
            myFile = Dir()
            
            x = x + 1
            Set pasteRange = ActiveWorkbook.Sheets(1).Range("A" & x)
            
        End With
    Loop
    
    Finish = Timer
    TotalTime = Finish - Start
    
    MsgBox "Total time is " & TotalTime & " seconds!"
    
End Sub

How can I optimize this script? I think using Select and Copy (for the csv files) is inefficient, but I’m not sure I can use the optimized manner since I want to transpose paste it into myMasterFile.

Also I wonder whether it’s really necessary to open (and close) each csv file. Is there a way to avoid this and get the content using a better manner?

EDIT:
Response to the remarks/questions from Taller and Tim Williams. I have tested the code for 25 csv files. It takes about 9.8 seconds to finish. When I leave out the copy/transpose paste part it takes about 6.5 seconds. So I guess considerably faster. I should make a change there right? (avoiding Select and Selection Copy).

As for how the csv files looks: hope this helps (it’s about 1600 rows long (but it can differ per file).
I have no experience with Powerquery/Parsing HTML, or the suggestion that you posted @Tim Williams (array flipping)
enter image description here

2

Answers


  1. I’d be interested if VBA is faster then just plain powerquery

    let Source = Folder.Files("C:temp12"),
    #"Filtered Rows" = Table.SelectRows(Source, each [Extension] = ".csv"),
    #"Merged Queries" = Table.AddColumn(#"Filtered Rows", "data", each Table.Transpose(Table.SelectColumns(Csv.Document([Content]),"Column1"))),
    #"Removed Other Columns" = Table.SelectColumns(#"Merged Queries",{"Name","data"}),
    ColumnsToExpand = List.Distinct(List.Combine(List.Transform(Table.Column(#"Removed Other Columns", "data"), each if _ is table then Table.ColumnNames(_) else {}))),
    #"Expanded" = Table.ExpandTableColumn(#"Removed Other Columns", "data",ColumnsToExpand ,ColumnsToExpand )
    in #"Expanded"
    
    Login or Signup to reply.
  2. Parsing the HTML is faster than opening the files in Excel.

    Approx 12sec for 100 files.

    Sub TestExtract()
        Const DATA_DIR As String = "C:TempHTML data"
        Dim doc As Object, tables As Object, table As Object, rows As Object
        Dim r As Long, wb As Workbook, i As Long, t, arr, f
        
        Set doc = CreateObject("htmlfile")
        
        Application.ScreenUpdating = False
        t = Timer
        f = Dir(DATA_DIR & "data*.html")
        Do While Len(f) > 0
            i = i + 1
            'Debug.Print f
            doc.Body.innerhtml = GetContent(DATA_DIR & f)
            Set tables = doc.Body.getElementsByTagName("table")
            If tables.Length > 0 Then
                Set rows = tables(0).getElementsByTagName("tr") 'get rows from first table
                ReDim arr(rows.Length - 1) 'size array for output
                For r = 1 To rows.Length   'get value from first cell in row
                    arr(r - 1) = rows(r - 1).getElementsByTagName("td")(0).innerText
                Next r
            End If
            'put the data in a row on the worksheet
            Sheet2.Cells(i, 1).Resize(1, UBound(arr) + 1).Value = arr
            f = Dir()
        Loop
        
        Debug.Print "Done", Timer - t
    End Sub
    
    Function GetContent(f As String) As String
        GetContent = CreateObject("scripting.filesystemobject"). _
                      OpenTextFile(f, 1).ReadAll()
    End Function
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search