
How to Generate Large Datasets in .NET for Excel With OpenXML

by Artem Rudiakov, June 21st, 2024

Too Long; Didn't Read

Generating Excel reports is essential for managing extensive datasets in large enterprises and supports strategic decision-making. The common DOM-based approach with OpenXML is straightforward for small datasets but slows down significantly with larger ones. Switching to the SAX method improves processing speed but can lead to memory issues. The unexpected memory leaks stem from a flaw in .NET's System.IO.Packaging. A workaround that uses a custom Package object mitigates this issue and reduces memory consumption. For practical use, consider chunk-based processing or a dedicated NuGet package for generating office documents efficiently.


Importance of Excel Reporting
Common Approach to Generating Excel Files
Passing Large Datasets in Excel
Unexpected Memory Leaks: Unraveling the Enigma
Final Thoughts

Importance of Excel Reporting

In large enterprise companies, generating Excel reports has become an indispensable process for managing and analyzing extensive datasets efficiently. These reports are crucial for tracking performance metrics, financial records, and operational statistics, offering valuable insights that drive strategic decision-making.

In such environments, automation tools that generate these files play a pivotal role in streamlining report creation and ensuring accuracy. As we advance into 2024, the ability to generate Excel files should be an easy and common task, right?

Common Approach to Generating Excel Files

To generate an Excel file with your own dataset, we will use the OpenXML library. The first thing you should do is install this library into your project:

dotnet add package DocumentFormat.OpenXml

After installing the library and creating a template Excel file named “Test.xlsx,” add this code to your application:

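using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// this custom type is for your input data
public class DataSet
{
    public List<DataRow> Rows { get; set; }
}
// each row holds its row number and its cells (cell reference -> value)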
public class DataRow
{
    public int Index { get; set; }

    public Dictionary<string, string> Cells { get; set; }
}

private void SetValuesToExcel(string filePath, DataSet dataSet)
{
    // the template must already exist, because we open it for editing
    if (string.IsNullOrWhiteSpace(filePath) || !File.Exists(filePath))
    {
        throw new FileNotFoundException($"File not found at this path: {filePath}");
    }

    using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
    {
        //each excel document has XML-structure, 
        //so we need to go deeper to our sheet
        WorkbookPart wbPart = document.WorkbookPart;
        //feel free to pass sheet name as parameter. 
        //here we'll just use the default one
        Sheet theSheet = wbPart.Workbook
                            .Descendants<Sheet>()
                            .FirstOrDefault(s => s.Name.Value.Trim() == "Sheet1");
        //next element in hierarchy is worksheetpart
        //we need to dive deeper to SheetData object                    
        WorksheetPart wsPart = (WorksheetPart)(wbPart.GetPartById(theSheet.Id));
        Worksheet worksheet = wsPart.Worksheet;
        SheetData sheetData = worksheet.GetFirstChild<SheetData>();
        
        //iterating through our data
        foreach (var dataRow in dataSet.Rows)
        {
            //getting Row element from Excel's DOM
            var rowIndex = dataRow.Index;
            var row = sheetData
                        .Elements<Row>()
                        .FirstOrDefault(r => r.RowIndex == rowIndex);
            //if there is no row - we'll create new one
            if (row == null)
            {
                row = new Row { RowIndex = (uint)rowIndex };
                sheetData.Append(row);
            }
            
            //now we need to iterate through each cell in the row
            foreach (var dataCell in dataRow.Cells)
            {
                var cell = row.Elements<Cell>()
                .FirstOrDefault(c => c.CellReference.Value == dataCell.Key);
        
                if (cell == null)
                {
                    cell = new Cell 
                    { 
                      CellReference = dataCell.Key, 
                      DataType = CellValues.String 
                    };
                    row.AppendChild(cell);
                }
        
                cell.CellValue = new CellValue(dataCell.Value);
            }
        }
        //after all changes in Excel DOM we need to save it;
        //the changes were made in the worksheet, so that is the part we save
        worksheet.Save();
    }
}

Here is how to use the code above:

var filePath = "Test.xlsx";
// number of rows that we want to add to our Excel file
var testRowsCounter = 100;
// creating some data for it
var dataSet = new DataSet();
dataSet.Rows = new List<DataRow>();
string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
for (int i = 0; i < testRowsCounter; i++)
{
    var row = new DataRow 
    { 
      Cells = new Dictionary<string, string>(), Index = i + 1 
    };
    for (int j = 0; j < 10; j++)
    {
        row.Cells.Add($"{alphabet[j]}{i+1}", Guid.NewGuid().ToString());
    }
    dataSet.Rows.Add(row);
}
//passing path to our file and data object
SetValuesToExcel(filePath, dataSet);
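
The sample above builds cell references from a 26-letter alphabet, which is enough for its 10 columns. For wider sheets, a minimal sketch of converting a zero-based column index into an Excel column name could look like this. The class and method names (`CellReferences`, `ColumnName`) are hypothetical and not part of the OpenXML library:

```csharp
using System;

// Hypothetical helper, not part of the OpenXML SDK.
public static class CellReferences
{
    // Converts a zero-based column index to an Excel column name:
    // 0 -> "A", 25 -> "Z", 26 -> "AA", and so on.
    public static string ColumnName(int index)
    {
        if (index < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }

        string name = string.Empty;
        // Column names are a base-26 system without a zero digit,
        // so we work with a one-based number and subtract 1 at each step.
        for (int n = index + 1; n > 0; n = (n - 1) / 26)
        {
            name = (char)('A' + (n - 1) % 26) + name;
        }
        return name;
    }
}
```

With this helper, the key in the loop above could be written as `$"{CellReferences.ColumnName(j)}{i + 1}"` instead of indexing into the alphabet string.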

Metrics

Count of rows	Time to process	Memory gained
100	454ms	21 MB
10 000	2.92s	132 MB
100 000	10min 47s 270ms	333 MB
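For this table, we tested our function with various numbers of rows. As expected, performance degrades as the number of rows grows, and it degrades much faster than linearly: going from 10,000 to 100,000 rows takes more than 200 times longer. One reason is visible in the code itself: for every incoming row, FirstOrDefault scans the rows that are already in the sheet. To fix that, we can try another approach.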
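
As a side note, the repeated row scan in the DOM version can be avoided without leaving the DOM approach. The following is a minimal sketch, assuming the same `SheetData` object as in the code above; `RowIndexCache` and `GetOrAddRow` are hypothetical names, and I have not measured how much of the slowdown this removes:

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Spreadsheet;

// Hypothetical helper (not part of the OpenXML SDK): looks rows up by index
// in a dictionary instead of rescanning SheetData for every data row.
public sealed class RowIndexCache
{
    private readonly SheetData _sheetData;
    private readonly Dictionary<uint, Row> _rows = new Dictionary<uint, Row>();

    public RowIndexCache(SheetData sheetData)
    {
        _sheetData = sheetData;
        // One pass over the rows that already exist in the template.
        foreach (Row row in sheetData.Elements<Row>())
        {
            if (row.RowIndex != null)
            {
                _rows[row.RowIndex.Value] = row;
            }
        }
    }

    // Returns the existing row with this index, or appends a new one.
    // Assumes new rows arrive in ascending order, as in the sample above,
    // so appending at the end keeps the sheet sorted.
    public Row GetOrAddRow(uint rowIndex)
    {
        if (_rows.TryGetValue(rowIndex, out var existing))
        {
            return existing;
        }

        var created = new Row { RowIndex = rowIndex };
        _sheetData.Append(created);
        _rows[rowIndex] = created;
        return created;
    }
}
```

In `SetValuesToExcel`, the cache would be created once before the loop, and `cache.GetOrAddRow((uint)dataRow.Index)` would replace the `FirstOrDefault` lookup. This only addresses the row lookup; the whole document is still held in memory as a DOM tree.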

Passing Large Datasets in Excel

The approach demonstrated above is straightforward and sufficient for small datasets. However, as the table shows, processing large datasets can be very slow. This method relies on DOM manipulations, which are inherently slow. In such cases, the SAX (Simple API for XML) approach becomes invaluable. SAX lets us read and write the XML of the Excel document element by element, as a stream, instead of building and searching a DOM tree, which makes it a more efficient solution for handling large datasets.

Change the code from the first example to this:

using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
{
    WorkbookPart workbookPart = document.WorkbookPart;
    //we take the original worksheetpart of our template
    WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
    //adding the new one
    WorksheetPart replacementPart = workbookPart.AddNewPart<WorksheetPart>();

    string originalSheetId = workbookPart.GetIdOfPart(worksheetPart);
    string replacementPartId = workbookPart.GetIdOfPart(replacementPart);
    
    //the main idea is read through XML of original sheet object
    OpenXmlReader openXmlReader = OpenXmlReader.Create(worksheetPart);
    //and write it to the new one with some injection of our custom data
    OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(replacementPart);

    while (openXmlReader.Read())
    {
        if (openXmlReader.ElementType == typeof(SheetData))
        {
            if (openXmlReader.IsEndElement)
                continue;

            // write sheet element
            openXmlWriter.WriteStartElement(new SheetData());

            // write data rows
            foreach (var row in dataSet.Rows)
            {
                Row r = new Row
                {
                    RowIndex = (uint)row.Index
                };

                // start row
                openXmlWriter.WriteStartElement(r);

                foreach (var rowCell in row.Cells)
                {
                    Cell c = new Cell
                    {
                        DataType = CellValues.String,
                        CellReference = rowCell.Key,
                        CellValue = new CellValue(rowCell.Value)
                    };

                    // cell
                    openXmlWriter.WriteElement(c);
                }

                // end row
                openXmlWriter.WriteEndElement();
            }

            // end sheet
            openXmlWriter.WriteEndElement();
        }
        else
        {
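            //this block copies all the other parts of the XML,
            //which are not so interesting but still necessary.
            //rows and cells that already exist in the template are skipped,
            //because our data replaces them
            //(the conditions must be joined with ||: an element cannot be
            //a Row, a Cell and a CellValue at the same time)
            if (openXmlReader.ElementType == typeof(Row)
                || openXmlReader.ElementType == typeof(Cell)
                || openXmlReader.ElementType == typeof(CellValue))
            {
                openXmlReader.ReadNextSibling();
                continue;
            }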

            if (openXmlReader.IsStartElement)
            {
                openXmlWriter.WriteStartElement(openXmlReader);
            }
            else if (openXmlReader.IsEndElement)
            {
                openXmlWriter.WriteEndElement();
            }
        }
    }

    openXmlReader.Close();
    openXmlWriter.Close();
    //after all modifications we switch sheets inserting 
    //the new one to the original file
    Sheet sheet = workbookPart.Workbook
        .Descendants<Sheet>()
        .First(c => c.Id == originalSheetId);

    sheet.Id.Value = replacementPartId;
    
    //deleting the original worksheet
    workbookPart.DeletePart(worksheetPart);
}

Explanation: this code reads the XML elements of the original worksheet one by one and copies them to a new worksheet part. When it reaches the SheetData element, it writes our rows instead. Finally, it points the sheet at the new part and deletes the original one.

Metrics

Count of rows	Time to process	Memory gained
100	414ms	22 MB
10 000	961ms	87 MB
100 000	3s 488ms	492 MB
1 000 000	30s 224ms	over 4.5 GB

As you can see, the speed of processing a large number of rows has significantly increased. However, we now have a memory issue that we need to address.

Unexpected Memory Leaks: Unraveling the Enigma

A discerning observer might have noticed an unexpected surge in memory consumption while processing 1 million rows, that is, 10 million cells. Although 10 million strings carry considerable weight, they shouldn't account for such a substantial increase. After meticulous investigation with memory profilers, the culprit was identified within the OpenXML library.

Specifically, the root cause can be traced to a flaw in the .NET package System.IO.Packaging, affecting both .NET Standard and .NET Core versions. Interestingly, this issue seems absent in the classic .NET Framework, likely due to differences in the underlying WindowsBase code. In short, the OpenXML library relies on ZipArchive, which copies data into a MemoryStream every time you update the file.

This happens only when the file is opened in update mode, but there is no other way to edit an existing file, because that is the behavior of .NET itself.

For those interested in delving deeper into this issue, further details can be found at GitHub Issue #23750.
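
To make the difference between the two modes concrete, here is a minimal sketch that uses `System.IO.Compression.ZipArchive` directly. It is not the OpenXML code path itself, and `ZipModeDemo` with its methods is a hypothetical name. According to the .NET documentation, in `ZipArchiveMode.Create` entries are written straight to the underlying stream, while in `ZipArchiveMode.Update` the content is held in memory and written only when the archive is disposed:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical demo class; illustrates ZipArchive modes only.
public static class ZipModeDemo
{
    // Create mode: the entry is compressed directly into the target stream.
    public static byte[] WriteInCreateMode(string entryName, string content)
    {
        using var target = new MemoryStream();
        using (var archive = new ZipArchive(target, ZipArchiveMode.Create, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.CreateEntry(entryName);
            using Stream stream = entry.Open();
            byte[] data = Encoding.UTF8.GetBytes(content);
            stream.Write(data, 0, data.Length);
        }
        return target.ToArray();
    }

    // Update mode: the stream must support reading, writing and seeking,
    // and the entry content is buffered in memory until the archive is disposed.
    public static byte[] AppendInUpdateMode(byte[] zipBytes, string entryName, string extra)
    {
        using var target = new MemoryStream();
        target.Write(zipBytes, 0, zipBytes.Length);
        target.Position = 0;
        using (var archive = new ZipArchive(target, ZipArchiveMode.Update, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName)
                ?? throw new InvalidOperationException("Entry not found.");
            using Stream stream = entry.Open();
            stream.Seek(0, SeekOrigin.End);
            byte[] data = Encoding.UTF8.GetBytes(extra);
            stream.Write(data, 0, data.Length);
        }
        return target.ToArray();
    }

    // Reads an entry back as text, to check the result.
    public static string ReadEntry(byte[] zipBytes, string entryName)
    {
        using var source = new MemoryStream(zipBytes);
        using var archive = new ZipArchive(source, ZipArchiveMode.Read);
        ZipArchiveEntry entry = archive.GetEntry(entryName)
            ?? throw new InvalidOperationException("Entry not found.");
        using var reader = new StreamReader(entry.Open(), Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

Both methods produce a valid archive. The difference is where the entry data lives while it is being written, which matters once a single worksheet part holds millions of cells.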

After poring over the .NET source code and consulting peers facing similar challenges, I devised a workaround. If we can’t use the SpreadsheetDocument object to work with our Excel file in Open mode, let’s use it in Create mode with our own Package object. This avoids the buggy ZipArchive update behavior under the hood and works as it should.

(Warning: at the moment, this code works only with OpenXML v2.19.0 and earlier.)

Change our code to this:

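using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Threading.Tasks;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

public class Builder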
{
    public async Task Build(string filePath, string sheetName, DataSet dataSet)
    {
        //FillData returns the relationship id of the worksheet part it created
        var worksheetPartId = await FillData(filePath, sheetName, dataSet);
        await WriteAdditionalElements(filePath, sheetName, worksheetPartId);
    }


    public async Task<string> FillData(string filePath, 
                                       string sheetName, DataSet excelDataRows)
    {
        //opening our file in create mode
        await using var fileStream = File.Create(filePath);
        using var package = Package.Open(fileStream, FileMode.Create, FileAccess.Write);
        using var excel = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook);
        
        //adding new workbookpart
        excel.AddWorkbookPart();
        var worksheetPart = excel.WorkbookPart.AddNewPart<WorksheetPart>();
        var workbookId = excel.WorkbookPart.GetIdOfPart(worksheetPart);
        
        //creating necessary worksheet and sheetdata
        OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(worksheetPart);
        openXmlWriter.WriteStartElement(new Worksheet());
        openXmlWriter.WriteStartElement(new SheetData());

        // write data rows
        foreach (var row in excelDataRows.Rows.OrderBy(r => r.Index))
        {
            Row r = new Row
            {
                RowIndex = (uint)row.Index
            };

            openXmlWriter.WriteStartElement(r);

            foreach (var rowCell in row.Cells)
            {
                Cell c = new Cell
                {
                    DataType = CellValues.String,
                    CellReference = rowCell.Key
                };
                //cell
                openXmlWriter.WriteStartElement(c);

                CellValue v = new CellValue(rowCell.Value);
                openXmlWriter.WriteElement(v);
                
                //cell end
                openXmlWriter.WriteEndElement();
            }

            // end row
            openXmlWriter.WriteEndElement();
        }
        //sheetdata end
        openXmlWriter.WriteEndElement();
        //worksheet end
        openXmlWriter.WriteEndElement();

        openXmlWriter.Close();

        return workbookId;
    }

    public async Task WriteAdditionalElements(string filePath, string sheetName, string worksheetPartId)
    {
        //here we should add our workbook to the file
        //without this - our document will be incomplete
        await using var fileStream = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
        using var package = Package.Open(fileStream, FileMode.Open, FileAccess.ReadWrite);
        using var excel = SpreadsheetDocument.Open(package);

        if (excel.WorkbookPart is null)
            throw new InvalidOperationException("Workbook part cannot be null!");

        var xmlWriter = OpenXmlWriter.Create(excel.WorkbookPart);
        xmlWriter.WriteStartElement(new Workbook());
        xmlWriter.WriteStartElement(new Sheets());

        xmlWriter.WriteElement(new Sheet { Id = worksheetPartId, Name = sheetName, SheetId = 1 });
        xmlWriter.WriteEndElement();
        xmlWriter.WriteEndElement();

        xmlWriter.Close();
        xmlWriter.Dispose();
    }
}

And use it like this:

var builder = new Builder();
await builder.Build(filePath, "Sheet1", dataSet);

Metrics

Count of rows	Time to process	Memory gained
100	291ms	18 MB
10 000	940ms	62 MB
100 000	3s 767ms	297 MB
1 000 000	31s 354ms	2.7 GB

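Now our measurements look much better than the initial ones: processing time stays about the same as in the previous SAX version, while memory for 1 000 000 rows drops from over 4.5 GB to 2.7 GB.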

Final Thoughts

First of all, the code shown here is purely for demonstration. In practical applications, additional features such as support for various cell types or the replication of cell styles should be considered. Despite the significant optimizations demonstrated in the previous example, applying it directly in real-world scenarios may not be feasible. Typically, for handling large Excel files, a chunk-based approach is more suitable.
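
The article does not prescribe a particular chunking scheme, so the following is only one possible reading: instead of materializing the whole DataSet up front, rows are produced lazily and handed to the writer in fixed-size batches. `Chunking`, `RowNumbers` and `InChunks` are hypothetical names, and the row source stands in for a real one such as a database query:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating a chunk-based pipeline.
public static class Chunking
{
    // Lazily produces row numbers 1..count; a stand-in for a real data source.
    public static IEnumerable<int> RowNumbers(int count)
    {
        for (int i = 1; i <= count; i++)
        {
            yield return i;
        }
    }

    // Splits any sequence into lists of at most chunkSize items,
    // so only one chunk is built in memory at a time.
    public static IEnumerable<List<T>> InChunks<T>(IEnumerable<T> source, int chunkSize)
    {
        if (chunkSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        }

        var chunk = new List<T>(chunkSize);
        foreach (T item in source)
        {
            chunk.Add(item);
            if (chunk.Count == chunkSize)
            {
                yield return chunk;
                chunk = new List<T>(chunkSize);
            }
        }

        // The last chunk may be smaller than chunkSize.
        if (chunk.Count > 0)
        {
            yield return chunk;
        }
    }
}
```

Each chunk could then be written with the same OpenXmlWriter loop used in FillData before the next chunk is fetched, so the input data never has to be fully resident in memory.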

P.S.: If you prefer to avoid delving into the intricacies of generating office documents, you're welcome to explore my NuGet package, which simplifies and integrates all these functionalities seamlessly.
