在大型企业中,生成 Excel 报告已成为高效管理和分析大量数据集不可或缺的过程。这些报告对于跟踪绩效指标、财务记录和运营统计数据至关重要,可提供推动战略决策的宝贵见解。
在这样的环境中,生成这些文件的自动化工具在简化报告创建和确保准确性方面发挥着关键作用。随着我们进入 2024 年,生成 Excel 文件的能力应该是一项简单而常见的任务,对吧?
要使用您自己的数据集生成 Excel 文件,我们将使用OpenXML库。您应该做的第一件事是将此库安装到您的项目中:
dotnet add package DocumentFormat.OpenXml
安装必要的库并创建名为“Test.xlsx”的模板 Excel 文件后,我们将此代码添加到我们的应用程序中:
// this custom type is for your input data public class DataSet { public List<DataRow> Rows { get; set; } } // this row will contain number of our row and info about each cell public class DataRow { public int Index { get; set; } public Dictionary<string, string> Cells { get; set; } } private void SetValuesToExcel(string filePath, DataSet dataSet) { if (string.IsNullOrWhiteSpace(filePath)) { throw new FileNotFoundException($"File not found at this path: {filePath}"); } using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true)) { //each excel document has XML-structure, //so we need to go deeper to our sheet WorkbookPart wbPart = document.WorkbookPart; //feel free to pass sheet name as parameter. //here we'll just use the default one Sheet theSheet = wbPart.Workbook .Descendants<Sheet>() .FirstOrDefault(s => s.Name.Value.Trim() == "Sheet1"); //next element in hierarchy is worksheetpart //we need to dive deeper to SheetData object WorksheetPart wsPart = (WorksheetPart)(wbPart.GetPartById(theSheet.Id)); Worksheet worksheet = wsPart.Worksheet; SheetData sheetData = worksheet.GetFirstChild<SheetData>(); //iterating through our data foreach (var dataRow in dataSet.Rows) { //getting Row element from Excel's DOM var rowIndex = dataRow.Index; var row = sheetData .Elements<Row>() .FirstOrDefault(r => r.RowIndex == rowIndex); //if there is no row - we'll create new one if (row == null) { row = new Row { RowIndex = (uint)rowIndex }; sheetData.Append(row); } //now we need to iterate though each cell in the row foreach (var dataCell in dataRow.Cells) { var cell = row.Elements<Cell>() .FirstOrDefault(c => c.CellReference.Value == dataCell.Key); if (cell == null) { cell = new Cell { CellReference = dataCell.Key, DataType = CellValues.String }; row.AppendChild(cell); } cell.CellValue = new CellValue(dataCell.Value); } } //after all changes in Excel DOM we need to save it wbPart.Workbook.Save(); } }
这就是如何使用上面的代码:
var filePath = "Test.xlsx"; // number of rows that we want to add to our Excel file var testRowsCounter = 100; // creating some data for it var dataSet = new DataSet(); dataSet.Rows = new List<DataRow>(); string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; for (int i = 0; i < testRowsCounter; i++) { var row = new DataRow { Cells = new Dictionary<string, string>(), Index = i + 1 }; for (int j = 0; j < 10; j++) { row.Cells.Add($"{alphabet[j]}{i+1}", Guid.NewGuid().ToString()); } dataSet.Rows.Add(row); } //passing path to our file and data object SetValuesToExcel(filePath, dataSet);
指标
行数 | 处理时间 | 已获得内存 (MB) |
---|---|---|
100 | 454毫秒 | 21 MB |
10 000 | 2.92秒 | 132 MB |
100 000 | 10分47秒270毫秒 | 333 MB |
在此表中,我们尝试使用各种行数来测试我们的函数。正如预期的那样 - 增加行数将导致性能下降。为了解决这个问题,我们可以尝试另一种方法。
上面演示的方法很简单,对于小数据集来说已经足够了。但是,如表所示,处理大数据集可能会非常慢。此方法涉及 DOM 操作,而 DOM 操作本身就很慢。在这种情况下,SAX(XML 简单 API)方法就变得非常有价值。顾名思义,SAX 允许我们直接处理 Excel 文档的 XML,从而为处理大数据集提供了更高效的解决方案。
将第一个示例中的代码更改为:
using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true)) { WorkbookPart workbookPart = document.WorkbookPart; //we taking the original worksheetpart of our template WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); //adding the new one WorksheetPart replacementPart = workbookPart.AddNewPart<WorksheetPart>(); string originalSheetId = workbookPart.GetIdOfPart(worksheetPart); string replacementPartId = workbookPart.GetIdOfPart(replacementPart); //the main idea is read through XML of original sheet object OpenXmlReader openXmlReader = OpenXmlReader.Create(worksheetPart); //and write it to the new one with some injection of our custom data OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(replacementPart); while (openXmlReader.Read()) { if (openXmlReader.ElementType == typeof(SheetData)) { if (openXmlReader.IsEndElement) continue; // write sheet element openXmlWriter.WriteStartElement(new SheetData()); // write data rows foreach (var row in dataSet.Rows) { Row r = new Row { RowIndex = (uint)row.Index }; // start row openXmlWriter.WriteStartElement(r); foreach (var rowCell in row.Cells) { Cell c = new Cell { DataType = CellValues.String, CellReference = rowCell.Key, CellValue = new CellValue(rowCell.Value) }; // cell openXmlWriter.WriteElement(c); } // end row openXmlWriter.WriteEndElement(); } // end sheet openXmlWriter.WriteEndElement(); } else { //this block is for writing all not so interesting parts of XML //but they are still are necessary if (openXmlReader.ElementType == typeof(Row) && openXmlReader.ElementType == typeof(Cell) && openXmlReader.ElementType == typeof(CellValue)) { openXmlReader.ReadNextSibling(); continue; } if (openXmlReader.IsStartElement) { openXmlWriter.WriteStartElement(openXmlReader); } else if (openXmlReader.IsEndElement) { openXmlWriter.WriteEndElement(); } } } openXmlReader.Close(); openXmlWriter.Close(); //after all modifications we switch sheets inserting //the new one to the original file Sheet sheet = workbookPart.Workbook .Descendants<Sheet>() .First(c => c.Id == originalSheetId); sheet.Id.Value = replacementPartId; //deleting the original worksheet workbookPart.DeletePart(worksheetPart); }
说明:此代码逐个从源 Excel 文件中读取 XML 元素,并将其元素复制到新工作表。对数据进行一些操作后,它会删除旧工作表并保存新工作表。
指标
行数 | 处理时间 | 已获得内存 (MB) |
---|---|---|
100 | 414毫秒 | 22 MB |
10 000 | 961毫秒 | 87 MB |
100 000 | 3秒488毫秒 | 492 MB |
1 000 000 | 30秒224毫秒 | 超过 4.5 GB |
如您所见,处理大量行的速度显著提高。但是,我们现在有一个需要解决的内存问题。
有眼光的观察者可能会注意到,在 Excel 中处理 1000 万个单元格时,内存消耗会意外激增。虽然 100 万个字符串的权重相当大,但它不应该导致如此大幅的增长。经过对内存分析器的细致调查,罪魁祸首被确定在 OpenXML 库中。
具体来说,根本原因可以追溯到 .NET 包 System.IO.Packaging 中的一个缺陷,该缺陷影响了 .NET Standard 和 .NET Core 版本。有趣的是,这个问题似乎在经典 .NET 中不存在,可能是由于底层 Windows Base 代码的差异。简而言之,OpenXML 库在其中使用了 ZipArchive,每次更新文件时它都会复制 MemoryStream 中的数据。
仅当您以更新模式打开它时才会发生这种情况,但是您无法以其他方式执行此操作,因为这是.NET 本身的行为。
对于那些有兴趣深入研究这个问题的人,可以在GitHub 问题 #23750中找到更多详细信息。
随后,在仔细研究了 .NET 源代码并咨询了面临类似挑战的同行后,我设计了一个解决方案。如果我们不能在打开模式下使用 SpreadsheetDocument 对象处理我们的 Excel 文件 - 让我们在创建模式下使用我们自己的 Package 对象。它不会在后台使用有缺陷的 ZipArchive,并且会正常工作。
(警告:此代码现在仅适用于 OpenXML v.2.19.0 及更早版本)。
将我们的代码改为这样:
public class Builder { public async Task Build(string filePath, string sheetName, DataSet dataSet) { var workbookId = await FillData(filePath, sheetName, dataSet); await WriteAdditionalElements(filePath, sheetName, workbookId); } public async Task<string> FillData(string filePath, string sheetName, DataSet excelDataRows) { //opening our file in create mode await using var fileStream = File.Create(filePath); using var package = Package.Open(fileStream, FileMode.Create, FileAccess.Write); using var excel = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook); //adding new workbookpart excel.AddWorkbookPart(); var worksheetPart = excel.WorkbookPart.AddNewPart<WorksheetPart>(); var workbookId = excel.WorkbookPart.GetIdOfPart(worksheetPart); //creating necessary worksheet and sheetdata OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(worksheetPart); openXmlWriter.WriteStartElement(new Worksheet()); openXmlWriter.WriteStartElement(new SheetData()); // write data rows foreach (var row in excelDataRows.Rows.OrderBy(r => r.Index)) { Row r = new Row { RowIndex = (uint)row.Index }; openXmlWriter.WriteStartElement(r); foreach (var rowCell in row.Cells) { Cell c = new Cell { DataType = CellValues.String, CellReference = rowCell.Key }; //cell openXmlWriter.WriteStartElement(c); CellValue v = new CellValue(rowCell.Value); openXmlWriter.WriteElement(v); //cell end openXmlWriter.WriteEndElement(); } // end row openXmlWriter.WriteEndElement(); } //sheetdata end openXmlWriter.WriteEndElement(); //worksheet end openXmlWriter.WriteEndElement(); openXmlWriter.Close(); return workbookId; } public async Task WriteAdditionalElements(string filePath, string sheetName, string worksheetPartId) { //here we should add our workbook to the file //without this - our document will be incomplete await using var fileStream = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None); using var package = Package.Open(fileStream, FileMode.Open, FileAccess.ReadWrite); using var excel = SpreadsheetDocument.Open(package); if (excel.WorkbookPart is null) throw new InvalidOperationException("Workbook part cannot be null!"); var xmlWriter = OpenXmlWriter.Create(excel.WorkbookPart); xmlWriter.WriteStartElement(new Workbook()); xmlWriter.WriteStartElement(new Sheets()); xmlWriter.WriteElement(new Sheet { Id = worksheetPartId, Name = sheetName, SheetId = 1 }); xmlWriter.WriteEndElement(); xmlWriter.WriteEndElement(); xmlWriter.Close(); xmlWriter.Dispose(); } }
像这样使用它:
var builder = new Builder(); await builder.Build(filePath, "Sheet1", dataSet);
指标
行数 | 处理时间 | 已获得内存 (MB) |
---|---|---|
100 | 291毫秒 | 18 MB |
10 000 | 940毫秒 | 62 MB |
100 000 | 3秒 767毫秒 | 297 MB |
1 000 000 | 31秒354毫秒 | 2.7 GB |
现在,与最初的测量结果相比,我们的测量结果看起来令人满意。
首先,展示的代码纯粹用于演示目的。在实际应用中,应考虑其他功能,例如支持各种单元格类型或复制单元格样式。尽管上例中展示了显著的优化,但它在实际场景中的直接应用可能不可行。通常,对于处理大型 Excel 文件,基于块的方法更合适。
附言:如果您不想深入研究生成办公文档的复杂性,欢迎探索我的 NuGet包,它可以简化并无缝集成所有这些功能。
特色图片由 vecstock 在 Freepik 上提供