How to Generate Large Data Sets in .NET for Excel with OpenXML

Importance of Excel Reports
A Common Approach to Generating Excel Files
Passing Large Data Sets to Excel
Unexpected Memory Leaks: Unraveling the Enigma
Final Thoughts

Importance of Excel Reports

In large companies, generating Excel reports has become an indispensable process for managing and analyzing extensive data sets efficiently. These reports are crucial for tracking performance metrics, financial records, and operational statistics, and they provide valuable insight that drives strategic decision-making.

In such environments, the automation tools that generate these files play a key role in speeding up report creation and ensuring accuracy. As we head into 2024, generating Excel files should be an easy, routine task, right?

A Common Approach to Generating Excel Files

To generate an Excel file from your own data set, we'll use the OpenXML library. The first step is to install it in your project:

 dotnet add package DocumentFormat.OpenXml

After installing the library and creating an Excel template file named "Test.xlsx", we add this code to our application:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Spreadsheet;

    // this custom type is for your input data
    public class DataSet
    {
        public List<DataRow> Rows { get; set; }
    }

    // each row holds its row number and its cell values, keyed by cell reference
    public class DataRow
    {
        public int Index { get; set; }
        public Dictionary<string, string> Cells { get; set; }
    }

    private void SetValuesToExcel(string filePath, DataSet dataSet)
    {
        if (string.IsNullOrWhiteSpace(filePath))
        {
            throw new ArgumentException("File path must not be empty.", nameof(filePath));
        }

        using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
        {
            // each Excel document has an XML structure,
            // so we need to go deeper to reach our sheet
            WorkbookPart wbPart = document.WorkbookPart;

            // feel free to pass the sheet name as a parameter;
            // here we just use the default one (First throws if it is missing)
            Sheet theSheet = wbPart.Workbook
                .Descendants<Sheet>()
                .First(s => s.Name.Value.Trim() == "Sheet1");

            // the next element in the hierarchy is the WorksheetPart;
            // from there we dive down to the SheetData object
            WorksheetPart wsPart = (WorksheetPart)wbPart.GetPartById(theSheet.Id);
            Worksheet worksheet = wsPart.Worksheet;
            SheetData sheetData = worksheet.GetFirstChild<SheetData>();

            // iterate through our data
            foreach (var dataRow in dataSet.Rows)
            {
                // get the Row element from Excel's DOM
                var rowIndex = dataRow.Index;
                var row = sheetData
                    .Elements<Row>()
                    .FirstOrDefault(r => r.RowIndex == rowIndex);

                // if there is no such row, create a new one
                if (row == null)
                {
                    row = new Row { RowIndex = (uint)rowIndex };
                    sheetData.Append(row);
                }

                // now iterate through each cell in the row
                foreach (var dataCell in dataRow.Cells)
                {
                    var cell = row.Elements<Cell>()
                        .FirstOrDefault(c => c.CellReference?.Value == dataCell.Key);

                    if (cell == null)
                    {
                        cell = new Cell
                        {
                            CellReference = dataCell.Key,
                            DataType = CellValues.String
                        };
                        row.AppendChild(cell);
                    }

                    cell.CellValue = new CellValue(dataCell.Value);
                }
            }

            // after all changes to the worksheet DOM we need to save it
            worksheet.Save();
        }
    }

And this is how the code above is used:

    var filePath = "Test.xlsx";

    // number of rows that we want to add to our Excel file
    var testRowsCounter = 100;

    // creating some data for it
    var dataSet = new DataSet();
    dataSet.Rows = new List<DataRow>();

    string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    for (int i = 0; i < testRowsCounter; i++)
    {
        var row = new DataRow
        {
            Cells = new Dictionary<string, string>(),
            Index = i + 1
        };

        for (int j = 0; j < 10; j++)
        {
            row.Cells.Add($"{alphabet[j]}{i + 1}", Guid.NewGuid().ToString());
        }

        dataSet.Rows.Add(row);
    }

    // passing the path to our file and the data object
    SetValuesToExcel(filePath, dataSet);
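The sample above builds cell references from a 26-letter alphabet, which is enough for its 10 columns but stops working past column Z. As a minimal sketch, a zero-based column index can be converted to Excel's letter notation as shown below. The `CellReferences` class and its methods are hypothetical helpers, not part of OpenXML.

```csharp
using System;

public static class CellReferences
{
    // Converts a zero-based column index to Excel letters: 0 -> A, 25 -> Z, 26 -> AA.
    public static string ColumnName(int index)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));

        string name = string.Empty;
        // Excel columns are a bijective base-26 number, so work with a one-based value.
        for (int n = index + 1; n > 0; n = (n - 1) / 26)
        {
            name = (char)('A' + (n - 1) % 26) + name;
        }
        return name;
    }

    // Builds an A1-style reference from a zero-based column and a one-based row.
    public static string For(int columnIndex, int rowIndex) =>
        ColumnName(columnIndex) + rowIndex;
}
```

With this helper, the inner loop could call `CellReferences.For(j, i + 1)` instead of indexing into the alphabet string.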

Metrics

Row count	Processing time	Memory gained
100	454 ms	21 MB
10,000	2.92 s	132 MB
100,000	10 min 47 s 270 ms	333 MB

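In this table, we tested our function with different row counts. As expected, more rows mean slower processing, but the slowdown is far from linear: going from 10,000 to 100,000 rows took more than 200 times longer. One contributor visible in the code is that every row is located with a linear scan over the rows already in the sheet. To address this, we can try another approach.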
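The article does not say how the timings and memory figures were collected. One plausible way to gather comparable numbers, as a rough sketch, is to wrap the call in a `Stopwatch` and compare the process's private memory before and after. The `Metrics.Measure` helper is hypothetical, and results will vary with the machine, the runtime, and the state of the garbage collector.

```csharp
using System;
using System.Diagnostics;

public static class Metrics
{
    // Runs the action once and returns the elapsed time plus the change in
    // private memory of the current process, in megabytes.
    public static (TimeSpan Elapsed, double MemoryDeltaMb) Measure(Action action)
    {
        using var process = Process.GetCurrentProcess();
        long before = process.PrivateMemorySize64;

        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();

        process.Refresh(); // re-read the counters after the work is done
        long after = process.PrivateMemorySize64;

        return (stopwatch.Elapsed, (after - before) / (1024.0 * 1024.0));
    }
}
```

A call such as `Metrics.Measure(() => SetValuesToExcel(filePath, dataSet))` would then return both values for a single run.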

Passing Large Data Sets to Excel

The approach shown above is simple and sufficient for small data sets. However, as the table illustrates, processing large data sets can be considerably slow. This method relies on DOM manipulation, which is inherently slow. In such cases, the SAX (Simple API for XML) approach is invaluable. As the name suggests, SAX lets us work directly with the XML of the Excel document, reading and writing it as a stream of elements instead of an in-memory tree, which is a more efficient way to handle large data sets.
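To make the DOM versus SAX distinction concrete without involving OpenXML at all, the following sketch uses the standard `System.Xml.XmlWriter` to emit row and cell elements in a single forward pass. Nothing is searched or kept in a tree, so each row costs the same no matter how many came before it. The element names mimic a worksheet, but the output is deliberately simplified (no namespaces) and is not a valid worksheet part. `ForwardOnlyDemo` is an illustrative name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ForwardOnlyDemo
{
    // Writes rows as <row r="1"><c r="A1"><v>text</v></c></row> in one pass.
    public static string WriteRows(
        IEnumerable<(int Index, Dictionary<string, string> Cells)> rows)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (var writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("sheetData");
            foreach (var row in rows)
            {
                writer.WriteStartElement("row");
                writer.WriteAttributeString("r", row.Index.ToString());
                foreach (var cell in row.Cells)
                {
                    writer.WriteStartElement("c");
                    writer.WriteAttributeString("r", cell.Key);
                    writer.WriteElementString("v", cell.Value);
                    writer.WriteEndElement(); // c
                }
                writer.WriteEndElement(); // row
            }
            writer.WriteEndElement(); // sheetData
        }

        return output.ToString();
    }
}
```

`OpenXmlWriter` follows the same start-element, content, end-element rhythm, but works with typed OpenXML elements and writes into a package part.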

Let's change the code from the first example to this:

    using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
    {
        WorkbookPart workbookPart = document.WorkbookPart;

        // take the original worksheet part of our template
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
        // add a new one
        WorksheetPart replacementPart = workbookPart.AddNewPart<WorksheetPart>();

        string originalSheetId = workbookPart.GetIdOfPart(worksheetPart);
        string replacementPartId = workbookPart.GetIdOfPart(replacementPart);

        // the main idea is to read through the XML of the original sheet
        OpenXmlReader openXmlReader = OpenXmlReader.Create(worksheetPart);
        // and write it to the new one, injecting our custom data
        OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(replacementPart);

        while (openXmlReader.Read())
        {
            if (openXmlReader.ElementType == typeof(SheetData))
            {
                if (openXmlReader.IsEndElement)
                    continue;

                // write the sheet data element
                openXmlWriter.WriteStartElement(new SheetData());

                // write data rows
                foreach (var row in dataSet.Rows)
                {
                    Row r = new Row { RowIndex = (uint)row.Index };

                    // start row
                    openXmlWriter.WriteStartElement(r);

                    foreach (var rowCell in row.Cells)
                    {
                        Cell c = new Cell
                        {
                            DataType = CellValues.String,
                            CellReference = rowCell.Key,
                            CellValue = new CellValue(rowCell.Value)
                        };

                        // cell
                        openXmlWriter.WriteElement(c);
                    }

                    // end row
                    openXmlWriter.WriteEndElement();
                }

                // end sheet data
                openXmlWriter.WriteEndElement();
            }
            else
            {
                // this block copies all the less interesting parts of the XML,
                // which are still necessary

                // skip rows and cells of the template; the original condition
                // joined these checks with &&, which can never be true
                if (openXmlReader.ElementType == typeof(Row)
                    || openXmlReader.ElementType == typeof(Cell)
                    || openXmlReader.ElementType == typeof(CellValue))
                {
                    openXmlReader.ReadNextSibling();
                    continue;
                }

                if (openXmlReader.IsStartElement)
                {
                    openXmlWriter.WriteStartElement(openXmlReader);
                }
                else if (openXmlReader.IsEndElement)
                {
                    openXmlWriter.WriteEndElement();
                }
            }
        }

        openXmlReader.Close();
        openXmlWriter.Close();

        // after all modifications we switch sheets,
        // pointing the original sheet entry at the new part
        Sheet sheet = workbookPart.Workbook
            .Descendants<Sheet>()
            .First(c => c.Id == originalSheetId);
        sheet.Id.Value = replacementPartId;

        // delete the original worksheet part
        workbookPart.DeletePart(worksheetPart);
    }

Explanation: this code reads the XML elements of the source worksheet one by one and copies them to a new worksheet part, writing our own rows where the SheetData element appears. It then points the sheet entry at the new part and deletes the old one.

Metrics

Row count	Processing time	Memory gained
100	414 ms	22 MB
10,000	961 ms	87 MB
100,000	3 s 488 ms	492 MB
1,000,000	30 s 224 ms	more than 4.5 GB

As you can see, processing a large number of rows is now much faster: 100,000 rows take about 3.5 seconds instead of almost 11 minutes. However, we now have a memory problem to solve.

Unexpected Memory Leaks: Unraveling the Enigma

An attentive reader may have noticed the unexpected jump in memory consumption when processing 10 million cells (1 million rows of 10 cells each). Although 10 million strings carry considerable weight, they should not account for such a large increase. After a careful investigation with memory profilers, the culprit turned out to be inside the OpenXML library.

Specifically, the root cause traces back to a flaw in the .NET System.IO.Packaging package, which affects both .NET Standard and .NET Core. Interestingly, the problem appears to be absent in classic .NET (.NET Framework), probably because of differences in the underlying Windows code base. In short, the OpenXML library uses ZipArchive, which copies data into a MemoryStream every time it updates the file.

This happens only when the archive is opened in update mode, but there is no other way to open it for modification, because that is the behavior of .NET itself.
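The difference between the archive modes can be seen with `System.IO.Compression` alone. In this sketch, `ZipArchiveMode.Create` writes an entry forward into the target stream, whereas `ZipArchiveMode.Update` (mentioned only in a comment) holds the archive content in memory until the archive is disposed, which is the buffering described above. The `ZipModes` class and its method names are illustrative.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ZipModes
{
    // Writes one entry in Create mode: data flows forward into 'target'.
    public static void WriteEntry(Stream target, string entryName, string content)
    {
        // ZipArchiveMode.Update would instead require a readable, writable and
        // seekable stream, and would hold the archive content in memory until
        // the archive is disposed.
        using var archive = new ZipArchive(target, ZipArchiveMode.Create, leaveOpen: true);
        ZipArchiveEntry entry = archive.CreateEntry(entryName, CompressionLevel.Fastest);

        using var entryStream = entry.Open();
        byte[] bytes = Encoding.UTF8.GetBytes(content);
        entryStream.Write(bytes, 0, bytes.Length);
    }

    // Reads the entry back to confirm the archive is well formed.
    public static string ReadEntry(Stream source, string entryName)
    {
        using var archive = new ZipArchive(source, ZipArchiveMode.Read, leaveOpen: true);
        ZipArchiveEntry entry = archive.GetEntry(entryName)
            ?? throw new FileNotFoundException(entryName);

        using var reader = new StreamReader(entry.Open(), Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

An .xlsx file is a ZIP archive of XML parts, so a package that is only ever created, never updated, can stay on the forward-writing path.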

If you want to dig deeper into this problem, you can find more details in GitHub issue #23750.

Later, after studying the .NET source code closely and consulting peers who faced similar challenges, I came up with a workaround. If we can't use the SpreadsheetDocument object to work with our Excel file in Open mode, let's use it in Create mode with our own Package object. That way it does not use the buggy ZipArchive code path under the hood and works as it should.

(Warning: this code currently works only with OpenXML v2.19.0 and earlier.)

Change our code to this:

    using System;
    using System.IO;
    using System.IO.Packaging;
    using System.Linq;
    using System.Threading.Tasks;
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Spreadsheet;

    public class Builder
    {
        public async Task Build(string filePath, string sheetName, DataSet dataSet)
        {
            var worksheetPartId = await FillData(filePath, sheetName, dataSet);
            await WriteAdditionalElements(filePath, sheetName, worksheetPartId);
        }

        public async Task<string> FillData(string filePath, string sheetName, DataSet excelDataRows)
        {
            // open our file in create mode
            await using var fileStream = File.Create(filePath);
            using var package = Package.Open(fileStream, FileMode.Create, FileAccess.Write);
            using var excel = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook);

            // add a new workbook part
            excel.AddWorkbookPart();
            var worksheetPart = excel.WorkbookPart.AddNewPart<WorksheetPart>();
            // relationship ID of the worksheet part, needed later by the Sheet entry
            var worksheetPartId = excel.WorkbookPart.GetIdOfPart(worksheetPart);

            // create the necessary worksheet and sheet data
            OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(worksheetPart);
            openXmlWriter.WriteStartElement(new Worksheet());
            openXmlWriter.WriteStartElement(new SheetData());

            // write data rows
            foreach (var row in excelDataRows.Rows.OrderBy(r => r.Index))
            {
                Row r = new Row { RowIndex = (uint)row.Index };
                openXmlWriter.WriteStartElement(r);

                foreach (var rowCell in row.Cells)
                {
                    Cell c = new Cell
                    {
                        DataType = CellValues.String,
                        CellReference = rowCell.Key
                    };

                    // cell
                    openXmlWriter.WriteStartElement(c);
                    CellValue v = new CellValue(rowCell.Value);
                    openXmlWriter.WriteElement(v);
                    // cell end
                    openXmlWriter.WriteEndElement();
                }

                // end row
                openXmlWriter.WriteEndElement();
            }

            // sheet data end
            openXmlWriter.WriteEndElement();
            // worksheet end
            openXmlWriter.WriteEndElement();
            openXmlWriter.Close();

            return worksheetPartId;
        }

        public async Task WriteAdditionalElements(string filePath, string sheetName, string worksheetPartId)
        {
            // here we add the workbook definition to the file;
            // without it our document would be incomplete
            await using var fileStream = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
            using var package = Package.Open(fileStream, FileMode.Open, FileAccess.ReadWrite);
            using var excel = SpreadsheetDocument.Open(package);

            if (excel.WorkbookPart is null)
                throw new InvalidOperationException("Workbook part cannot be null!");

            var xmlWriter = OpenXmlWriter.Create(excel.WorkbookPart);
            xmlWriter.WriteStartElement(new Workbook());
            xmlWriter.WriteStartElement(new Sheets());
            xmlWriter.WriteElement(new Sheet { Id = worksheetPartId, Name = sheetName, SheetId = 1 });
            xmlWriter.WriteEndElement();
            xmlWriter.WriteEndElement();
            xmlWriter.Close();
        }
    }

And use it like this:

    var builder = new Builder();
    await builder.Build(filePath, "Sheet1", dataSet);

Metrics

Row count	Processing time	Memory gained
100	291 ms	18 MB
10,000	940 ms	62 MB
100,000	3 s 767 ms	297 MB
1,000,000	31 s 354 ms	2.7 GB

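Now our measurements look satisfactory compared with the initial ones: for 1 million rows, memory drops from more than 4.5 GB to 2.7 GB, while processing time stays at roughly 30 seconds.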

Final Thoughts

First of all, the code shown here is purely for demonstration purposes. Practical applications need additional features, such as support for multiple cell types or copying cell styles. Despite the significant optimizations shown in the previous example, applying it directly in real-world scenarios may not be feasible. For large Excel files, a chunk-based approach is usually a better fit.
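As one possible reading of a chunk-based approach, the rows can be produced lazily and consumed in fixed-size batches, so the full data set never has to sit in a `List` before writing begins. This sketch assumes .NET 6 or later for `Enumerable.Chunk`. `GenerateRows`, `ProcessInBatches`, and the batch size are hypothetical stand-ins for a real data source and writer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkedRows
{
    // Lazily yields (row index, cells) pairs; rows are created only when requested.
    public static IEnumerable<(int Index, Dictionary<string, string> Cells)> GenerateRows(int count)
    {
        const string columns = "ABCDEFGHIJ";
        for (int i = 1; i <= count; i++)
        {
            var cells = new Dictionary<string, string>();
            foreach (char column in columns)
            {
                cells.Add($"{column}{i}", Guid.NewGuid().ToString());
            }
            yield return (i, cells);
        }
    }

    // Hands the rows to 'writeBatch' in groups of at most 'batchSize'
    // and returns the number of batches processed.
    public static int ProcessInBatches(
        IEnumerable<(int Index, Dictionary<string, string> Cells)> rows,
        int batchSize,
        Action<(int Index, Dictionary<string, string> Cells)[]> writeBatch)
    {
        int batches = 0;
        foreach (var batch in rows.Chunk(batchSize))
        {
            writeBatch(batch);
            batches++;
        }
        return batches;
    }
}
```

In a real writer, the `writeBatch` callback would emit each batch through `OpenXmlWriter` and let the batch go out of scope before the next one is generated.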

P.S. If you would rather not dig into the intricacies of generating Office documents, you can explore my NuGet package, which simplifies and integrates all of this functionality.

Featured image by vecstock on Freepik
