
Limitations and References


Authors:

(1) Ahatsham Hayat, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]);

(2) Mohammad Rashedul Hasan, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]).

Table of Links

Abstract and 1 Introduction

2 Method

2.1 Problem Formulation and 2.2 Missingness Patterns

2.3 Generating Missing Values

2.4 Description of CLAIM

3 Experiments

3.1 Results

4 Related Work

5 Conclusion and Future Directions

6 Limitations and References

6 Limitations

Despite the notable advances CLAIM offers for addressing missing data in tabular datasets, this work has several limitations. First, the efficacy of CLAIM depends inherently on the quality and breadth of the data used to train the underlying LLMs; when the LLMs have not been exposed to data resembling the specific context or domain of the missing information, their ability to generate accurate and relevant imputations may be compromised. Second, the approach assumes that the descriptive context supplied for missing values is informative enough for the LLM to understand and act upon, which may not always be the case. Third, the computational requirements of processing large datasets with CLAIM, given the need to interact with sophisticated LLMs, could pose scalability challenges. Finally, while CLAIM shows promise in handling various missingness mechanisms, its performance in highly specialized or niche domains, where expert knowledge significantly influences data interpretation, has yet to be fully explored.
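Because the second limitation hinges on how the descriptive context for a missing value is conveyed to the LLM, the minimal Python sketch below illustrates one way such a contextual prompt might be assembled from a record and its column descriptions. It is an illustration of the general idea only, not the authors' CLAIM implementation; the column names, descriptions, and build_imputation_prompt function are hypothetical.

```python
# Minimal sketch of a context-based imputation prompt (illustrative only; the
# exact prompt format used by CLAIM is not reproduced here, and the column
# names, descriptions, and function below are hypothetical).

def build_imputation_prompt(row, target_column, column_descriptions):
    """Describe the observed fields of a record and ask for the missing one."""
    observed = [
        f"- {col} ({column_descriptions.get(col, 'no description')}): {val}"
        for col, val in row.items()
        if col != target_column and val is not None
    ]
    target_desc = column_descriptions.get(target_column, "no description")
    return (
        "The following record has one missing field.\n"
        + "\n".join(observed)
        + f"\n\nBased on this context, provide the most plausible value for "
          f"'{target_column}' ({target_desc}). Answer with the value only."
    )


# Example with a hypothetical medical record; the blood_pressure value is missing.
row = {"age": 47, "sex": "female", "blood_pressure": None, "cholesterol": 210}
descriptions = {
    "age": "patient age in years",
    "sex": "biological sex",
    "blood_pressure": "resting systolic blood pressure in mmHg",
    "cholesterol": "serum cholesterol in mg/dL",
}
print(build_imputation_prompt(row, "blood_pressure", descriptions))
# The resulting prompt would be sent to an LLM of choice; as noted above, the
# quality of the returned imputation depends on how informative these
# descriptions are and on what the model was exposed to during pretraining.
```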

References

  1. Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling, Synthesis Lectures on Data Management, vol. 10. Morgan & Claypool (2018). https://doi.org/10.2200/s00878ed1v01y201810dtm052


  2. Achiam, J., Adler, S., Agarwal, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)


  3. Batista, G.E., Monard, M.C.: A study of k-nearest neighbour as an imputation method. In: Frontiers in Artificial Intelligence and Applications. vol. 87, pp. 251–260. HIS (2002)


  4. Bhatia, K., Narayan, A., De Sa, C., Ré, C.: TART: A plug-and-play Transformer module for task-agnostic reasoning (Jun 2023). https://doi.org/10.48550/arXiv.2306.07536, http://arxiv.org/abs/2306.07536, arXiv:2306.07536 [cs]


  5. Biessmann, F., Salinas, D., Schelter, S., Schmidt, P., Lange, D.: "Deep" learning for missing value imputation in tables with non-numerical data. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. pp. 2017–2025. CIKM ’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3269206.3272005


  6. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners. In: Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)


  7. van Buuren, S., Groothuis-Oudshoorn, K.: mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3), 1–67 (2011)


  8. Camino, R.D., Hammerschmidt, C.A., State, R.: Improving missing data imputation with deep generative models. arXiv preprint arXiv:1902.10666 pp. 1–8 (2019)


  9. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A.M., Pillai, T.S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., Fiedel, N.: PaLM: Scaling Language Modeling with Pathways (Oct 2022), http://arxiv.org/abs/2204.02311, arXiv:2204.02311 [cs]


  10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22 (1977)


  11. Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: Qlora: Efficient finetuning of quantized llms (2023)


  12. Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml


  13. Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., Tabona, O.: A survey on missing data in machine learning. J Big Data 8(1), 140 (2021). https://doi.org/10.1186/s40537-021-00516-9, epub 2021 Oct 27. PMID: 34722113; PMCID: PMC8549433


  14. García-Laencina, P.J., Sancho-Gómez, J., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010). https://doi.org/10.1007/s00521-009-0295-6


  15. Gimpy, M.: Missing value imputation in multi attribute data set. Int. J. Comput. Sci. Inf. Technol. 5(4), 1–7 (2014)


  16. Gondara, L., Wang, K.: MIDA: Multiple imputation using denoising autoencoders. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 260–272. Springer (2018)


  17. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems. vol. 27, pp. 2672–2680. Curran Associates, Inc., Montréal, Canada (2014)


  18. Gupta, A., Lam, M.S.: Estimating missing values using neural networks. Journal of the Operational Research Society 47(2), 229–238 (1996)


  19. Hallaji, E., Razavi-Far, R., Saif, M.: DLIN: Deep ladder imputation network. IEEE Transactions on Cybernetics 52(9), 8629–8641 (2021)


  20. Jäger, S., Allhorn, A., Biessmann, F.: A benchmark for data imputation methods. Front Big Data 4, 693674 (2021). https://doi.org/10.3389/fdata.2021.693674, PMID: 34308343; PMCID: PMC8297389


  21. Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, vol. 793. John Wiley & Sons, 3 edn. (2019)


  22. Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. John Wiley & Sons, Hoboken, 2 edn. (2002)


  23. Lu, H.M., Perrone, G., Unpingco, J.: Multiple imputation with denoising autoencoder using metamorphic truth and imputation feedback. arXiv preprint arXiv:2002.08338 (2020)


  24. McCoy, J.T., Kroon, S., Auret, L.: Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-PapersOnLine 51(21), 141–146 (2018), 5th IFAC Workshop on Mining, Mineral and Metal Processing MMM 2018


  25. Nazabal, A., Olmos, P.M., Ghahramani, Z., Valera, I.: Handling incomplete heterogeneous data using VAEs. arXiv preprint arXiv:1807.03653 (2018)


  26. OpenAI: GPT-4 Technical Report (Mar 2023). https://doi.org/10.48550/arXiv.2303.08774, http://arxiv.org/abs/2303.08774, arXiv:2303.08774 [cs]


  27. Qiu, Y.L., Zheng, H., Gevaert, O.: Genomic data imputation with variational auto-encoders. GigaScience 9(8), giaa082 (2020)


  28. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 140:5485–140:5551 (Jan 2020)


  29. Roberts, A., Raffel, C., Shazeer, N.: How Much Knowledge Can You Pack Into the Parameters of a Language Model? In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5418–5426. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.437, https://aclanthology.org/2020.emnlp-main.437


  30. Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976). https://doi.org/10.1093/biomet/63.3.581


  31. Rubin, D.B.: Multiple imputations in sample surveys - a phenomenological Bayesian approach to nonresponse. In: Proceedings of the Survey Research Methods Section of the American Statistical Association. vol. 1, pp. 20–34. American Statistical Association, Alexandria, VA, USA (1978)


  32. Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York, NY (2004)


  33. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London, UK (1997)


  34. Schelter, S., Biessmann, F., Januschowski, T., Salinas, D., Seufert, S., Szarvas, G.: On challenges in machine learning model management. IEEE Data Eng. Bull. 41(4), 5–15 (2018), http://sites.computer.org/debull/A18dec/p5.pdf


  35. Schelter, S., Rukat, T., Biessmann, F.: JENGA - A framework to study the impact of data errors on the predictions of machine learning models. In: Velegrakis, Y., Zeinalipour-Yazti, D., Chrysanthis, P.K., Guerra, F. (eds.) Proceedings of the 24th International Conference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021. pp. 529–534. OpenProceedings.org (2021). https://doi.org/10.5441/002/EDBT.2021.63, https://doi.org/10.5441/002/edbt.2021.63


  36. Sharpe, P.K., Solly, R.: Dealing with missing values in neural network-based diagnostic systems. Neural Computing & Applications 3(2), 73–77 (1995)


  37. Stekhoven, D.J., Bühlmann, P.: MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012)


  38. Stoyanovich, J., Howe, B., Jagadish, H.V.: Responsible data management. Proceedings of the VLDB Endowment 13, 3474–3488 (2020). https://doi.org/10.14778/3415478.3415570


  39. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and Efficient Foundation Language Models (Feb 2023), http://arxiv.org/abs/2302.13971, arXiv:2302.13971 [cs]


  40. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open foundation and fine-tuned chat models (2023)


  41. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. pp. 1096–1103 (2008)


  42. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Jan 2023). https://doi.org/10.48550/arXiv.2201.11903, http://arxiv.org/abs/2201.11903, arXiv:2201.11903 [cs]


  43. Yang, K., Huang, B., Stoyanovich, J., Schelter, S.: Fairness-aware instrumentation of preprocessing pipelines for machine learning. In: Proceedings of the Workshop on Human-in-the-Loop Data Analytics (HILDA’20). ACM (2020). https://doi.org/10.1145/3398730.3399194


  44. Yoon, J., Jordon, J., van der Schaar, M.: GAIN: Missing data imputation using generative adversarial nets. In: International Conference on Machine Learning. pp. 5689–5698. PMLR (2018)


  45. Yoon, J., Jordon, J., van der Schaar, M.: GAIN: Missing data imputation using generative adversarial nets (2018)


This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.

