QA Checks for Large Datasets with Deequ & Statistical Methods

by Akshay Jain | 7m | 2024/05/30

Too Long; Didn't Read

The Deequ library is an open-source data profiling and QA framework built on Spark. It lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Deequ features extensive metrics and anomaly detection capabilities that help you identify and fix data quality issues. Here is how you can implement these checks using Deequ.


One of the vital skills of an accomplished data professional is the effective handling of large datasets while ensuring data quality and reliability. Data is the central and fundamental piece of any data system, and whatever good skills you have in other aspects of our trade, this is one you cannot afford to overlook.


In this article, I explore robust techniques for performing QA checks on large datasets using the Deequ library and statistical methods. By combining the approaches I explain below, you will be able to maintain data integrity, improve your data management practices, and prevent potential issues in downstream applications.

QA Checks Using the Deequ Library

Why Deequ?

Ensuring data quality at scale is a daunting task, especially when dealing with billions of rows stored in distributed file systems or data warehouses. The Deequ library is an open-source data profiling and QA framework built on Spark, a modern and versatile tool designed to solve this problem. What sets it apart from similar tools is its seamless integration with Spark, which lets it use distributed processing power to handle large datasets efficiently.


When you try it, you will see how its flexibility lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Additionally, Deequ features extensive metrics and anomaly detection capabilities that help you identify and fix data quality issues. For data professionals working with large and dynamic datasets, Deequ is a Swiss-knife solution. Let's see how we can use it.

Setting Up Deequ

More details on setting up the Deequ library and on use cases around data profiling are available here. For simplicity, in this example we just generate a few toy records:


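    // Item describes one record; its field names match the columns used in the checks below
    case class Item(id: Long, productName: String, description: String, priority: String, numViews: Long)

    val rdd = spark.sparkContext.parallelize(Seq(
      Item(1, "Thingy A", "awesome thing.", "high", 0),
      Item(2, "Thingy B", "available at http://thingb.com", null, 0),
      Item(3, null, null, "low", 5),
      Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
      Item(5, "Thingy E", null, "high", 12)))

    val data = spark.createDataFrame(rdd)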


Defining Data Assumptions

Most data applications come with implicit assumptions about data attributes, such as non-NULL values and uniqueness. With Deequ, these assumptions become explicit through unit tests. Here are some common checks:


  1. Row Count: Ensure the dataset contains a specific number of rows.
  2. Attribute Completeness: Check that attributes such as id and productName are never NULL.
  3. Attribute Uniqueness: Ensure that certain attributes, such as id, are unique.
  4. Value Range: Validate that attributes such as priority and numViews fall within expected ranges.
  5. Pattern Matching: Verify that descriptions contain URLs when expected.
  6. Statistical Properties: Ensure the median of numerical attributes meets specific criteria.


Here is how you can implement these checks using Deequ:


    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

    val verificationResult = VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "unit testing my data")
          .hasSize(_ == 5) // we expect 5 rows
          .isComplete("id") // should never be NULL
          .isUnique("id") // should not contain duplicates
          .isComplete("productName") // should never be NULL
          // should only contain the values "high" and "low"
          .isContainedIn("priority", Array("high", "low"))
          .isNonNegative("numViews") // should not contain negative values
          // at least half of the descriptions should contain a url
          .containsURL("description", _ >= 0.5)
          // half of the items should have less than 10 views
          .hasApproxQuantile("numViews", 0.5, _ <= 10))
      .run()


Interpreting Results

Once these checks are run, Deequ translates them into a series of Spark jobs, which it executes to compute metrics on the data. Afterwards, it invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see whether the constraints hold on the data. We can inspect the "verificationResult" object to see if the test found errors:


    import com.amazon.deequ.constraints.ConstraintStatus

    if (verificationResult.status == CheckStatus.Success) {
      println("The data passed the test, everything is fine!")
    } else {
      println("We found errors in the data:\n")

      val resultsForAllConstraints = verificationResult.checkResults
        .flatMap { case (_, checkResult) => checkResult.constraintResults }

      resultsForAllConstraints
        .filter { _.status != ConstraintStatus.Success }
        .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
    }


If we run the example, we get the following output:


    We found errors in the data:

    CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
    PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!


The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (i.e., 40%) of the values of the description attribute contain a URL. Fortunately, we ran a test and found the errors; somebody should fix the data immediately!
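To make the arithmetic behind those two failures concrete, here is a minimal plain-Python sketch that recomputes the same metrics on the toy records and then applies an assertion to each one. It does not use Deequ: the helper names (`completeness`, `url_ratio`) and the simplified URL regex are assumptions made for illustration, and Deequ's own pattern is more elaborate.

```python
import re
from statistics import median

# Toy records mirroring the example: (id, productName, description, priority, numViews)
rows = [
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", None, 0),
    (3, None, None, "low", 5),
    (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
    (5, "Thingy E", None, "high", 12),
]

# Simplified URL pattern (an assumption for illustration, not Deequ's own regex)
URL_PATTERN = re.compile(r"https?://\S+")


def completeness(values):
    """Fraction of values that are not None."""
    return sum(v is not None for v in values) / len(values)


def url_ratio(values):
    """Fraction of all values (None included) that contain a URL."""
    return sum(bool(v and URL_PATTERN.search(v)) for v in values) / len(values)


# Each constraint pairs a computed metric with an assertion on that metric
constraints = {
    "size == 5": (len(rows), lambda m: m == 5),
    "productName complete": (completeness([r[1] for r in rows]), lambda m: m == 1.0),
    "description has URL >= 0.5": (url_ratio([r[2] for r in rows]), lambda m: m >= 0.5),
    "median numViews <= 10": (median(r[4] for r in rows), lambda m: m <= 10),
}

failures = {name: metric for name, (metric, check) in constraints.items() if not check(metric)}
for name, metric in failures.items():
    print(f"{name}: value {metric} does not meet the requirement")
```

The two entries left in `failures` correspond to the completeness and URL constraints reported above; the size and median checks pass on this data.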

QA Checks with Statistical Methods

While Deequ offers a robust framework for data validation, integrating statistical methods can further enhance your QA checks, especially if you are dealing with aggregated metrics of a dataset. Let's see how you can use statistical methods to monitor and ensure data quality.

Record Count Tracking

Consider a business scenario where an ETL (Extract, Transform, Load) process produces N records in a daily scheduled job. Support teams may want to set up QA checks that raise an alert if there is a significant deviation in the record count. For instance, if the process typically generates between 9,500 and 10,500 records daily over two months, any significant increase or decrease could indicate an issue with the underlying data.


We can use a statistical method to define the threshold at which the process should raise an alert to the support team. Below is an illustration of record count tracking over two months:

[Figure: daily record count over two months]

To analyze this, we can transform the record count data to observe the day-to-day changes. These changes generally oscillate around zero, as shown in the following chart:

[Figure: day-to-day change in record count]
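The transformation itself is a one-liner. The sketch below computes the day-over-day percentage change from a list of record counts; the function name `daily_pct_change` and the sample counts are hypothetical and are not the data behind the chart.

```python
def daily_pct_change(counts):
    """Day-over-day change in percent; the first day has no predecessor and is skipped."""
    return [
        (today - yesterday) / yesterday * 100.0
        for yesterday, today in zip(counts, counts[1:])
    ]


# Hypothetical record counts for six consecutive runs (not the article's data)
counts = [10_000, 10_200, 9_900, 10_050, 9_800, 10_100]
changes = daily_pct_change(counts)
print([round(c, 2) for c in changes])
```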












When we plot this rate of change against a normal distribution, it forms a bell curve, indicating that the data is normally distributed. The expected change is around 0%, with a standard deviation of 2.63%.

[Figure: distribution of day-to-day record count changes against a normal curve]

This analysis suggests that the day-to-day change in record count typically falls within the -5.26% to +5.25% range with 90% confidence. Based on this, you can establish a rule to raise an alert if the record count deviates beyond this range, ensuring timely intervention.
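One way such a rule could be wired up is sketched below, assuming the changes are roughly normal. The helper names (`change_band`, `should_alert`) and the sample history are hypothetical. The band width is derived from the chosen confidence level: for 90% the multiplier is about 1.645 standard deviations, whereas a band of about two standard deviations, like the range quoted above, would cover roughly 95% under an exactly normal distribution. The confidence level is therefore left as a parameter to tune.

```python
from statistics import NormalDist, mean, stdev


def change_band(changes, confidence=0.90):
    """Two-sided band around the mean, assuming the changes are roughly normal."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.645 for 90%
    mu, sigma = mean(changes), stdev(changes)
    return mu - z * sigma, mu + z * sigma


def should_alert(previous_count, current_count, band):
    """True when the day-over-day percentage change falls outside the band."""
    low, high = band
    change = (current_count - previous_count) / previous_count * 100.0
    return not (low <= change <= high)


# Hypothetical history of daily percentage changes (not the article's data)
history = [1.2, -0.8, 2.9, -3.1, 0.4, -2.2, 3.5, -1.0, 0.1, -1.0]
band = change_band(history)
print(band)
print(should_alert(10_000, 10_100, band))  # a 1% rise
print(should_alert(10_000, 8_500, band))   # a 15% drop
```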

Attribute Coverage Tracking

Attribute coverage refers to the ratio of non-NULL values to the total record count for a dataset snapshot. For example, if 8 out of 100 records have a NULL value for a particular attribute, the coverage for that attribute is 92%.
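As a small illustration of that definition, the sketch below computes coverage for one attribute of a snapshot represented as a Python list, with `None` standing in for NULL. The function name `attribute_coverage` is hypothetical.

```python
def attribute_coverage(values):
    """Share of records whose value for the attribute is not None (NULL)."""
    if not values:
        raise ValueError("coverage is undefined for an empty snapshot")
    return sum(v is not None for v in values) / len(values)


# 100 records, 8 of them NULL, as in the example
snapshot = ["some description"] * 92 + [None] * 8
print(f"{attribute_coverage(snapshot):.0%}")
```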


Let's look at another business case, with an ETL process that generates a product table snapshot every day. We want to monitor the coverage of the product description attribute. If the coverage falls below a certain threshold, an alert should be raised for the support team. Below is a visual representation of product description coverage over two months:

[Figure: product description coverage over two months]

By analyzing the absolute day-to-day differences in coverage, we observe that the changes oscillate around zero:

[Figure: absolute day-to-day differences in coverage]

Representing this data as a normal distribution shows that it is normally distributed, with an expected change of around 0% and a standard deviation of 2.45%.

[Figure: distribution of day-to-day coverage differences against a normal curve]

As we can see, for this dataset the day-to-day change in product description coverage typically ranges from -4.9% to +4.9% with 90% confidence. Based on this indicator, we can set a rule to raise an alert if the coverage deviates beyond this range.
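A rule of this kind might look like the sketch below, which flags every day whose coverage moved by more than 4.9 percentage points relative to the previous snapshot. The function name `coverage_alerts` and the sample coverage values are hypothetical.

```python
def coverage_alerts(daily_coverage_pct, max_abs_change=4.9):
    """Indices of days whose coverage moved more than max_abs_change points vs. the previous day."""
    pairs = zip(daily_coverage_pct, daily_coverage_pct[1:])
    return [
        day
        for day, (prev, cur) in enumerate(pairs, start=1)
        if abs(cur - prev) > max_abs_change
    ]


# Hypothetical daily coverage values in percent (not the article's data)
coverage = [92.0, 93.5, 91.0, 84.0, 90.5]
print(coverage_alerts(coverage))
```

Because the rule looks at absolute differences, a sharp recovery is flagged as well as a sharp drop; whether that is desirable depends on how the support team wants to be notified.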

QA Checks with Time Series Algorithms

If you work with datasets that show significant variations due to factors such as seasonality or trends, traditional statistical methods might trigger false alerts. Time series algorithms offer a more refined approach, improving the accuracy and reliability of your QA checks.


To produce more sensible alerts, you can use either the Autoregressive Integrated Moving Average (ARIMA) model or the Holt-Winters method. The former is good enough for datasets with trends, but the latter lets us deal with datasets that have both trend and seasonality. Holt-Winters uses components for level, trend, and seasonality, which allows it to adapt flexibly to changes over time.
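To show how the three components interact, here is a from-scratch sketch of the additive Holt-Winters update equations that produces one-step-ahead forecasts. It is an illustration only: the function name, the smoothing weights (`alpha`, `beta`, `gamma`) and the averaging heuristic for the initial state are assumptions, whereas a library would normally estimate these values from the data.

```python
def holt_winters_additive(y, m, alpha=0.3, beta=0.1, gamma=0.2):
    """One-step-ahead forecasts for y[m:], using additive Holt-Winters updates.

    y: list of observations; m: season length (needs at least 2 * m points).
    The initial level, trend and seasonal terms use a simple averaging heuristic.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    first_mean = sum(y[:m]) / m
    second_mean = sum(y[m:2 * m]) / m
    level = first_mean
    trend = (second_mean - first_mean) / m
    season = [value - first_mean for value in y[:m]]  # season[t] is the seasonal term at time t

    forecasts = []
    for t in range(m, len(y)):
        s_old = season[t - m]
        forecasts.append(level + trend + s_old)  # forecast made before seeing y[t]
        new_level = alpha * (y[t] - s_old) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        season.append(gamma * (y[t] - level - trend) + (1 - gamma) * s_old)
        level, trend = new_level, new_trend
    return forecasts


# Hypothetical series: a repeating 4-step pattern with one injected spike
series = [10, 20, 30, 40] * 4
series[9] += 50
predicted = holt_winters_additive(series, m=4)
residuals = [actual - fc for actual, fc in zip(series[4:], predicted)]
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print("largest residual at index", worst + 4)
```

On this synthetic series the seasonal pattern is forecast almost exactly until the spike, so the spike stands out as the largest residual, which is the property the anomaly check relies on.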


Let's model daily sales that exhibit both trend and seasonal patterns using Holt-Winters:

    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Load and preprocess the dataset
    # (assumes sales_data.csv has a 'date' column and a single value column)
    data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True)
    data = data.asfreq('D').ffill().squeeze('columns')  # daily Series, gaps forward-filled

    # Fit the Holt-Winters model (yearly seasonality needs at least two years of daily data)
    model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=365)
    fit = model.fit()

    # Compare the in-sample fitted values with the observations and flag anomalies
    forecast = fit.fittedvalues
    residuals = data - forecast
    threshold = 3 * residuals.std()
    anomalies = residuals[residuals.abs() > threshold]

    print("Anomalies detected:")
    print(anomalies)


Using this method, you can detect significant deviations that might indicate data quality issues, giving you a more nuanced approach to QA checks.


I hope this article helps you implement QA checks for your large datasets efficiently. By using the Deequ library and integrating statistical methods and time series algorithms, you can ensure data integrity and reliability, ultimately improving your data management practices.


Implementing the techniques described above will help you prevent potential issues in downstream applications and improve the overall quality of your data workflows.