QA Checks for Large Datasets with Deequ & Statistical Methods


by Akshay Jain (@akshayjain1986), 7 min read, 2024/05/30

Too Long; Didn't Read

Deequ is an open-source data profiling and QA framework built on Spark. It lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Deequ features extensive metrics and anomaly detection capabilities that help you identify and proactively address data quality issues. Here's how you can implement these checks using Deequ.

One of the vital skills of an accomplished data professional is the effective handling of large datasets while ensuring data quality and reliability. Data is the central and fundamental piece of any data system, and whatever other skills you bring to the trade, this is one you cannot afford to overlook.

In this article, I explore robust techniques for performing QA checks on large datasets using the Deequ library and statistical methods. By combining the approaches I explain below, you will be able to maintain data integrity, improve your data management practices, and prevent potential issues in downstream applications.

QA Checks Using the Deequ Library

Why Deequ?

Ensuring data quality at scale is a daunting task, especially when dealing with billions of rows stored in distributed file systems or data warehouses. The Deequ library is an open-source data profiling and QA framework built on Spark, a modern and versatile tool designed to solve this problem. What sets it apart from similar tools is its ability to integrate seamlessly with Spark, using distributed processing power to handle large-scale datasets efficiently.

When you try it, you will see how its flexibility lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Additionally, Deequ features extensive metrics and anomaly detection capabilities that help you identify and proactively address data quality issues. For data professionals working with large and dynamic datasets, Deequ is a Swiss-army-knife solution. Let's see how we can use it.

Setting Up Deequ

More details on the Deequ library setup and use cases around data profiling are available here. For the sake of simplicity, in this example we just generate a few toy records:


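    // Schema of the toy records (assumed definition; the original snippet omits it)
    case class Item(
      id: Long,
      productName: String,
      description: String,
      priority: String,
      numViews: Long
    )

    val rdd = spark.sparkContext.parallelize(Seq(
      Item(1, "Thingy A", "awesome thing.", "high", 0),
      Item(2, "Thingy B", "available at http://thingb.com", null, 0),
      Item(3, null, null, "low", 5),
      Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
      Item(5, "Thingy E", null, "high", 12)))

    val data = spark.createDataFrame(rdd)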


Defining Data Assumptions

Most data applications come with implicit assumptions about data attributes, such as non-NULL values and uniqueness. With Deequ, these assumptions become explicit through unit tests. Here are some common checks:


  1. Row Count: Ensure the dataset contains a specific number of rows.

  2. Attribute Completeness: Check that attributes such as id and productName are never NULL.

  3. Attribute Uniqueness: Ensure that certain attributes, such as id, are unique.

  4. Value Range: Validate that attributes such as priority and numViews fall within expected ranges.

  5. Pattern Matching: Verify that descriptions contain URLs when expected.

  6. Statistical Properties: Ensure that the median of numerical attributes meets specific criteria.


Here is how you can implement these checks using Deequ:


    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

    val verificationResult = VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "unit testing my data")
          .hasSize(_ == 5) // we expect 5 rows
          .isComplete("id") // should never be NULL
          .isUnique("id") // should not contain duplicates
          .isComplete("productName") // should never be NULL
          // should only contain the values "high" and "low"
          .isContainedIn("priority", Array("high", "low"))
          .isNonNegative("numViews") // should not contain negative values
          // at least half of the descriptions should contain a url
          .containsURL("description", _ >= 0.5)
          // half of the items should have less than 10 views
          .hasApproxQuantile("numViews", 0.5, _ <= 10))
      .run()


Interpreting Results

After you run these checks, Deequ translates them into a series of Spark jobs, which it executes to compute metrics on the data. It then invokes your assertion functions (for example, _ == 5 for the size check) on these metrics to see whether the constraints hold on the data. We can inspect the "verificationResult" object to see if the test found errors:


    import com.amazon.deequ.constraints.ConstraintStatus

    if (verificationResult.status == CheckStatus.Success) {
      println("The data passed the test, everything is fine!")
    } else {
      println("We found errors in the data:\n")

      // Collect the results of every constraint across all checks
      val resultsForAllConstraints = verificationResult.checkResults
        .flatMap { case (_, checkResult) => checkResult.constraintResults }

      // Print only the constraints that did not succeed
      resultsForAllConstraints
        .filter { _.status != ConstraintStatus.Success }
        .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
    }


If we run the example, we get the following output:


    We found errors in the data:

    CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
    PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!


The test found that our assumptions were violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (that is, 40%) of the values of the description attribute contain a URL. Fortunately, we ran a test and found the errors; somebody should fix the data right away!
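
To make the "compute a metric, then apply an assertion" idea concrete, here is a minimal plain-Python sketch over the same toy records. It is not Deequ's implementation: the helper functions, the simplified URL pattern, and the constraint names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Toy records mirroring the Scala example above
FIELDS = ("id", "productName", "description", "priority", "numViews")
ROWS = [
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", None, 0),
    (3, None, None, "low", 5),
    (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
    (5, "Thingy E", None, "high", 12),
]
records = [dict(zip(FIELDS, row)) for row in ROWS]

# Simplified URL pattern (an assumption, not the pattern Deequ uses)
URL_PATTERN = re.compile(r"https?://\S+")


def completeness(rows, column):
    """Fraction of rows whose value in `column` is not None."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Fraction of rows whose value in `column` occurs exactly once."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return sum(1 for c in counts.values() if c == 1) / len(rows)


def url_fraction(rows, column):
    """Fraction of rows whose value in `column` contains a URL."""
    return sum(bool(r[column] and URL_PATTERN.search(r[column])) for r in rows) / len(rows)


# Each constraint pairs a computed metric with an assertion on that metric.
constraints = [
    ("size", len(records), lambda v: v == 5),
    ("completeness(id)", completeness(records, "id"), lambda v: v == 1.0),
    ("uniqueness(id)", uniqueness(records, "id"), lambda v: v == 1.0),
    ("completeness(productName)", completeness(records, "productName"), lambda v: v == 1.0),
    ("containsURL(description)", url_fraction(records, "description"), lambda v: v >= 0.5),
]

failures = [(name, value) for name, value, check in constraints if not check(value)]
for name, value in failures:
    print(f"{name}: value {value} does not meet the requirement")
```

The two constraints that fail here are the same two reported in the output above, with metric values 0.8 and 0.4.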

QA Checks With Statistical Methods

While Deequ offers a robust framework for data validation, integrating statistical methods can further strengthen your QA checks, especially when you are dealing with aggregated metrics of a dataset. Let's see how you can use statistical methods to monitor and ensure data quality.

Record Count Tracking

Consider a business scenario where an ETL (Extract, Transform, Load) process produces N records in a daily scheduled job. Support teams may want to set up QA checks that raise an alert when the record count deviates significantly. For instance, if the process typically generates between 9,500 and 10,500 records daily over two months, any significant increase or decrease could indicate an issue with the underlying data.


We can use a statistical method to define the threshold at which the process should raise an alert to the support team. Below is an illustration of record count tracking over two months:

[Figure: daily record count over two months]

To analyze this, we can transform the record count data to observe the day-to-day changes. These changes generally oscillate around zero, as shown in the following chart:

[Figure: day-to-day change in record count]

When we plot this rate of change as a distribution, it forms a bell curve, indicating that the changes are approximately normally distributed. The expected change is around 0%, with a standard deviation of 2.63%.

[Figure: distribution of day-to-day change in record count, with a fitted normal curve]

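This analysis suggests that the day-to-day change in record count typically falls within the -5.26% to +5.25% range, about two standard deviations either side of the mean, with at least 90% confidence (under a normal distribution, a two-standard-deviation band covers roughly 95% of values). Based on this, you can set a rule that raises an alert when the record count changes by more than this range, ensuring timely intervention.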
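
The thresholding idea can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the author's implementation: the `history` values are made up, the function names are hypothetical, and the band is derived from a normal approximation of the day-to-day percentage changes.

```python
from statistics import NormalDist, mean, stdev


def pct_changes(counts):
    """Day-to-day change in percent between consecutive record counts."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(counts, counts[1:])]


def change_band(changes, confidence=0.90):
    """Two-sided band expected to contain `confidence` of the changes,
    assuming they are normally distributed."""
    mu, sigma = mean(changes), stdev(changes)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.645 for 90%
    return mu - z * sigma, mu + z * sigma


def should_alert(previous, current, band):
    """True when the latest day-to-day change falls outside the band."""
    low, high = band
    change = 100.0 * (current - previous) / previous
    return not (low <= change <= high)


# Hypothetical daily record counts, for illustration only
history = [10000, 10150, 9900, 10050, 9800, 10100, 10250, 9950, 10000, 10200]
band = change_band(pct_changes(history))

print("band (%):", band)
print("alert on 10000 -> 10100:", should_alert(10000, 10100, band))
print("alert on 10000 -> 12000:", should_alert(10000, 12000, band))
```

With these made-up counts, a 1% move stays inside the band while a 20% jump triggers an alert. In practice the band would be recomputed periodically from a rolling window of recent history.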

Attribute Coverage Tracking

Attribute coverage refers to the ratio of non-NULL values to the total record count for a dataset snapshot. For example, if 8 out of 100 records have a NULL value for a particular attribute, the coverage for that attribute is 92%.
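
As a small illustration of this definition, the sketch below computes coverage for a snapshot held as a list of dictionaries. The function name and the record layout are assumptions chosen for the example.

```python
def attribute_coverage(records, attribute):
    """Ratio of records with a non-None value for `attribute` to all records."""
    if not records:
        raise ValueError("coverage is undefined for an empty snapshot")
    non_null = sum(1 for record in records if record.get(attribute) is not None)
    return non_null / len(records)


# Hypothetical snapshot: 8 of 100 records have no description
snapshot = [{"description": None}] * 8 + [{"description": "some text"}] * 92

coverage = attribute_coverage(snapshot, "description")
print(f"coverage: {coverage:.0%}")
```

For the snapshot above, this reproduces the 92% figure from the example.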


Let's look at another business case, with an ETL process that generates a product table snapshot daily. We want to monitor the coverage of the product description attribute. If the coverage falls below a certain threshold, an alert should be raised to the support team. Below is a visual representation of attribute coverage for product descriptions over two months:

[Figure: product description attribute coverage over two months]

By analyzing the absolute day-to-day differences in coverage, we observe that the changes oscillate around zero:

[Figure: day-to-day differences in product description coverage]

Plotting these differences as a distribution shows that they are approximately normally distributed, with an expected change of around 0% and a standard deviation of 2.45%.

[Figure: distribution of day-to-day coverage differences, with a fitted normal curve]

As we can see, for this dataset the day-to-day change in product description coverage typically ranges from -4.9% to +4.9%, about two standard deviations either side of zero, with at least 90% confidence. Based on this indicator, we can set a rule that raises an alert when the coverage changes by more than this range.
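
As a quick sanity check on how band width relates to confidence level, the sketch below assumes the day-to-day coverage changes follow a normal distribution with mean 0 and the standard deviation of 2.45 quoted above. It computes the share of changes expected inside a band of 4.9 either side of zero, and the half-width that a 90% band would need.

```python
from statistics import NormalDist

SIGMA = 2.45  # standard deviation of the day-to-day coverage change (from the text)
BAND = 4.9    # half-width of the alert band (from the text)

# Assumption: changes are normally distributed around zero
changes = NormalDist(mu=0.0, sigma=SIGMA)

# Share of changes expected to fall inside [-BAND, +BAND]
inside = changes.cdf(BAND) - changes.cdf(-BAND)

# Half-width of a band covering 90% of changes
half_width_90 = NormalDist().inv_cdf(0.95) * SIGMA

print(f"share of changes inside +/-{BAND}: {inside:.4f}")
print(f"half-width of a 90% band: {half_width_90:.2f}")
```

Under the normality assumption, a band of 4.9 covers a little over 95% of changes, while a 90% band would be closer to 4.03 wide on each side. A wider band means fewer false alerts at the cost of missing smaller deviations.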

QA Checks With Time Series Algorithms

When you work with datasets that show significant variation due to factors such as seasonality or trends, traditional statistical methods can trigger false alerts. Time series algorithms offer a more refined approach, improving the accuracy and reliability of your QA checks.


To produce more sensible alerts, you can use either the Autoregressive Integrated Moving Average (ARIMA) model or the Holt-Winters method. The former is good enough for datasets with trends, but the latter lets us deal with datasets that have both trend and seasonality. The Holt-Winters method uses components for level, trend, and seasonality, which allows it to adapt flexibly to changes over time.
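
To show what the level, trend, and seasonality components do, here is a minimal pure-Python sketch of the additive Holt-Winters recursions in their classic textbook form. The smoothing parameters, the initialization scheme, and the synthetic series are all assumptions chosen for illustration; a production setup would normally rely on a library implementation.

```python
def holt_winters_additive(y, m, alpha=0.3, beta=0.1, gamma=0.1):
    """Return one-step-ahead forecasts for y[m:] using additive Holt-Winters.

    y: observed values, m: season length,
    alpha/beta/gamma: smoothing weights for level, trend, and seasonality.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    first, second = y[:m], y[m:2 * m]
    # Initial trend: average per-step change between the first two seasons
    trend = (sum(second) - sum(first)) / (m * m)
    # Initial level at the end of the first season
    level = sum(first) / m + trend * (m - 1) / 2
    # Initial seasonal offsets: first-season values minus the trend line
    seasonal = [y[i] - (level - trend * (m - 1 - i)) for i in range(m)]

    forecasts = []
    for t in range(m, len(y)):
        s = seasonal[t % m]
        forecasts.append(level + trend + s)  # forecast made before seeing y[t]
        prev_level = level
        level = alpha * (y[t] - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y[t] - level) + (1 - gamma) * s
    return forecasts


# Synthetic daily series: linear trend plus a weekly pattern (made-up numbers)
weekly = [0, 5, 10, 5, 0, -10, -10]
series = [100 + 2 * t + weekly[t % 7] for t in range(56)]
series[40] += 60  # injected anomaly

forecasts = holt_winters_additive(series, m=7)
residuals = [actual - predicted for actual, predicted in zip(series[7:], forecasts)]
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print("largest residual at day", worst + 7)
```

On this synthetic series the largest residual lands on the day where the anomaly was injected, which is the behavior a residual-based alert relies on.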


Let's mock-model daily sales that exhibit both trend and seasonal patterns using Holt-Winters:

    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Load and preprocess the dataset
    data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True)
    # Work on a single numeric series; this assumes the first column holds daily sales
    sales = data.iloc[:, 0].asfreq('D').ffill()

    # Fit the Holt-Winters model
    # (yearly seasonality needs at least two full years of daily data)
    model = ExponentialSmoothing(sales, trend='add', seasonal='add', seasonal_periods=365)
    fit = model.fit()

    # Compare in-sample fitted values with the actuals and flag large residuals
    fitted = fit.fittedvalues
    residuals = sales - fitted
    threshold = 3 * residuals.std()
    anomalies = residuals[residuals.abs() > threshold]

    print("Anomalies detected:")
    print(anomalies)


Using this method, you can detect significant deviations that may indicate data quality issues, which gives you a more nuanced approach to QA checks.


I hope this article helps you implement QA checks on your large datasets efficiently. By using the Deequ library and integrating statistical methods and time series algorithms, you can ensure data integrity and reliability, ultimately improving your data management practices.


Implementing the techniques described above will help you prevent potential issues in downstream applications and improve the overall quality of your data workflows.