One of the vital skills of an accomplished data professional is handling large datasets effectively while ensuring data quality and reliability. Data is the fundamental building block of any data system, and whatever other skills you have in our trade, this is one you cannot afford to neglect.
In this article, I explore robust techniques for performing QA checks on large datasets using the Deequ library and statistical methods. By combining the approaches described below, you will be able to maintain data integrity, improve your data management practices, and prevent potential issues in downstream applications.
Ensuring data quality at scale is a daunting task, especially when you work with billions of rows stored in distributed file systems or data warehouses. Deequ is an open-source data profiling and QA framework built on Spark, a modern and versatile tool designed to solve this problem. What sets it apart from similar tools is its seamless integration with Spark, which lets it use distributed processing power to handle large datasets efficiently.
When you try it, you will see how its flexibility lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. In addition, Deequ offers extensive metrics and anomaly detection capabilities that help you identify and fix data quality issues quickly. For data professionals working with large and dynamic datasets, Deequ is a Swiss Army knife. Let's see how we can use it.
More details on setting up the Deequ library and on use cases around data profiling are accessible here. For the sake of simplicity, in this example we just generate a few toy records:
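    // Schema of the toy records. The original snippet omits this definition;
    // the field names are assumed from the columns used in the checks below.
    case class Item(
      id: Long,
      productName: String,
      description: String,
      priority: String,
      numViews: Long
    )

    val rdd = spark.sparkContext.parallelize(Seq(
      Item(1, "Thingy A", "awesome thing.", "high", 0),
      Item(2, "Thingy B", "available at http://thingb.com", null, 0),
      Item(3, null, null, "low", 5),
      Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
      Item(5, "Thingy E", null, "high", 12)))

    val data = spark.createDataFrame(rdd)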
Most data applications come with implicit assumptions about data attributes, such as non-NULL values and uniqueness. With Deequ, these assumptions become explicit through unit tests. Here are some common checks:
Row Count: Ensure the dataset contains a specific number of rows.
Attribute Completeness: Check that attributes such as id and productName are never NULL.
Attribute Uniqueness: Ensure that certain attributes, such as id, are unique.
Value Range: Validate that attributes such as priority and numViews fall within expected ranges.
Pattern Matching: Verify that descriptions contain URLs when expected.
Statistical Properties: Ensure the median of numerical attributes meets specific criteria.
Here is how you can implement these checks using Deequ:
    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

    val verificationResult = VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "unit testing my data")
          .hasSize(_ == 5) // we expect 5 rows
          .isComplete("id") // should never be NULL
          .isUnique("id") // should not contain duplicates
          .isComplete("productName") // should never be NULL
          // should only contain the values "high" and "low"
          .isContainedIn("priority", Array("high", "low"))
          .isNonNegative("numViews") // should not contain negative values
          // at least half of the descriptions should contain a url
          .containsURL("description", _ >= 0.5)
          // half of the items should have less than 10 views
          .hasApproxQuantile("numViews", 0.5, _ <= 10))
      .run()
After you run these checks, Deequ translates them into a series of Spark jobs, which it executes to compute metrics on the data. It then invokes your assertion functions (for example, _ == 5 for the size check) on these metrics to see whether the constraints hold on the data. We can inspect the "verificationResult" object to see whether the checks found errors:
    import com.amazon.deequ.constraints.ConstraintStatus

    if (verificationResult.status == CheckStatus.Success) {
      println("The data passed the test, everything is fine!")
    } else {
      println("We found errors in the data:\n")

      val resultsForAllConstraints = verificationResult.checkResults
        .flatMap { case (_, checkResult) => checkResult.constraintResults }

      resultsForAllConstraints
        .filter { _.status != ConstraintStatus.Success }
        .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
    }
If we run the example, we get the following output:
    We found errors in the data:

    CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
    PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!
The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (that is, 40%) of the values of the description attribute contain a URL. Fortunately, we ran the test and found the errors; somebody should fix the data right away!
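To make the two failing values easier to trace, here is a minimal plain-Python sketch of the same two-step idea: compute metrics first, then apply assertion functions to them. It does not use Deequ or Spark. The helper names (`completeness`, `url_fraction`) and the simplified URL regex are assumptions made for illustration, not Deequ's own implementation.

```python
import re

# Toy records mirroring the Scala example:
# (id, productName, description, priority, numViews)
records = [
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", None, 0),
    (3, None, None, "low", 5),
    (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
    (5, "Thingy E", None, "high", 12),
]

# Simplified URL pattern for illustration only (an assumption, not Deequ's pattern).
URL_PATTERN = re.compile(r"https?://\S+")


def completeness(rows, col):
    """Fraction of rows whose value at column index `col` is not None."""
    return sum(r[col] is not None for r in rows) / len(rows)


def url_fraction(rows, col):
    """Fraction of all rows whose value at column index `col` contains a URL."""
    return sum(bool(r[col] and URL_PATTERN.search(r[col])) for r in rows) / len(rows)


# Step 1: compute the metrics.
metrics = {
    "size": len(records),
    "completeness(productName)": completeness(records, 1),
    "containsURL(description)": url_fraction(records, 2),
}

# Step 2: apply one assertion function per metric.
assertions = {
    "size": lambda v: v == 5,
    "completeness(productName)": lambda v: v == 1.0,
    "containsURL(description)": lambda v: v >= 0.5,
}

failures = {
    name: metrics[name]
    for name, holds in assertions.items()
    if not holds(metrics[name])
}
print(failures)
```

The two entries left in `failures` carry the values 0.8 and 0.4, matching the constraint messages in the output above. Note that rows with a NULL description still count in the denominator of the URL fraction, which is why the result is 2 out of 5 rather than 2 out of 3.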
While Deequ offers a robust framework for data validation, integrating statistical methods can further enhance your QA checks, especially when you deal with aggregated metrics of a dataset. Let's see how you can use statistical methods to monitor and ensure data quality.
Consider a business scenario where an ETL (Extract, Transform, Load) process produces N records in a daily scheduled job. Support teams may want to set up QA checks that raise an alert when the record count deviates significantly. For example, if the process typically generates between 9,500 and 10,500 records per day over two months, any significant increase or decrease could indicate a problem with the underlying data.
We can use a statistical method to define the threshold at which the process should raise an alert to the support team. Below is an illustration of record count tracking over two months:
To analyze this, we can transform the record count data to observe the day-to-day changes. These changes generally oscillate around zero, as shown in the following chart:
When we represent this rate of change with a normal distribution, it forms a bell curve, indicating that the data is normally distributed. The expected change is around 0%, with a standard deviation of 2.63%.
This analysis suggests that the daily change in record count typically falls within two standard deviations of the mean, roughly -5.26% to +5.25%, which covers about 95% of days under the normal assumption. Based on this, you can establish a rule that raises an alert when the change in record count falls outside this range, ensuring timely intervention.
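The threshold rule can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the author's implementation: the function names, the choice of k = 2 standard deviations, and the sample history are all hypothetical.

```python
from statistics import mean, stdev


def pct_changes(counts):
    """Change of each daily count relative to the previous day, in percent."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(counts, counts[1:])]


def alert_band(counts, k=2.0):
    """Return (low, high) percent-change bounds as mean +/- k standard deviations."""
    changes = pct_changes(counts)
    mu, sigma = mean(changes), stdev(changes)
    return mu - k * sigma, mu + k * sigma


def should_alert(prev_count, new_count, band):
    """True when the latest percent change falls outside the band."""
    low, high = band
    change = 100.0 * (new_count - prev_count) / prev_count
    return not (low <= change <= high)


# Hypothetical history of daily record counts (illustrative values only).
history = [10000, 10150, 9900, 10050, 9800, 10100, 10250, 9950, 10000, 10200]
band = alert_band(history)

print(should_alert(history[-1], 10100, band))  # a change of about -1%
print(should_alert(history[-1], 7000, band))   # a drop of about -31%
```

In practice the band would be estimated from a longer window, such as the two months mentioned above, and recomputed periodically so that it follows gradual changes in volume.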
Attribute coverage refers to the ratio of non-NULL values to the total record count for a dataset snapshot. For example, if 8 out of 100 records have a NULL value for a particular attribute, the coverage for that attribute is 92%.
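As a quick check of that arithmetic, here is a small Python sketch. The function name `attribute_coverage` is hypothetical, and `None` stands in for NULL.

```python
def attribute_coverage(values):
    """Share of non-None values, as a percentage of all records in the snapshot."""
    if not values:
        raise ValueError("coverage is undefined for an empty snapshot")
    non_null = sum(v is not None for v in values)
    return 100.0 * non_null / len(values)


# 100 records, 8 of which have no value for the attribute.
snapshot = [None] * 8 + ["some description"] * 92
print(attribute_coverage(snapshot))  # 92.0
```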
Let's review another business case, with an ETL process that generates a snapshot of a product table every day. We want to monitor the coverage of the product description attribute. If the coverage falls below a certain threshold, an alert should be raised to the support team. Below is a visual representation of attribute coverage for product descriptions over two months:
By analyzing the absolute day-to-day differences in coverage, we observe that the changes oscillate around zero:
Representing this data as a normal distribution shows that it is normally distributed, with an expected change of around 0% and a standard deviation of 2.45%.
As we can see, for this dataset the day-to-day change in product description coverage typically stays within two standard deviations, from -4.9% to +4.9%, which covers about 95% of days under the normal assumption. Based on this indicator, we can set a rule that raises an alert when the change in coverage falls outside this range.
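Such a rule could look like the following sketch. It assumes the changes are measured in percentage points of coverage and that the expected change is zero; the constant and function names are hypothetical, and the 2.45 figure is simply the standard deviation quoted above.

```python
SIGMA_POINTS = 2.45  # standard deviation of the daily coverage change, from the text
K = 2                # band width in standard deviations (2 * 2.45 = 4.9)


def coverage_alert(prev_coverage, curr_coverage, sigma=SIGMA_POINTS, k=K):
    """True when the day-to-day coverage change falls outside +/- k * sigma."""
    change = curr_coverage - prev_coverage
    return abs(change) > k * sigma


print(coverage_alert(92.0, 90.5))  # change of -1.5 points, inside the band
print(coverage_alert(92.0, 85.0))  # change of -7.0 points, outside the band
```

A fixed band like this is easy to reason about, but it needs to be revisited whenever the upstream process changes in a way that shifts typical coverage.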
If you work with datasets that show significant variation due to factors such as seasonality or trends, traditional statistical methods may trigger false alerts. Time series algorithms offer a more refined approach, improving the accuracy and reliability of your QA checks.
To produce more sensible alerts, you can use a time series model that accounts for these patterns, such as Holt-Winters.
Let's model daily sales that show both trend and seasonal patterns using Holt-Winters:
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Load the dataset; this assumes the CSV has a 'date' column and one value column
    data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True)

    # Work with a Series on a regular daily index, forward-filling missing days
    sales = data.squeeze(axis='columns').asfreq('D').ffill()

    # Fit the Holt-Winters model with additive trend and yearly seasonality
    # (a 365-day season requires multiple years of daily history)
    model = ExponentialSmoothing(sales, trend='add', seasonal='add', seasonal_periods=365)
    fit = model.fit()

    # Compare in-sample fitted values with the observations and flag large residuals
    fitted = fit.fittedvalues
    residuals = sales - fitted
    threshold = 3 * residuals.std()
    anomalies = residuals[residuals.abs() > threshold]

    print("Anomalies detected:")
    print(anomalies)
Using this method, you can detect significant deviations that may indicate data quality issues, giving you a more nuanced approach to QA checks.
I hope this article helps you implement QA checks effectively on your large datasets. By using the Deequ library and combining it with statistical methods and time series algorithms, you can ensure data integrity and reliability, and ultimately improve your data management practices.
Applying the techniques described above will help you prevent potential issues in downstream applications and improve the overall quality of your data workflows.