
QA Checks for Large Datasets with Deequ & Statistical Methods

by Akshay Jain, 7 min read, 2024/05/30

Too Long; Didn't Read

The Deequ library is an open-source data profiling and QA framework built on Spark. It lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Deequ also offers extensive metrics and anomaly detection capabilities that help you identify and proactively address data quality issues. Here's how you can implement these checks using Deequ.


One of the vital skills of an accomplished data professional is handling large datasets effectively while ensuring data quality and reliability. Data is the central and fundamental piece of any data system, and however strong your skills are in other aspects of our trade, this is one you cannot afford to overlook.


In this article, I explore robust techniques for performing QA checks on large datasets using the Deequ library and statistical methods. By combining the approaches I explain below, you will be able to maintain data integrity, improve your data management practices, and prevent potential issues in downstream applications.

QA Checks Using the Deequ Library

Why Deequ?

Ensuring data quality at scale is a daunting task, especially when dealing with billions of rows stored in distributed file systems or data warehouses. The Deequ library is an open-source data profiling and QA framework built on Spark, a modern and versatile tool designed to solve this problem. What sets it apart from similar tools is its seamless integration with Spark, which lets it use distributed processing power to handle large-scale datasets efficiently.


When you try it, you will see how its flexibility lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Additionally, Deequ offers extensive metrics and anomaly detection capabilities that help you identify and proactively address data quality issues. For data professionals working with large and dynamic datasets, Deequ is a Swiss Army knife. Let's see how we can use it.

Setting Up Deequ

More details on setting up the Deequ library and on use cases around data profiling are available here. For the sake of simplicity, in this example we just generated a few toy records:


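    // Assumed record schema: the original snippet does not show this definition
    case class Item(
      id: Long,
      productName: String,
      description: String,
      priority: String,
      numViews: Long
    )

    val rdd = spark.sparkContext.parallelize(Seq(
      Item(1, "Thingy A", "awesome thing.", "high", 0),
      Item(2, "Thingy B", "available at http://thingb.com", null, 0),
      Item(3, null, null, "low", 5),
      Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
      Item(5, "Thingy E", null, "high", 12)))

    val data = spark.createDataFrame(rdd)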


Defining Data Assumptions

Most data applications come with implicit assumptions about data attributes, such as non-NULL values and uniqueness. With Deequ, these assumptions become explicit through unit tests. Here are some common checks:


  1. Row Count: Ensure the dataset contains a specific number of rows.

  2. Attribute Completeness: Check that attributes like id and productName are never NULL.

  3. Attribute Uniqueness: Ensure that certain attributes, such as id, are unique.

  4. Value Range: Validate that attributes like priority and numViews fall within expected ranges.

  5. Pattern Matching: Verify that descriptions contain URLs when expected.

  6. Statistical Properties: Ensure that statistical properties of numeric attributes, such as the median, meet the expected criteria.


Here's how you can implement these checks using Deequ:


    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

    val verificationResult = VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "unit testing my data")
          .hasSize(_ == 5) // we expect 5 rows
          .isComplete("id") // should never be NULL
          .isUnique("id") // should not contain duplicates
          .isComplete("productName") // should never be NULL
          // should only contain the values "high" and "low"
          .isContainedIn("priority", Array("high", "low"))
          .isNonNegative("numViews") // should not contain negative values
          // at least half of the descriptions should contain a url
          .containsURL("description", _ >= 0.5)
          // half of the items should have less than 10 views
          .hasApproxQuantile("numViews", 0.5, _ <= 10))
      .run()


Interpreting Results

After you run these checks, Deequ translates them into a series of Spark jobs, which it executes to compute metrics on the data. It then invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see whether the constraints hold on the data. We can inspect the verificationResult object to see if the test found errors:


    import com.amazon.deequ.constraints.ConstraintStatus

    if (verificationResult.status == CheckStatus.Success) {
      println("The data passed the test, everything is fine!")
    } else {
      println("We found errors in the data:\n")

      val resultsForAllConstraints = verificationResult.checkResults
        .flatMap { case (_, checkResult) => checkResult.constraintResults }

      resultsForAllConstraints
        .filter { _.status != ConstraintStatus.Success }
        .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
    }


If we run the example, we get the following output:


    We found errors in the data:

    CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
    PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!


The test found that our assumptions were violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (i.e., 40%) of the values of the description attribute contain a URL. Fortunately, we ran a test and found the errors; somebody should fix the data immediately!
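To see where the 0.8 and 0.4 come from, the two failing metrics can be recomputed by hand. The following is a minimal plain-Python sketch, not Deequ code: the tuples mirror the toy records above, and the URL pattern is a simplified stand-in chosen for illustration (Deequ uses its own pattern).

```python
import re

# Toy records mirroring the Item rows above:
# (id, productName, description, priority, numViews)
rows = [
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", None, 0),
    (3, None, None, "low", 5),
    (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
    (5, "Thingy E", None, "high", 12),
]

# Simplified URL pattern, an assumption made for this illustration only
URL_RE = re.compile(r"https?://\S+")


def completeness(values):
    """Fraction of values that are not None."""
    return sum(v is not None for v in values) / len(values)


def url_fraction(values):
    """Fraction of all rows whose value contains a URL (None never matches)."""
    return sum(bool(v and URL_RE.search(v)) for v in values) / len(values)


product_name_completeness = completeness([r[1] for r in rows])
description_url_fraction = url_fraction([r[2] for r in rows])

print(product_name_completeness)  # 0.8, so isComplete("productName") fails
print(description_url_fraction)   # 0.4, below the required 0.5
```

Note that the URL fraction is taken over all five rows, including those with a missing description, which is why two matches yield 40%.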

QA Checks with Statistical Methods

While Deequ offers a robust framework for data validation, integrating statistical methods can further enhance your QA checks, especially if you are dealing with aggregated metrics of a dataset. Let's see how you can employ statistical methods to monitor and ensure data quality.

Record Count Tracking

Consider a business scenario where an ETL (Extract, Transform, Load) process produces N records in a daily scheduled job. Support teams may want to set up QA checks that raise an alert if there is a significant deviation in the record count. For instance, if the process typically generates between 9,500 and 10,500 records daily over two months, any significant increase or decrease could indicate an issue with the underlying data.


We can use a statistical method to define the threshold at which the process should raise an alert to the support team. Below is an illustration of record count tracking over two months:

[Figure: daily record count over two months]

To analyze this, we can transform the record count data to observe the day-to-day changes. These changes generally oscillate around zero, as shown in the following chart:

[Figure: day-to-day change in record count]

When we represent this rate of change with a normal distribution, it forms a bell curve, indicating that the data is normally distributed. The expected change is around 0%, with a standard deviation of 2.63%.

[Figure: distribution of day-to-day change in record count]

This analysis suggests that the day-to-day change in record count typically falls within the -5.26% to +5.25% range (about two standard deviations either side of the mean) with 90% confidence. Based on this, you can set a rule to raise an alert if the record count deviates beyond this range, ensuring timely intervention.
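The rule described here can be sketched in a few lines of plain Python. This is a hedged illustration, not a production monitor: the record counts in `history` are hypothetical, the function names are made up for this example, and `k=2.0` mirrors the two-standard-deviation band used above. Under a strictly normal distribution such a band covers roughly 95% of days, so it comfortably includes the 90% quoted in the text.

```python
from statistics import mean, stdev


def pct_changes(counts):
    """Day-over-day change in percent for a list of daily record counts."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(counts, counts[1:])]


def alert_band(changes, k=2.0):
    """Return (low, high) = mean -/+ k sample standard deviations."""
    mu, sigma = mean(changes), stdev(changes)
    return mu - k * sigma, mu + k * sigma


def should_alert(prev_count, new_count, band):
    """True if today's percentage change falls outside the band."""
    change = 100.0 * (new_count - prev_count) / prev_count
    low, high = band
    return not (low <= change <= high)


# Hypothetical daily record counts, invented for this illustration
history = [10000, 10150, 9900, 10200, 9800, 10050, 10300, 9950, 10100, 9850]
band = alert_band(pct_changes(history))

small_move = should_alert(history[-1], 9900, band)  # about +0.5%
large_drop = should_alert(history[-1], 7000, band)  # about -29%
print(band, small_move, large_drop)
```

In practice the band would be recomputed on a rolling window (for example, the last two months) so that it follows gradual shifts in volume.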

Attribute Coverage

Attribute coverage refers to the ratio of non-NULL values to the total record count for a dataset snapshot. For example, if 8 out of 100 records have a NULL value for a particular attribute, the coverage for that attribute is 92%.
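As a quick worked example of this definition, here is a small Python sketch. The `coverage` helper and the sample snapshot are illustrative assumptions, with `None` standing in for NULL.

```python
def coverage(values):
    """Percentage of values in a snapshot that are not None (i.e. not NULL)."""
    if not values:
        raise ValueError("coverage is undefined for an empty snapshot")
    non_null = sum(v is not None for v in values)
    return 100.0 * non_null / len(values)


# 100 records, 8 of which have no value for the attribute
snapshot = ["some description"] * 92 + [None] * 8
print(coverage(snapshot))  # 92.0
```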


Let's review another business case, with an ETL process that generates a product table snapshot daily. We want to monitor the coverage of the product description attribute. If the coverage falls below a certain threshold, an alert should be raised for the support team. Below is a visual representation of attribute coverage for product descriptions over two months:

[Figure: product description coverage over two months]

By analyzing the absolute day-to-day differences in coverage, we observe that the changes oscillate around zero:

[Figure: day-to-day difference in product description coverage]

Representing this data as a normal distribution shows that it is normally distributed, with an expected change of around 0% and a standard deviation of 2.45%.

[Figure: distribution of day-to-day difference in product description coverage]

As we can see, for this dataset the day-to-day change in product description coverage typically ranges from -4.9% to +4.9% (two standard deviations) with 90% confidence. Based on this indicator, we can set a rule to raise an alert if the coverage deviates beyond this range.
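A minimal sketch of such a rule, using the figures quoted above (expected change of 0 and a standard deviation of 2.45 percentage points). The function name and the sample coverage values are hypothetical, and the change is measured as an absolute difference in percentage points, as in the analysis above.

```python
def coverage_change_alert(prev_coverage, new_coverage,
                          mean_change=0.0, sd=2.45, k=2.0):
    """True if the change in coverage (percentage points) leaves mean +/- k * sd."""
    change = new_coverage - prev_coverage
    return abs(change - mean_change) > k * sd


normal_day = coverage_change_alert(92.0, 90.5)  # change of -1.5 points
bad_day = coverage_change_alert(92.0, 80.0)     # change of -12 points
print(normal_day, bad_day)
```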

QA Checks with Time Series Algorithms

If you work with datasets that show significant variations due to factors like seasonality or trends, traditional statistical methods might trigger false alerts. Time series algorithms offer a more refined approach, improving the accuracy and reliability of your QA checks.


To produce more sensible alerts, you can use either the Autoregressive Integrated Moving Average (ARIMA) model or the Holt-Winters method. The former is good enough for datasets with trends, while the latter lets us deal with datasets that have both trend and seasonality. The Holt-Winters method uses components for level, trend, and seasonality, which allows it to adapt flexibly to changes over time.
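To make the level, trend, and seasonality components concrete, here is a small standard-library sketch of additive Holt-Winters smoothing. It is an illustration under stated assumptions rather than a replacement for a library implementation: the smoothing parameters are fixed by hand instead of being fitted, the initial states come from a naive estimate over the first two seasons, and the weekly series with an injected spike is synthetic.

```python
from statistics import stdev


def holt_winters_additive(y, m, alpha=0.2, beta=0.1, gamma=0.1):
    """Return one-step-ahead fitted values from additive Holt-Winters.

    y is a list of observations and m the season length (len(y) >= 2 * m).
    alpha, beta, gamma are hand-picked smoothing weights, not fitted ones.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons to initialise")

    # Naive initial states from the first two seasons (an illustrative choice)
    level = sum(y[:m]) / m
    trend = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)
    season = [y[i] - level for i in range(m)]

    fitted = []
    for t, obs in enumerate(y):
        s = season[t % m]
        fitted.append(level + trend + s)  # forecast made before seeing obs
        new_level = alpha * (obs - s) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % m] = gamma * (obs - level - trend) + (1 - gamma) * s
        level, trend = new_level, new_trend
    return fitted


def find_anomalies(y, fitted, n_sigmas=3.0):
    """Indices whose residual exceeds n_sigmas sample standard deviations."""
    residuals = [obs - fit for obs, fit in zip(y, fitted)]
    threshold = n_sigmas * stdev(residuals)
    return [t for t, r in enumerate(residuals) if abs(r) > threshold]


# Synthetic data: linear trend plus a weekly pattern, with one injected spike
weekly = [5, 3, 0, -2, -4, -1, -1]
sales = [100 + 0.5 * t + weekly[t % 7] for t in range(70)]
sales[40] += 50

anomalies = find_anomalies(sales, holt_winters_additive(sales, m=7))
print(anomalies)
```

Each update blends the newest observation with the previous state: the level absorbs the deseasonalized value, the trend follows changes in the level, and each seasonal slot is revised once per cycle.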


Let's mock-model daily sales that exhibit both trend and seasonal patterns using Holt-Winters:

    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Load and preprocess the dataset
    # (assumes sales_data.csv has a 'date' column and a single value column)
    data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True)
    data = data.asfreq('D').ffill().squeeze('columns')

    # Fit the Holt-Winters model; seasonal_periods=365 models a yearly cycle,
    # so the series should span multiple years of daily data
    model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=365)
    fit = model.fit()

    # Compare the in-sample fit with the observations and flag large residuals
    forecast = fit.fittedvalues
    residuals = data - forecast
    threshold = 3 * residuals.std()
    anomalies = residuals[residuals.abs() > threshold]

    print("Anomalies detected:")
    print(anomalies)


Using this method, you can detect significant deviations that might indicate data quality issues, giving you a more nuanced approach to QA checks.


I hope this article helps you implement QA checks for your large datasets efficiently. By using the Deequ library and integrating statistical methods and time series algorithms, you can ensure data integrity and reliability, ultimately improving your data management practices.


Implementing the techniques described above will help you prevent potential issues in downstream applications and improve the overall quality of your data workflows.
