QA Checks for Big Datasets with Deequ & Statistical Methods
by @akshayjain1986


by Akshay Jain, 7 min read, 2024/05/30

Too Long; Didn't Read

The Deequ library is an open-source data profiling and QA framework built on Spark. It lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Deequ also offers extensive metrics and anomaly detection capabilities that help you identify and proactively address data quality issues. Here is how you can implement these checks using Deequ.


One of the vital skills of an accomplished data professional is handling large datasets effectively while ensuring data quality and reliability. Data is the central and fundamental piece of any data system, and however strong your skills are in other aspects of our trade, this is one you cannot afford to overlook.


In this article, I explore robust techniques for performing QA checks on large datasets using the Deequ library and statistical methods. By combining the approaches described below, you will be able to maintain data integrity, improve your data management practices, and prevent potential issues in downstream applications.

QA Checks Using the Deequ Library

Why Deequ?

Ensuring data quality at scale is a daunting task, especially when you are dealing with billions of rows stored in distributed file systems or data warehouses. The Deequ library is an open-source data profiling and QA framework built on Spark, a modern and versatile tool designed to solve this problem. What sets it apart from similar tools is its seamless integration with Spark, which lets it use distributed processing power to handle large-scale datasets efficiently.


When you try it, you will see how its flexibility lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. In addition, Deequ offers extensive metrics and anomaly detection capabilities that help you identify and proactively address data quality issues. For data professionals working with large and dynamic datasets, Deequ is a Swiss-army-knife solution. Let's see how we can use it.

Setting Up Deequ

More details on setting up the Deequ library, and on use cases around data profiling, are covered in Deequ's own documentation. For the sake of simplicity, in this example we just generate a few toy records:


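    // Record type for the toy data (field names match the checks used later)
    case class Item(id: Long, productName: String, description: String,
                    priority: String, numViews: Long)

    val rdd = spark.sparkContext.parallelize(Seq(
      Item(1, "Thingy A", "awesome thing.", "high", 0),
      Item(2, "Thingy B", "available at http://thingb.com", null, 0),
      Item(3, null, null, "low", 5),
      Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
      Item(5, "Thingy E", null, "high", 12)))

    val data = spark.createDataFrame(rdd)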


Defining Data Assumptions

Most data applications come with implicit assumptions about data attributes, such as non-NULL values and uniqueness. With Deequ, these assumptions become explicit through unit tests. Here are some common checks:


  1. Row Count: Ensure the dataset contains a specific number of rows.

  2. Attribute Completeness: Check that attributes such as id and productName are never NULL.

  3. Attribute Uniqueness: Ensure that certain attributes, such as id, are unique.

  4. Value Range: Validate that attributes such as priority and numViews fall within expected ranges.

  5. Pattern Matching: Verify that descriptions contain URLs when expected.

  6. Statistical Properties: Ensure that the median of numeric attributes meets specific criteria.


Here is how you can implement these checks using Deequ:


    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

    val verificationResult = VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "unit testing my data")
          .hasSize(_ == 5) // we expect 5 rows
          .isComplete("id") // should never be NULL
          .isUnique("id") // should not contain duplicates
          .isComplete("productName") // should never be NULL
          // should only contain the values "high" and "low"
          .isContainedIn("priority", Array("high", "low"))
          .isNonNegative("numViews") // should not contain negative values
          // at least half of the descriptions should contain a url
          .containsURL("description", _ >= 0.5)
          // half of the items should have less than 10 views
          .hasApproxQuantile("numViews", 0.5, _ <= 10))
      .run()


Interpreting Results

After you run these checks, Deequ translates them into a series of Spark jobs, which it executes to compute metrics on the data. It then invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see whether the constraints hold on the data. We can inspect the "verificationResult" object to see if the test found errors:


    import com.amazon.deequ.constraints.ConstraintStatus

    if (verificationResult.status == CheckStatus.Success) {
      println("The data passed the test, everything is fine!")
    } else {
      println("We found errors in the data:\n")

      val resultsForAllConstraints = verificationResult.checkResults
        .flatMap { case (_, checkResult) => checkResult.constraintResults }

      resultsForAllConstraints
        .filter { _.status != ConstraintStatus.Success }
        .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
    }


If we run the example, we get the following output:


    We found errors in the data:

    CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
    PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!


The test showed that our assumptions were violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (i.e., 40%) of the values of the description attribute contain a URL. Fortunately, we ran a test and found the errors; somebody should fix the data right away!
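To see where the two failing values come from, the same ratios can be recomputed by hand. The sketch below is plain Python, not Deequ: it copies the five toy records from the setup step and assumes a deliberately simple URL pattern (`https?://\S+`), which may differ from the pattern Deequ uses internally. The helper names `completeness` and `pattern_share` are hypothetical and only illustrate the arithmetic.

```python
import re

# The five toy records: (id, productName, description, priority, numViews)
records = [
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", None, 0),
    (3, None, None, "low", 5),
    (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
    (5, "Thingy E", None, "high", 12),
]

# Simplified URL pattern (an assumption made for this illustration)
URL = re.compile(r"https?://\S+")


def completeness(values):
    """Share of values that are not NULL (None)."""
    return sum(v is not None for v in values) / len(values)


def pattern_share(values, pattern):
    """Share of all rows whose value matches the pattern; NULLs count as misses."""
    hits = sum(v is not None and pattern.search(v) is not None for v in values)
    return hits / len(values)


name_completeness = completeness([r[1] for r in records])
url_share = pattern_share([r[2] for r in records], URL)

# The check demanded full completeness and a URL share of at least 0.5
print(name_completeness, name_completeness == 1.0)
print(url_share, url_share >= 0.5)
```

Both assertions fail for the same reason the Deequ run reported: one product name is missing, and only two of the five descriptions carry a link.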

QA Checks With Statistical Methods

While Deequ offers a robust framework for data validation, integrating statistical methods can further strengthen your QA checks, especially when you are dealing with aggregated metrics of a dataset. Let's see how you can use statistical methods to monitor and ensure data quality.

Record Count Tracking

Consider a business scenario in which an ETL (Extract, Transform, Load) process produces N records in a job scheduled to run daily. Support teams may want to set up QA checks that raise an alert when the record count deviates significantly. For instance, if the job typically produces between 9,500 and 10,500 records a day over two months, any significant increase or decrease could point to a problem with the underlying data.


We can use a statistical method to define the threshold at which the process should raise an alert to the support team. Below is an illustration of record count tracking over two months:

[Chart: daily record count over two months]

To analyze this, we can transform the record count data to look at the day-to-day changes. These changes mostly oscillate around zero, as shown in the following chart:

[Chart: day-to-day change in record count]

When we plot this rate of change against a normal distribution, it forms a bell curve, indicating that the data is normally distributed. The expected change is around 0%, with a standard deviation of 2.63%.

[Chart: distribution of day-to-day changes in record count]

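This analysis suggests that the day-to-day change in record count typically falls between -5.26% and +5.25%, which is about two standard deviations either side of the mean. Under a normal distribution that band holds roughly 95% of daily changes, so it comfortably meets a 90% confidence level. Based on this, you can define a rule that raises an alert whenever the record count deviates beyond this range, ensuring timely intervention.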
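The steps above (percentage change, mean and standard deviation, alert band) can be sketched with Python's standard library. The daily counts are invented for illustration and the function names are hypothetical. The band is taken as two standard deviations around the mean change, since the quoted ±5.26% is close to 2 × 2.63%; `NormalDist` is used only to show how much of a normal distribution such a band covers.

```python
from statistics import NormalDist, mean, stdev


def pct_changes(counts):
    """Day-over-day change in percent."""
    return [100.0 * (b - a) / a for a, b in zip(counts, counts[1:])]


def alert_band(changes, k=2.0):
    """Band of k standard deviations around the mean change."""
    mu, sigma = mean(changes), stdev(changes)
    return mu - k * sigma, mu + k * sigma


def band_coverage(k):
    """Share of a normal distribution lying within k standard deviations."""
    return 2 * NormalDist().cdf(k) - 1


def should_alert(prev_count, new_count, band):
    """True when the new day's change falls outside the band."""
    change = 100.0 * (new_count - prev_count) / prev_count
    return not (band[0] <= change <= band[1])


# Hypothetical daily record counts (not the article's data)
history = [10000, 10210, 9950, 10120, 9800, 10050, 10300, 9900, 10080, 9970]
band = alert_band(pct_changes(history))

print("band (%):", band)
print("coverage of a 2-sigma band:", band_coverage(2.0))
print("alert on 8000 records:", should_alert(history[-1], 8000, band))
print("alert on 10000 records:", should_alert(history[-1], 10000, band))
```

A different multiplier can be passed as `k` if a tighter or looser band is wanted; `NormalDist().inv_cdf` gives the multiplier for a chosen two-sided confidence level.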

Attribute Coverage Tracking

Attribute coverage is the ratio of non-NULL values to the total record count for a dataset snapshot. For example, if 8 out of 100 records have a NULL value for a particular attribute, the coverage for that attribute is 92%.
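As a quick illustration of that definition, here is a minimal Python sketch. The `coverage` helper and the sample column are hypothetical; NULL is represented by `None`.

```python
def coverage(values):
    """Ratio of non-NULL (non-None) values to the total record count."""
    if not values:
        raise ValueError("coverage is undefined for an empty snapshot")
    return sum(v is not None for v in values) / len(values)


# 100 records, 8 of which have a NULL description
snapshot = ["some description"] * 92 + [None] * 8
print(f"{coverage(snapshot):.0%}")
```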


Let's look at another business case, with an ETL process that generates a product table snapshot every day. We want to monitor the coverage of the product description attributes. If the coverage falls below a certain threshold, an alert should be raised for the support team. Below is a visual representation of attribute coverage for product descriptions over two months:

[Chart: product description coverage over two months]

By analyzing the absolute day-to-day differences in coverage, we see that the changes oscillate around zero:

[Chart: day-to-day difference in coverage]

Plotting this data against a normal distribution shows that it is normally distributed, with an expected change of around 0% and a standard deviation of 2.45%.

[Chart: distribution of day-to-day differences in coverage]

As we can see, for this dataset the day-to-day change in product description coverage typically stays between -4.9% and +4.9%, about two standard deviations (2 × 2.45%) either side of zero, which comfortably meets a 90% confidence level. Based on this indicator, we can set a rule that raises an alert if the change in coverage goes beyond this range.
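A rule like this can be sketched in a few lines of plain Python. The coverage history below is made up for illustration, the function names are hypothetical, and the band is assumed to be two standard deviations around the mean difference, in line with ±4.9% = 2 × 2.45%.

```python
from statistics import mean, stdev


def point_diffs(coverages):
    """Absolute day-to-day differences, in percentage points."""
    return [b - a for a, b in zip(coverages, coverages[1:])]


def coverage_alert(history, today, k=2.0):
    """True when today's change from the last snapshot leaves the k-sigma band."""
    diffs = point_diffs(history)
    mu, sigma = mean(diffs), stdev(diffs)
    change = today - history[-1]
    return abs(change - mu) > k * sigma


# Hypothetical coverage (%) of the description attribute in past daily snapshots
history = [92.0, 93.5, 91.0, 94.0, 92.5, 90.5, 93.0, 91.5]

print(coverage_alert(history, today=80.0))  # a large drop in coverage
print(coverage_alert(history, today=92.5))  # an ordinary fluctuation
```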

QA Checks With Time Series Algorithms

When you work with datasets that show significant variation due to factors such as seasonality or trends, traditional statistical methods can trigger false alerts. Time series algorithms offer a more refined approach, improving the accuracy and reliability of your QA checks.


To produce more sensible alerts, you can use either the Autoregressive Integrated Moving Average (ARIMA) model or the Holt-Winters method. The former is good enough for datasets with trends, while the latter lets us deal with datasets that have both trend and seasonality. Holt-Winters uses components for level, trend, and seasonality, which lets it adapt flexibly to changes over time.
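To make the three components concrete, here is a minimal from-scratch sketch of additive Holt-Winters smoothing in plain Python. It is an illustration under stated assumptions, not the statsmodels implementation: the smoothing weights `alpha`, `beta`, and `gamma` are fixed by hand rather than fitted, the initialisation is a simple one based on the first two seasons, and the series is synthetic (a linear trend plus a weekly cycle with one injected spike).

```python
import math
from statistics import mean, stdev


def holt_winters_additive(series, period, alpha=0.2, beta=0.05, gamma=0.1):
    """One-step-ahead forecasts for series[period:] using additive components."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons to initialise")
    first = mean(series[:period])
    second = mean(series[period:2 * period])
    trend = (second - first) / period      # average change per step
    mid = (period - 1) / 2                 # centre of the first season
    # Seasonal offsets: first season minus its own trend line
    season = [series[i] - (first + (i - mid) * trend) for i in range(period)]
    level = first + mid * trend            # level at the end of the first season
    forecasts = []
    for t in range(period, len(series)):
        s = season[t % period]
        forecasts.append(level + trend + s)
        new_level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (series[t] - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts


def flag_anomalies(series, period, k=3.0):
    """Indices whose forecast error exceeds k standard deviations of all errors."""
    forecasts = holt_winters_additive(series, period)
    residuals = [y - f for y, f in zip(series[period:], forecasts)]
    threshold = k * stdev(residuals)
    return [period + i for i, r in enumerate(residuals) if abs(r) > threshold]


# Synthetic daily values: upward trend + weekly seasonality + one spike on day 60
sales = [100 + 0.5 * t + 10 * math.sin(2 * math.pi * t / 7) for t in range(84)]
sales[60] += 40

anomalies = flag_anomalies(sales, period=7)
print(anomalies)
```

Because the forecast already accounts for the trend and the weekly cycle, ordinary ups and downs are not flagged; only the injected spike stands out against the residuals.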


Let's mock-model daily sales that exhibit both trend and seasonal patterns using Holt-Winters:

    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Load and preprocess the dataset (a 'date' column plus a single value column)
    data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True)
    # Work on a Series with a gap-free daily index; forward-fill missing days
    sales = data.squeeze('columns').asfreq('D').ffill()

    # Fit the Holt-Winters model (a yearly cycle needs at least two years of daily data)
    model = ExponentialSmoothing(sales, trend='add', seasonal='add', seasonal_periods=365)
    fit = model.fit()

    # Compare the in-sample fitted values with the observations and flag large residuals
    forecast = fit.fittedvalues
    residuals = sales - forecast
    threshold = 3 * residuals.std()
    anomalies = residuals[residuals.abs() > threshold]

    print("Anomalies detected:")
    print(anomalies)


Using this method, you can detect significant deviations that may indicate data quality issues, which gives you a more nuanced approach to QA checks.


I hope this article helps you implement QA checks for your large datasets efficiently. By using the Deequ library and integrating statistical methods and time series algorithms, you can ensure data integrity and reliability, and ultimately improve your data management practices.


Applying the techniques described above will help you prevent potential issues in downstream applications and improve the overall quality of your data workflows.