
Outlier Detection: What You Need to Know

by Natalia Ogneva, 4m, 2024/04/23

Too Long; Didn't Read

Analysts often encounter outliers in data during their work. Decisions are usually based on the sample mean, which is very sensitive to outliers. Handling outliers is essential for making the right decision. Let's look at several simple and fast approaches for working with unusual values.


Analysts often encounter outliers in data during their work, for example when analyzing A/B tests, building predictive models, or tracking trends. Decisions are usually based on the sample mean, which is very sensitive to outliers and can shift dramatically because of them. It is therefore crucial to handle outliers in order to make the correct decision.

Let's look at several simple and fast approaches for working with unusual values.

Problem Formulation

Suppose you need to analyze an experiment using average order value as the primary metric. Let's say that our metric usually follows a normal distribution. We also know that the metric distribution in the test group differs from the one in the control group: the mean of the distribution is 10 in control and 12 in test. The standard deviation in both groups is 3.

However, both samples contain outliers that skew the sample mean and the sample standard deviation.

    import numpy as np

    N = 1000
    mean_1 = 10
    std_1 = 3
    mean_2 = 12
    std_2 = 3

    # control group: normal sample plus 50 high outliers in [20, 30)
    x1 = np.concatenate((np.random.normal(mean_1, std_1, N),
                         10 * np.random.random_sample(50) + 20))
    # test group: normal sample plus 50 low outliers in [1, 5)
    x2 = np.concatenate((np.random.normal(mean_2, std_2, N),
                         4 * np.random.random_sample(50) + 1))
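To see how strongly a few extreme values can pull these statistics, the sketch below compares a clean normal sample with the same sample after 50 high outliers are added, mirroring the control group above. The fixed seed, the `describe` helper, and the variable names are assumptions made for illustration; they are not part of the original experiment.

```python
import numpy as np

# Fixed seed chosen only so the illustration is reproducible (an assumption).
rng = np.random.default_rng(42)

N = 1000
clean = rng.normal(10, 3, N)            # control-like sample without outliers
outliers = 10 * rng.random(50) + 20     # 50 values drawn uniformly from [20, 30)
contaminated = np.concatenate((clean, outliers))


def describe(sample):
    """Return (mean, standard deviation) of a sample as plain floats."""
    return float(np.mean(sample)), float(np.std(sample))


clean_mean, clean_std = describe(clean)
cont_mean, cont_std = describe(contaminated)

print(f"clean:        mean={clean_mean:.2f}, std={clean_std:.2f}")
print(f"contaminated: mean={cont_mean:.2f}, std={cont_std:.2f}")
```

Fewer than 5% of the observations are outliers here, yet both the mean and the standard deviation of the contaminated sample move away from the parameters used to generate the clean part.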

Note that the metric considered here can have outliers on both sides. If your metric can only have outliers on one side, the methods can easily be adapted for that case.

Cut off Tails

The simplest approach is to cut off all observations below the 5th percentile and above the 95th percentile. The drawback is that we lose 10% of the data. In return, the distributions look much better formed, and the sample moments are closer to the distribution moments.

    import numpy as np

    # control group: keep values strictly between the 5th and 95th percentiles
    x1_5pct = np.percentile(x1, 5)
    x1_95pct = np.percentile(x1, 95)
    x1_cutted = [i for i in x1 if x1_5pct < i < x1_95pct]

    # test group: same cut-off
    x2_5pct = np.percentile(x2, 5)
    x2_95pct = np.percentile(x2, 95)
    x2_cutted = [i for i in x2 if x2_5pct < i < x2_95pct]
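One way to check the cost of this cut-off is to measure the share of observations it drops. The following is a minimal sketch, assuming a hypothetical helper `trim_by_percentile` and synthetic data generated like the control group; it uses a NumPy boolean mask rather than a list comprehension, and the seed is arbitrary.

```python
import numpy as np


def trim_by_percentile(sample, lower_pct=5, upper_pct=95):
    """Keep only observations strictly between the two percentiles."""
    sample = np.asarray(sample, dtype=float)
    low, high = np.percentile(sample, [lower_pct, upper_pct])
    return sample[(sample > low) & (sample < high)]


# Synthetic control-like data (assumption: same recipe as in the article).
rng = np.random.default_rng(0)
x = np.concatenate((rng.normal(10, 3, 1000), 10 * rng.random(50) + 20))

x_trimmed = trim_by_percentile(x)
share_lost = 1 - len(x_trimmed) / len(x)

print(f"share of observations removed: {share_lost:.3f}")
print(f"mean before: {x.mean():.2f}, mean after: {x_trimmed.mean():.2f}")
```

Because the cut-off is defined by percentiles, the share removed is close to 10% regardless of how many outliers the sample really contains.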


Another way is to exclude observations that fall outside a specific range. The lower band equals the 25th percentile minus 1.5 times the interquartile range, and the upper band equals the 75th percentile plus 1.5 times the interquartile range. Here, we lose only 0.7% of the information. The distributions look much better formed than the initial ones, and the sample moments are even closer to the distribution moments.

    import numpy as np

    # control group: bands at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
    q1_1, q3_1 = np.percentile(x1, [25, 75])
    iqr_1 = q3_1 - q1_1  # interquartile range
    low_band_1 = q1_1 - 1.5 * iqr_1
    high_band_1 = q3_1 + 1.5 * iqr_1
    x1_cutted = [i for i in x1 if low_band_1 < i < high_band_1]

    # test group: same rule
    q1_2, q3_2 = np.percentile(x2, [25, 75])
    iqr_2 = q3_2 - q1_2
    low_band_2 = q1_2 - 1.5 * iqr_2
    high_band_2 = q3_2 + 1.5 * iqr_2
    x2_cutted = [i for i in x2 if low_band_2 < i < high_band_2]
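The earlier note about metrics with outliers on only one side can be sketched with the same quartile-based bands. The helper below is hypothetical: the name `cut_tails`, the `side` parameter, and the default multiplier `k=1.5` are assumptions for illustration, not something defined in the article.

```python
import numpy as np


def cut_tails(sample, k=1.5, side="both"):
    """Drop observations beyond the quartile-based bands on the chosen side(s).

    side: "both", "lower", or "upper" (hypothetical parameter).
    """
    if side not in ("both", "lower", "upper"):
        raise ValueError("side must be 'both', 'lower', or 'upper'")
    sample = np.asarray(sample, dtype=float)
    q1, q3 = np.percentile(sample, [25, 75])
    iqr = q3 - q1
    # A band that is not requested is pushed to infinity, so it removes nothing.
    low = q1 - k * iqr if side in ("both", "lower") else -np.inf
    high = q3 + k * iqr if side in ("both", "upper") else np.inf
    return sample[(sample > low) & (sample < high)]


# Synthetic data with outliers only in the upper tail (arbitrary seed).
rng = np.random.default_rng(7)
x = np.concatenate((rng.normal(10, 3, 1000), 10 * rng.random(50) + 20))

kept_upper = cut_tails(x, side="upper")
kept_both = cut_tails(x, side="both")
print(len(x), len(kept_upper), len(kept_both))
```

With `side="upper"`, legitimate low values are never discarded, which may matter when the metric is bounded below or when only large values are suspicious.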

Bootstrap

The second approach we consider here is the bootstrap. In this approach, the mean is constructed as the mean of the subsample means. In our example, the mean in the control group equals 10.35, and in the test group it equals 11.78. It is still a better result compared with additional data processing.

    import numpy as np
    import pandas as pd

    def create_bootstrap_samples(
        sample_list: np.ndarray,
        sample_size: int,
        n_samples: int
    ) -> pd.Series:
        # create a list for sample means
        sample_means = []

        # loop n_samples times
        for i in range(n_samples):
            # create a bootstrap sample of sample_size with replacement
            bootstrap_sample = pd.Series(sample_list).sample(n=sample_size, replace=True)
            # calculate the bootstrap sample mean
            sample_mean = bootstrap_sample.mean()
            # add this sample mean to the sample means list
            sample_means.append(sample_mean)

        return pd.Series(sample_means)

    (create_bootstrap_samples(x1, len(x1), 1000).mean(),
     create_bootstrap_samples(x2, len(x2), 1000).mean())
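The same resampling idea can also be written with NumPy alone by drawing all resample indices at once. This is a sketch under stated assumptions, not the article's implementation: the function name `bootstrap_means`, the seeds, and the percentile interval at the end are illustrative choices.

```python
import numpy as np


def bootstrap_means(sample, n_samples=1000, seed=None):
    """Return the means of n_samples resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # One row of random indices per bootstrap sample, each of the original size.
    idx = rng.integers(0, len(sample), size=(n_samples, len(sample)))
    return sample[idx].mean(axis=1)


# Synthetic control-like data (assumption: same recipe as in the article).
rng = np.random.default_rng(1)
x = np.concatenate((rng.normal(10, 3, 1000), 10 * rng.random(50) + 20))

means = bootstrap_means(x, n_samples=1000, seed=2)
estimate = means.mean()
# Percentile interval: the range covering the central 95% of resampled means.
ci_low, ci_high = np.percentile(means, [2.5, 97.5])

print(f"bootstrap mean: {estimate:.2f}, 95% interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

The spread of the resampled means gives a sense of how much the estimate could move from one sample to another, which is useful context when comparing the control and test groups.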

Conclusion

Outlier detection and processing are important for making the right decision. Even these three quick and straightforward techniques can help you check the data before analysis.

However, it is essential to remember that detected outliers may be unusual values or a sign of the novelty effect. But that is another story :)