Outlier Detection: What You Need to Know

by Natalia Ogneva (@nataliaogneva) | 4m | 2024/04/23

Too Long; Didn't Read

Analysts often run into outliers in their data. Decisions are usually based on the sample mean, which is very sensitive to outliers, so handling outliers properly is crucial for reaching the right decision. Let's look at several simple and fast approaches to working with unusual values.

Analysts often run into outliers in their data, for example during A/B test analysis, when building predictive models, or when tracking trends. Decisions are usually based on the sample mean, which is very sensitive to outliers: a few extreme values can change it dramatically. Handling outliers properly is therefore crucial for reaching the right decision.

Let's look at several simple and fast approaches to working with unusual values.

Problem Formulation

Suppose you need to analyze an experiment that uses average order value as its primary metric. Let's say the metric is normally distributed. We also know that its distribution in the test group differs from the one in the control group: the mean of the distribution is 10 in control and 12 in test. The standard deviation is 3 in both groups.


However, both samples contain outliers, which skew the sample means and the sample standard deviations.

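    import numpy as np

    N = 1000
    mean_1 = 10
    std_1 = 3
    mean_2 = 12
    std_2 = 3

    # control: N normal values plus 50 high outliers drawn uniformly from [20, 30)
    x1 = np.concatenate((np.random.normal(mean_1, std_1, N), 10 * np.random.random_sample(50) + 20))
    # test: N normal values plus 50 low outliers drawn uniformly from [1, 5)
    x2 = np.concatenate((np.random.normal(mean_2, std_2, N), 4 * np.random.random_sample(50) + 1))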
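To make the skew concrete, the sketch below compares summary statistics of a clean sample with the same sample after outliers are appended. It is only an illustration under assumptions: the fixed seed, the helper name `summarize`, and the choice to also report the median are not part of the original setup.

```python
import numpy as np

# A fixed seed is an assumption made here so the run is reproducible.
rng = np.random.default_rng(42)

N = 1000
clean = rng.normal(10, 3, N)              # control-like data without outliers
outliers = 10 * rng.random(50) + 20       # 50 values drawn uniformly from [20, 30)
contaminated = np.concatenate((clean, outliers))


def summarize(sample):
    """Return the mean, median and sample standard deviation of a 1-D sample."""
    return {
        "mean": float(np.mean(sample)),
        "median": float(np.median(sample)),
        "std": float(np.std(sample, ddof=1)),
    }


clean_stats = summarize(clean)
contaminated_stats = summarize(contaminated)

print("clean:       ", clean_stats)
print("contaminated:", contaminated_stats)
```

With fewer than 5% of the observations being outliers, the mean and the standard deviation both move noticeably, while the median moves much less.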

Note that the metric under consideration could have outliers on both sides. If your metric can only have outliers on one side, the methods below are easy to adapt.

Cutting Off the Tails

The easiest method is to cut off all observations below the 5th percentile and above the 95th percentile. In this case, we lose 10% of the information as a cost. However, the distributions look better formed, and the sample moments are closer to the distribution moments.

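    import numpy as np

    # percentile cut-offs for the control sample
    x1_5pct = np.percentile(x1, 5)
    x1_95pct = np.percentile(x1, 95)
    # keep only values strictly between the 5th and 95th percentiles
    x1_cutted = [i for i in x1 if x1_5pct < i < x1_95pct]

    # same procedure for the test sample
    x2_5pct = np.percentile(x2, 5)
    x2_95pct = np.percentile(x2, 95)
    x2_cutted = [i for i in x2 if x2_5pct < i < x2_95pct]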
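The same cut can be written as a reusable helper that also covers the one-sided case mentioned earlier. This is a minimal sketch, not the author's code: the function name `trim_by_percentile`, its parameters, and the seeded demo data are assumptions made for illustration.

```python
import numpy as np


def trim_by_percentile(sample, lower_pct=5.0, upper_pct=95.0):
    """Keep values strictly between the given percentiles.

    Pass lower_pct=None or upper_pct=None to trim one tail only.
    """
    sample = np.asarray(sample, dtype=float)
    mask = np.ones(sample.shape, dtype=bool)
    if lower_pct is not None:
        mask &= sample > np.percentile(sample, lower_pct)
    if upper_pct is not None:
        mask &= sample < np.percentile(sample, upper_pct)
    return sample[mask]


# Demo data: assumed to mimic the control sample, with a fixed seed.
rng = np.random.default_rng(0)
x = np.concatenate((rng.normal(10, 3, 1000), 10 * rng.random(50) + 20))

both_tails = trim_by_percentile(x)                  # drop both tails
upper_only = trim_by_percentile(x, lower_pct=None)  # drop the upper tail only

print(len(x), len(both_tails), len(upper_only))
print(x.mean(), both_tails.mean(), upper_only.mean())
```

Boolean masking keeps the result as a NumPy array, so `.mean()` and `.std()` can be called on it directly.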


Another way is to exclude observations outside a specific range. The low band equals the 25th percentile minus one and a half times the interquartile range, and the high band equals the 75th percentile plus one and a half times the interquartile range. Here, we lose only 0.7% of the information. The distributions look better formed than the initial ones, and the sample moments are even closer to the distribution moments.

    import numpy as np

    # interquartile range (IQR) of the control sample, as described in the text
    q1_1, q3_1 = np.percentile(x1, [25, 75])
    iqr_1 = q3_1 - q1_1
    low_band_1 = q1_1 - 1.5 * iqr_1
    high_band_1 = q3_1 + 1.5 * iqr_1
    x1_cutted = [i for i in x1 if low_band_1 < i < high_band_1]

    # same bands for the test sample
    q1_2, q3_2 = np.percentile(x2, [25, 75])
    iqr_2 = q3_2 - q1_2
    low_band_2 = q1_2 - 1.5 * iqr_2
    high_band_2 = q3_2 + 1.5 * iqr_2
    x2_cutted = [i for i in x2 if low_band_2 < i < high_band_2]
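The interquartile bands described above can be wrapped into a small helper that also reports the share of observations dropped, which makes the cost of the filter easy to check on your own data. This is a hedged sketch: the names `iqr_bounds` and `drop_outside_iqr`, the tunable multiplier `k`, and the seeded demo data are assumptions, not part of the original article.

```python
import numpy as np


def iqr_bounds(sample, k=1.5):
    """Return (low, high) bands: Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = np.percentile(sample, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outside_iqr(sample, k=1.5):
    """Return the values inside the bands and the share of values removed."""
    sample = np.asarray(sample, dtype=float)
    low, high = iqr_bounds(sample, k)
    kept = sample[(sample > low) & (sample < high)]
    removed_share = 1 - kept.size / sample.size
    return kept, removed_share


# Demo data: assumed to mimic the control sample, with a fixed seed.
rng = np.random.default_rng(1)
x = np.concatenate((rng.normal(10, 3, 1000), 10 * rng.random(50) + 20))

kept, removed_share = drop_outside_iqr(x)
print(f"removed share: {removed_share:.3%}")
print(f"mean before: {x.mean():.2f}, after: {kept.mean():.2f}")
```

A larger `k` makes the filter more permissive. The value 1.5 is the one the text uses.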

Bootstrap

The second method we consider here is the bootstrap. In this approach, the mean is computed as the mean of the means of resampled subsamples. In our example, the mean in the control group equals 10.35, and in the test group it equals 11.78. It is still a better result compared to additional data processing.

    import numpy as np
    import pandas as pd

    def create_bootstrap_samples(
        sample_list: np.ndarray,
        sample_size: int,
        n_samples: int
    ) -> pd.Series:
        # create a list for sample means
        sample_means = []
        # loop n_samples times
        for i in range(n_samples):
            # create a bootstrap sample of sample_size with replacement
            bootstrap_sample = pd.Series(sample_list).sample(n=sample_size, replace=True)
            # calculate the bootstrap sample mean
            sample_mean = bootstrap_sample.mean()
            # add this sample mean to the sample means list
            sample_means.append(sample_mean)
        return pd.Series(sample_means)

    (create_bootstrap_samples(x1, len(x1), 1000).mean(),
     create_bootstrap_samples(x2, len(x2), 1000).mean())
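The same resampling idea can be sketched with NumPy alone, drawing all resamples in one call, and the resulting bootstrap means can also be used to read off a percentile interval around the mean. This is an illustrative variant under assumptions, not the author's implementation: the names `bootstrap_means` and `percentile_interval`, the 95% level, and the seeds are all hypothetical choices.

```python
import numpy as np


def bootstrap_means(sample, n_samples=1000, seed=None):
    """Return the means of n_samples resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # index matrix of shape (n_samples, len(sample)): one row per resample
    idx = rng.integers(0, sample.size, size=(n_samples, sample.size))
    return sample[idx].mean(axis=1)


def percentile_interval(means, level=0.95):
    """Return the central percentile interval of the bootstrap means."""
    alpha = (1 - level) / 2
    low, high = np.quantile(means, [alpha, 1 - alpha])
    return float(low), float(high)


# Demo data: assumed to mimic the control sample, with a fixed seed.
rng = np.random.default_rng(2)
x = np.concatenate((rng.normal(10, 3, 1000), 10 * rng.random(50) + 20))

means = bootstrap_means(x, n_samples=1000, seed=3)
low, high = percentile_interval(means)
print(f"bootstrap mean: {means.mean():.2f}, interval: ({low:.2f}, {high:.2f})")
```

Drawing an index matrix avoids the Python-level loop, at the cost of holding `n_samples * len(sample)` indices in memory at once.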

Conclusion

Detecting and handling outliers is important for making the right decision. At least three fast and straightforward approaches can now help you check your data before analysis.

However, it is essential to remember that the detected outliers could be genuinely unusual values or a sign of the novelty effect. But that's another story :)