We've all come across warnings when visiting suspicious websites. Your browser or search engine might even block you from entering, displaying a message that this site may harm your device. But what if the site you're trying to visit is not flagged as malicious? According to SiteLock's 2022 Security Report, 92% of infected websites are not blacklisted by search engines. This means that businesses and individuals are vulnerable to attack when they visit these sites.

There are a number of reasons why search engines are missing infected sites. Firstly, it can take weeks or even months for a website to be identified as malicious. This is because attackers are constantly changing their tactics to evade detection. Secondly, many businesses don't realize their site has been hacked until it's too late. And thirdly, even if a website is flagged, there's no guarantee that users will avoid it.

How AI can secure the web

So what can be done to protect businesses and users from these threats? Just as cybercriminals use AI to automate their attacks, so too can we use AI to defend businesses. This isn't merely theory; an IEEE analysis of AI-based malware detection techniques concluded that they "provide significant advantages," such as in terms of accuracy, speed, and scalability.

For example, SafeDNS uses "continuous machine learning," achieving 98% precision in detecting malware. They use a "database of malware" to fuel machine learning models that analyze data to look for new patterns of behavior that could indicate a threat. This allows them to identify threats quickly and effectively, before they can do any damage.

If we want to stay one step ahead of cybercriminals, we need to use AI to defend our businesses. Recent research is a wake-up call - it's time to take action and invest in AI-powered solutions.

Detecting Malware - A Python Proof of Concept

There are many ways to detect and protect against malware.
In this section, we'll take a look at one such method: using Python to detect malware based on a dataset of executable files. View the full associated code here.

The dataset we'll be using is Kaggle's "Malware Executable Detection" dataset. It's made up of 373 samples of executable files, 301 of which are malicious and 72 of which are non-malicious. As you can see, the dataset is imbalanced, with regular files outnumbered by malware files.

There are 531 features represented in the dataset, from F1 to F531, and a label column stating whether the file is malicious or non-malicious. We won't be using all of these features, but we'll be using a variety of them to build our models.

We'll start by importing the necessary libraries for our demo. We'll be using the pandas, numpy, and scikit-learn libraries:

```python
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, accuracy_score, confusion_matrix, recall_score, precision_score, f1_score, auc, roc_auc_score
```

Next, we'll load in the dataset:

```python
df = pd.read_csv('uci_malware_detection.csv')
```

With the dataset loaded, let's go ahead and split it into training and testing sets. We'll also map the labels from strings to numbers and remove duplicates:

```python
# Encode the string labels as integers: malicious -> 0, non-malicious -> 1
df['Label'] = df['Label'].map({'malicious': 0, 'non-malicious': 1})
# keep=False drops every copy of a duplicated row, not just the repeats
df = df.drop_duplicates(keep=False)

# Separate features from the label and hold out 20% of the rows for testing
X, y = df.drop("Label", axis=1), df["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

We're now ready to build our models. We'll be using a simple logistic regression model:

```python
# max_iter must be a plain integer; writing it as 1,000 is a syntax error
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
```

We can now evaluate our model's performance on the testing set:

```python
# Mean accuracy on the test set (only displayed when run in a notebook cell)
lr_model.score(X_test, y_test)
y_pred = lr_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
# Note: the ROC-AUC here is computed from hard 0/1 predictions, not probabilities
print('ROC-AUC score', roc_auc_score(y_test, y_pred))
print('Confusion matrix:\n ', confusion_matrix(y_test, y_pred))
```

Running this code gives us the following output:

```
0.9864864864864865
ROC-AUC score 0.9705882352941176
Confusion matrix:
 [[57  0]
 [ 1 16]]
```

Ultimately, we've managed to make an accurate model with both a high precision and recall. Not bad!

Of course, this is just a proof of concept, as the real-world situation is orders of magnitude more complex. At scale, AI systems trained on big data can make a real difference in the fight against malware.
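
The precision and recall mentioned above are not printed by the evaluation code, but they can be read off the confusion matrix. The sketch below hard-codes the matrix from the output and assumes scikit-learn's layout (rows are true labels, columns are predicted labels) together with the article's label mapping (0 = malicious, 1 = non-malicious). The helper name `per_class_metrics` is hypothetical and only used for illustration.

```python
# Confusion matrix copied from the printed output.
# Assumption: rows = true labels, columns = predicted labels,
# index 0 = malicious, index 1 = non-malicious.
cm = [[57, 0],
      [1, 16]]


def per_class_metrics(cm, cls):
    """Return (precision, recall) for class index `cls` (hypothetical helper)."""
    true_pos = cm[cls][cls]
    predicted_as_cls = sum(row[cls] for row in cm)  # column sum
    actually_cls = sum(cm[cls])                     # row sum
    return true_pos / predicted_as_cls, true_pos / actually_cls


total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

mal_precision, mal_recall = per_class_metrics(cm, 0)
ben_precision, ben_recall = per_class_metrics(cm, 1)

# Mean of the two per-class recalls (balanced accuracy)
balanced_accuracy = (mal_recall + ben_recall) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"malicious:         precision={mal_precision:.4f} recall={mal_recall:.4f}")
print(f"non-malicious:     precision={ben_precision:.4f} recall={ben_recall:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

Under these assumptions, the single error is a non-malicious file flagged as malicious, so no malware in the test set slipped through. The mean of the two recalls also matches the ROC-AUC value printed earlier, which is what `roc_auc_score` reduces to when it is given hard 0/1 predictions rather than predicted probabilities.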
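
Because the dataset is imbalanced, one common option is to weight the classes so that the rarer one counts for more during training. The sketch below works through the usual "balanced" weighting heuristic by hand, using the class counts quoted earlier (301 malicious, 72 non-malicious). These are the counts before duplicate rows are dropped, so the numbers are illustrative only.

```python
# Class counts quoted in the article (before duplicate rows are dropped).
counts = {"malicious": 301, "non-malicious": 72}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so the rarer class receives a proportionally larger weight.
weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in weights.items():
    print(f"{label}: {counts[label]} samples, weight {weight:.3f}")
```

With these counts, the non-malicious class ends up weighted roughly four times as heavily as the malicious class. scikit-learn applies the same heuristic when `class_weight="balanced"` is passed to `LogisticRegression`, and `train_test_split` accepts `stratify=y` to keep the class ratio the same in both splits. Whether either change improves results on this dataset has not been tested here.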

Search Engines are Missing Infected Sites, Putting Businesses At Risk
