We came to know Deflect in June of 2012, when violence broke out in Rakhine State, Myanmar (Burma). Our website, the first Rohingya News Agency, had been under continuous attack and went down. The Deflect team brought it back online and are still protecting it today.
Ronnie, Director at Arakan Rohingya National Organisation
eQualitie really saved the day for us when our website was being attacked by the far-right, which caused us to lose our hosting. You all stepped in, offered to host the site (and refused our offer to pay!), and protected us from future attacks with Deflect. We are a grassroots, volunteer-run anarchist community space funded by individual donations, and eQualitie's support was crucial in giving us the resiliency to get through those times. It allowed us to keep online not only our main website, but also our library catalogue (the largest free library of political books in Hamilton) and our fundraising platform (used heavily for regional anti-repression work).
We basically liked everything, particularly the principles with regard to respect for privacy and data ownership: we're assured that Deflect won't sell our data. It also has better integration of Let's Encrypt SSL certificates. We don't have to worry about renewing them every three months because Deflect includes renewal in its workflow. The support is responsive and helpful!
This site is a public resource that has provided a wide range of information about local government and community problems in the Prirpin region since 2013. It is the main communication platform for Prirpin communities.
Thank you so much for your support in getting it back up and running and also for Deflect’s general support of nonprofits. It’s vital that our Ugandan team can communicate their work to their own networks, and we’re very grateful to have found a way to make that possible.
Esther Smitheram, Communications Manager, Children on the Edge
Baskerville is a machine operating on the Deflect network that protects sites from hounding, malicious bots. It's also an open source project that, in time, will be able to reduce bad behaviour on your networks too. Baskerville responds to web traffic, analyzing requests in real-time and challenging those acting suspiciously. A few months ago, Baskerville passed an important milestone: making its own decisions on traffic deemed anomalous. The quality of these decisions (recall) is high, and Baskerville has already successfully mitigated many sophisticated real-life attacks.
We’ve trained Baskerville to recognize what legitimate traffic on our network looks like, and how to distinguish it from malicious requests attempting to disrupt our clients’ websites. Baskerville has turned out to be very handy for mitigating DDoS attacks, and for correctly classifying other types of malicious behaviour.
Baskerville is an important contribution to the world of online security, where solid web defences are usually the domain of proprietary software companies or complicated manual rule-sets. The ever-changing nature and patterns of attacks make their mitigation a continuous process of adaptation. This is why we've trained a machine to recognize and respond to anomalous traffic. Our plans for Baskerville's future include plug-and-play installation in most web environments and privacy-respecting exchange of threat intelligence data between your server and the Baskerville clearinghouse.
Chapter 2 – Background
Web attacks are a threat to democratic voices on the Internet. Botnets deploy an arsenal of methods, including brute-force password login, vulnerability scanning, and DDoS attacks, to overwhelm a platform's hosting resources and defences, or to wreak financial damage on the website's owners. Attacks become a form of punishment, intimidation, and most importantly, censorship, whether through direct denial of access to an Internet resource or by instilling fear among the publishers. Much of the development to date in anomaly detection and mitigation of malicious network traffic has been closed source and proprietary. These siloed approaches are limiting when dealing with constantly changing variables. They are also quite expensive to set up, with a company's costs often offset by the sale or trade of threat intelligence gathered on the client's network, something Deflect does not do or encourage.
Since 2010, the Deflect project has protected hundreds of civil society and independent media websites from web attacks, processing over a billion monthly website requests from humans and bots. We are now bringing internally developed mitigation tooling to a wider audience, improving network defences for freedom of expression and association on the internet.
Baskerville was developed over three years by eQualitie's dedicated team of machine learning experts, who set themselves several ambitious goals. To be an effective alternative to constant human network monitoring, and to the never-ending creation of rules banning newly discovered malicious network behaviour, Baskerville had to:
Be fast enough to make it count
Be able to adapt to changing traffic patterns
Provide actionable intelligence (a prediction and a score for every IP)
Provide reliable predictions (probation period & feedback)
Baskerville works by analyzing HTTP traffic bound for your website, monitoring the proportion of legitimate vs anomalous traffic. On the Deflect network, it will trigger a Turing challenge to an IP address behaving suspiciously, thereafter confirming whether a real person or a bot is sending us requests.
Chapter 3 – Baskerville Learns
To detect new, evolving threats, Baskerville uses the unsupervised anomaly detection algorithm Isolation Forest. The majority of anomaly detection algorithms construct a profile of normal instances, then classify instances that do not conform to this profile as anomalies. The main problem with this approach is that the model is optimized to detect normal instances, not anomalies, causing either too many false alarms or too few detected anomalies. In contrast, Isolation Forest explicitly isolates anomalies rather than profiling normal instances. The method is based on a simple assumption: 'Anomalies are few, and they are different'. In addition, the Isolation Forest algorithm does not require the training set to contain only normal instances; it performs even better if the training set contains some anomalies, or attack incidents in our case. This enables us to re-train the model regularly on all recent traffic, without any labelling procedure, in order to adapt to changing patterns.
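To make the approach concrete, here is a minimal, self-contained sketch of unsupervised anomaly detection with scikit-learn's Isolation Forest; the synthetic data and parameter values are illustrative assumptions, not Baskerville's production configuration:

```python
# A minimal sketch: training an Isolation Forest on unlabelled traffic
# features, where the training set itself may contain some anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))  # ordinary request sets
attack = rng.normal(loc=4.0, scale=0.5, size=(30, 3))    # a dense cluster of bots
X_train = np.vstack([normal, attack])                    # no labels required

model = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
model.fit(X_train)

labels = model.predict(X_train)            # -1 = anomalous, +1 = normal
scores = model.decision_function(X_train)  # lower scores = more anomalous
print(f"flagged {(labels == -1).sum()} of {len(X_train)} request sets as anomalous")
```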
Labelling
Although we don't need labels to train a model, we still need a labelled dataset of historical attacks for parameter tuning. Traditionally, labelling is a challenging procedure, since it requires a lot of manual work: every new attack must be reported and investigated, and every IP labelled either malicious or benign.
Our production environment reports several incidents a week, so we designed an automated labelling procedure using a machine learning model trained on the same features we use for the Isolation Forest anomaly detection model.
We reasoned that if an attack incident has a clearly visible traffic spike, we can assume that the vast majority of IPs active during this period are malicious, and we can train a classifier like Random Forest specifically for this incident. The only user input would be the precise time period of the incident and a time period of ordinary traffic for that host. Such a classifier would not be perfect, but it would be good enough to separate some regular IPs from the majority of malicious IPs during the time of the incident. In addition, we assume that attacker IPs are most likely not active immediately before the attack, so we do not label an IP as malicious if it was seen in the regular traffic period.
This labelling procedure is not perfect, but it allows us to label new incidents with very little time or human interaction.
An example of the labelling procedure output
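A rough sketch of this idea is shown below; the DataFrame columns and the notion of a 'period' column are hypothetical simplifications of our actual pipeline:

```python
# Hypothetical sketch of the automated labelling procedure described above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def label_incident(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """df holds one feature row per IP, with a 'period' column set to
    'regular' (normal traffic) or 'incident' (the attack window)."""
    regular_ips = set(df.loc[df["period"] == "regular", "ip"])
    incident = df[df["period"] == "incident"].copy()

    # Presume IPs in the traffic spike are malicious, unless they were
    # also active during the regular period immediately before the attack.
    incident["label"] = (~incident["ip"].isin(regular_ips)).astype(int)

    # Train a per-incident classifier on these presumptive labels...
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(incident[features], incident["label"])

    # ...and use its scores to separate regular-looking IPs from the
    # bulk of malicious ones during the incident.
    incident["malicious_proba"] = clf.predict_proba(incident[features])[:, 1]
    return incident
```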
Performance Metrics
We use the Precision-Recall AUC metric for model performance evaluation. The main reason for using the Precision-Recall metric is that it is more sensitive to improvements for the positive class than the ROC (receiver operating characteristic) curve. We are less concerned about the false positive rate since, in the event that we falsely predict that an IP is doing something malicious, we won't ban it, but only notify the rule-based attack mitigation system to challenge that specific IP. The IP will only be banned if the challenge fails.
The performance of two different models on two different attacks
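For illustration, computing the PR AUC with scikit-learn looks roughly like this (the labels and scores below are placeholders for our labelled incidents and model outputs):

```python
# Sketch: area under the Precision-Recall curve as the evaluation metric.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                   # 1 = malicious IP
y_score = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.9, 0.4, 0.6])  # anomaly scores

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(f"PR AUC: {auc(recall, precision):.3f}")
```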
Categorical Features
After two months of validating our approach in the production environment, we started to realize that the model was not sophisticated enough to distinguish anomalies specific only to particular clients.
The main reason for this is that the originally published Isolation Forest algorithm supports only numerical features and could not work with so-called categorical string values, such as hostname. First, we decided to train a separate model per target host and create an ensemble of models for the final prediction. This approach overcomplicated the whole process and did not scale well. Additionally, we had to take care of adjusting the weights in the model ensemble, and in fact we jeopardized the original idea of knowledge sharing through a single model for all clients. Then we tried the classical way of dealing with this problem: one-hot encoding. However, the deployed solution did not work well, since the model became overfitted to the new hostname feature and performance decreased.
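For reference, the one-hot attempt looked roughly like the sketch below: every hostname becomes its own binary column, which is what allowed the model to overfit to that feature (the data here is illustrative):

```python
# Sketch of one-hot encoding a categorical 'hostname' feature with pandas.
import pandas as pd

df = pd.DataFrame({
    "request_rate": [1.2, 0.4, 9.8],
    "hostname": ["site-a.org", "site-b.org", "site-a.org"],
})
encoded = pd.get_dummies(df, columns=["hostname"])
print(encoded.columns.tolist())
# ['request_rate', 'hostname_site-a.org', 'hostname_site-b.org']
```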
In the next iteration, we found another way of encoding categorical features, based on a peer-reviewed paper published in 2018. The main idea was not to use one-hot encoding, but rather to modify the tree-building algorithm itself. We could not find an implementation of the idea, and had to modify the source code of the IForest library in Scala. We introduced a new string feature, 'hostname', and this time the model showed notable performance improvement in production. Moreover, our final implementation was generic and allowed us to experiment with other categorical features like country, user agent, and operating system.
Stratified Sampling
Baskerville uses a single machine learning model trained on the data received from hundreds of clients. This allows us to share knowledge and benefit from a model trained on a global dataset of recorded incidents. However, when we first deployed Baskerville, we realized that the model was biased towards high-traffic clients.
We had to find a balance in the amount of data we feed to the training pipeline from each client. On the one hand, we wanted to equalize the number of records from each client; on the other hand, high-traffic clients provided much more valuable incident information. We decided to use stratified sampling of the training datasets with a single parameter: the maximum number of samples per host.
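The sketch below shows the idea with pandas; the 'host' column name and the cap value are illustrative assumptions:

```python
# Sketch: cap the number of training samples contributed by each host.
import pandas as pd

def sample_per_host(df: pd.DataFrame, max_samples_per_host: int,
                    seed: int = 0) -> pd.DataFrame:
    return (
        df.groupby("host", group_keys=False)
          .apply(lambda g: g.sample(min(len(g), max_samples_per_host),
                                    random_state=seed))
    )

# e.g. training_df = sample_per_host(all_request_sets, max_samples_per_host=10000)
```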
Storage
Baskerville uses Postgres to store the processed results. The request-sets table holds the results of the real-time weblogs pre-processed by our analytics engine, with an estimated input of ~30GB per week. Within a year, we'd have a ~1.5TB table. Even though this is within Postgres limits, running queries on it would not be very efficient. That's where the data partitioning feature of Postgres came in. We used it to split the request-sets table into smaller tables, each holding one week's data. This allowed for better data management and faster query execution.
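In SQL terms, the weekly partitioning looks roughly like the following sketch (the schema and names are illustrative, not Baskerville's actual table definition):

```python
# Sketch of declarative weekly range partitioning in Postgres (10+),
# executed via psycopg2. Table and column names are illustrative.
import psycopg2

DDL = """
CREATE TABLE request_sets (
    id         BIGSERIAL,
    target     TEXT,
    features   JSONB,
    created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE request_sets_2019_w01 PARTITION OF request_sets
    FOR VALUES FROM ('2019-01-01') TO ('2019-01-08');
"""

with psycopg2.connect("dbname=baskerville") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```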
However, even with data partitioning, we needed to be able to scale the database out. Since we already had the Timescale extension for the Prometheus database, we decided to use it for Baskerville too. We followed Timescale's tutorial for data migration within the same database: we created a temp table, moved the data from each and every partition into it, ran the command to create a hypertable on the temp table, deleted the initial request-sets table and its partitions, and, finally, renamed the temp table to 'request_sets'. The process was not very straightforward, unfortunately, and we did run into some problems. But in the end we were able to scale the database, and we are currently operating with Timescale in production.
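The migration steps we followed correspond roughly to this sketch (names are illustrative; create_hypertable is Timescale's function):

```python
# Sketch of the partitioned-table-to-hypertable migration described above.
import psycopg2

STEPS = [
    "CREATE TABLE request_sets_tmp (LIKE request_sets INCLUDING ALL);",
    # ...one INSERT per weekly partition:
    "INSERT INTO request_sets_tmp SELECT * FROM request_sets_2019_w01;",
    # Turn the temp table into a Timescale hypertable:
    "SELECT create_hypertable('request_sets_tmp', 'created_at', "
    "migrate_data => TRUE);",
    "DROP TABLE request_sets CASCADE;",
    "ALTER TABLE request_sets_tmp RENAME TO request_sets;",
]

with psycopg2.connect("dbname=baskerville") as conn:
    with conn.cursor() as cur:
        for sql in STEPS:
            cur.execute(sql)
```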
We also explored other options, like TileDb, Apache Hive, and Apache HBase, but for the time being, Timescale is enough for our needs. We will surely revisit this in the future, though.
Architecture
The initial design of Baskerville assumed that it would run within Deflect as an analytics engine, aiding the rule-based attack detection and mitigation mechanism already in place. However, the needs changed: it became necessary to open up Baskerville's predictions to other users and make our insights available to them.
In order to allow other users to take advantage of our model, we had to redesign the pipelines to be more modular. We also needed to take into account the kind of data to be exchanged; more specifically, we wanted to avoid any exchange involving sensitive data, such as IP addresses. The idea was that preprocessing would happen on the client's end, and only the resulting feature vectors would be sent, via Kafka, to the Prediction centre. The Prediction centre continuously listens for incoming feature vectors, and once a request arrives, it uses the pre-trained model to make a prediction and send the results back to the user. This whole process happens without the exchange of any kind of sensitive information, as only the feature vectors go back and forth.
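A simplified sketch of the client-side producer illustrates the privacy property: only an opaque identifier and the feature vector leave the client (the topic name, broker address and message schema here are assumptions for illustration):

```python
# Sketch: sending an anonymized feature vector to the Prediction centre.
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="prediction-centre.example.org:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

message = {
    "request_set_id": "client-side-opaque-id-123",  # no IP, no hostname
    "features": [0.12, 3.4, 0.0, 1.7],              # preprocessed feature vector
}
producer.send("feature-vectors", message)
producer.flush()
```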
On the client side, we had to implement a caching mechanism with a TTL, so that request sets wait for their matching predictions. If the Prediction centre takes more than 10 minutes, the request sets expire. 10 minutes, of course, is not an acceptable amount of time; it is just a safeguard so that we do not keep request sets forever, which could result in OOM. The TTL is configurable. We used Redis for this mechanism, as it has built-in TTL support and there is a spark-redis connector we could easily use, though we're still tuning the performance and thinking about alternatives. We also needed a separate Spark application to handle the matching of predictions to request sets once the response from the Prediction centre is received. This application listens to the client-specific Kafka topic, and once a prediction arrives, it looks into Redis, fetches the matching request set, and saves everything into the database.
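A minimal sketch of the Redis cache, assuming a simple key scheme (the real implementation goes through the spark-redis connector):

```python
# Sketch: request sets wait in Redis until their prediction arrives,
# and expire after the configured TTL to avoid unbounded memory use.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 600  # the 10-minute safeguard mentioned above; configurable

def cache_request_set(request_set_id, request_set):
    r.setex(f"request_set:{request_set_id}", TTL_SECONDS,
            json.dumps(request_set))

def match_prediction(request_set_id, prediction):
    raw = r.get(f"request_set:{request_set_id}")
    if raw is None:
        return None  # the request set expired before the prediction arrived
    record = json.loads(raw)
    record["prediction"] = prediction
    return record  # ready to be saved to the database
```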
To sum up, in the new architecture, preprocessing happens on the client's side, the feature vectors are sent via Kafka to the Prediction centre (no sensitive data is exchanged), a prediction and a score for each request set are sent as a reply to each feature vector (via Kafka), and on the client side, another Spark job consumes the prediction message, matches it with the respective request set, and saves it to the database.
Read more about the project and download the source to try for yourself. Contact us for more information or to get help setting up Baskerville in your web environment.
Deflect responds to the COVID19 emergency with free services
In response to, and in solidarity with, the numerous efforts that have sprung up to help with communications, coordination and outreach during the COVID-19 pandemic, eQualitie is offering Deflect website security and content delivery services for free until the end of 2020 to organizations and individuals working to help others during this difficult time. This includes:
Availability: as demand for your content grows, our worldwide infrastructure will ensure that your website remains accessible and fast
Security: protecting your website from malicious bots and hackers
Hosting: for existing or new WordPress sites
Analytics: view real-time statistics in the Deflect dashboard
Deflect is always offered free of charge to not-for-profit entities that meet our eligibility requirements. This offer extends our free services to any business or individual that is responding to societal needs during the pandemic, including media organizations, government, online retail and hospitality services, etc. We will review all applications to make sure they align with Deflect’s Terms of Use.
It takes 15 minutes to set up and we’ll have you protected on the same day. Our support team can help you in English, French, Chinese, Spanish and Russian. If you have any questions please contact us.
The attacks leading to the publication of this report quickly stood out from the daily onslaught of malicious traffic on Deflect, at first because they were using professional vulnerability scanning tools like Acunetix. The moment we discovered that the origin server of these scans was also hosting fake Gmail domains, it became evident that something bigger was going on. In this report, we describe all the pieces we put together about this campaign, in the hope of contributing to public knowledge about the methods and impact of such attacks against civil society.
Context: Human Rights and Surveillance in Uzbekistan
Emblem of Uzbekistan (wikipedia)
Uzbekistan is defined by many human rights organizations as an authoritarian state with a long record of repressing civil society. Since the collapse of the Soviet Union, two presidents have presided over a system that institutionalized torture and repressed freedom of expression, as documented over the years by Human Rights Watch, Amnesty International and Front Line Defenders, among many others. Repression extended to media and human rights activists in particular, many of whom had to leave the country and continue their work in diaspora.
Uzbekistan was one of the first countries to establish a pervasive Internet censorship infrastructure, blocking access to media and human rights websites. Hacking Team servers in Uzbekistan were identified as early as 2014 by the Citizen Lab, and leaked Hacking Team emails later confirmed that the Uzbek National Security Service (SNB) was among the customers of Hacking Team solutions. A Privacy International report from 2015 describes the installation in Uzbekistan of several monitoring centres with mass surveillance capabilities provided by the Israeli branch of the US-based company Verint Systems and by the Israel-based company NICE Systems. A 2017 Amnesty International report entitled 'We will find you, anywhere' gives more context on the utilisation of these capabilities, describing digital surveillance and targeted attacks against Uzbek journalists and human rights activists. Among other cases, it describes the unfortunate events behind the closure of uznews.net, an independent media website established by Galima Bukharbaeva in 2005 following the Andijan massacre. In 2014, she discovered that her email account had been hacked and that information about the organization, including names and personal details of journalists in Uzbekistan, had been published online. Galima is now the editor of Centre1, a Deflect client and one of the targets of this investigation.
A New Phishing and Web Attack Campaign
On the 16th of November 2018, we identified a large attack against several websites protected by Deflect. This attack used several professional security audit tools like NetSparker and WPScan to scan the websites eltuz.com and centre1.com.
Peak of traffic during the attack (16th of November 2018)
This attack came from the IP address 51.15.94.245 (AS12876, Online SAS, but an IP range dedicated to Scaleway servers). By looking at older traffic from this same IP address, we found several cases of attacks on other Deflect protected websites, but we also found domains mimicking Google and Gmail domains hosted on this IP address, like auth.login.google.email-service[.]host or auth.login.googlemail.com.mail-auth[.]top. We looked into passive DNS databases (using the PassiveTotal Community Edition and other tools like RobTex) and crossed that information with attacks seen on Deflect protected websites with logging enabled. We uncovered a large campaign combining web and phishing attacks against media and activists. We found the first evidence of activity from this group in February 2016, and the first evidence of attacks in December 2017.
The list of Deflect protected websites targeted by this campaign may give some context to the motivation behind it. Four websites were targeted:
Fergana News is a leading independent Russian and Uzbek language news website covering Central Asian countries
Eltuz is an independent Uzbek language online media
Centre1 is an independent media organization covering news in Central Asia
Palestine Chronicle is a non-profit organization working on human rights issues in Palestine
Three of these targets are prominent media focusing on Uzbekistan. We have been in contact with their editors and several other Uzbek activists to see if they had received phishing emails as part of this campaign. Some of them were able to confirm receiving such messages and forwarded them to us. Reaching out further afield, we were able to get confirmations of phishing attacks from other prominent Uzbek activists who were not linked to websites protected by Deflect.
Palestine Chronicle seems to be an outlier in this group of media websites focusing on Uzbekistan. We don’t have a clear hypothesis about why this website was targeted.
A year of web attacks against civil society
Through passive DNS, we identified three IPs used by the attackers in this operation:
46.45.137.74 was used in 2016 and 2017 (timeline is not clear, Istanbul DC, AS197328)
139.60.163.29 was used between October 2017 and August 2018 (HostKey, AS395839)
51.15.94.245 was used between September 2018 and February 2019 (Scaleway, AS12876)
We have identified 15 attacks from the IPs 139.60.163.29 and 51.15.94.245 since December 2017 on Deflect protected websites:
| Date | IP | Target | Tools used |
|---|---|---|---|
| 2017/12/17 | 139.60.163.29 | eltuz.com | WPScan |
| 2018/04/12 | 139.60.163.29 | eltuz.com | Acunetix |
| 2018/09/15 | 51.15.94.245 | www.palestinechronicle.com, eltuz.com, www.fergana.info and uzbek.fergananews.com | Acunetix and WebCruiser |
| 2018/09/16 | 51.15.94.245 | www.fergana.info | Acunetix |
| 2018/09/17 | 51.15.94.245 | www.fergana.info | Acunetix |
| 2018/09/18 | 51.15.94.245 | www.fergana.info | NetSparker and Acunetix |
| 2018/09/19 | 51.15.94.245 | eltuz.com | NetSparker |
| 2018/09/20 | 51.15.94.245 | www.fergana.info | Acunetix |
| 2018/09/21 | 51.15.94.245 | www.fergana.info | Acunetix |
| 2018/10/08 | 51.15.94.245 | eltuz.com, www.fergananews.com and news.fergananews.com | Unknown |
| 2018/11/16 | 51.15.94.245 | eltuz.com, centre1.com and en.eltuz.com | NetSparker and WPScan |
| 2019/01/18 | 51.15.94.245 | eltuz.com | WPScan |
| 2019/01/19 | 51.15.94.245 | fergana.info, www.fergana.info and fergana.agency | Unknown |
| 2019/01/30 | 51.15.94.245 | eltuz.com and en.eltuz.com | Unknown |
| 2019/02/05 | 51.15.94.245 | fergana.info | Acunetix |
Besides classic open-source tools like WPScan, these attacks show the use of a wide range of commercial security audit tools, like NetSparker and Acunetix. Acunetix offers a trial version that may have been used here; NetSparker does not, suggesting that the operators may have a sizeable budget (the standard offer is $4,995/year), unless a cracked version was used.
It is also surprising to see so many different tools coming from a single server, as many of them require a graphical user interface. When we scanned the IP 51.15.94.245, we discovered that it hosted a Squid proxy on port 3128; we think this proxy was used to relay traffic from the operator's own computer.
Extract of an nmap scan of 51.15.94.245 in December 2018:

```
3128/tcp open http-proxy Squid http proxy 3.5.23
|_http-server-header: squid/3.5.23
|_http-title: ERROR: The requested URL could not be retrieved
```
A large phishing campaign
After discovering a long list of domains made to resemble popular email providers, we suspected that the operators were also involved in a phishing campaign. We contacted the owners of targeted websites, along with several Uzbek human rights activists, and gathered 14 different phishing emails targeting two activists between March 2018 and February 2019:
| Date | Sender | Subject | Link |
|---|---|---|---|
| 12th of March 2018 | g.corp.sender[@]gmail.com | У Вас 2 недоставленное сообщение (You have 2 undelivered messages) | http://mail.gmal.con.my-id[.]top/ |
| 13th of June 2018 | service.deamon2018[@]gmail.com | Прекращение предоставления доступа к сервису (Termination of access to the service) | http://e.mail.gmall.con.my-id[.]top/ |
| 18th of June 2018 | id.warning.users[@]gmail.com | Ваш новый адрес в Gmail: alexis.usa@gmail.com (Your new email address in Gmail: alexis.usa@gmail.com) | http://e.mail.users.emall.com[.]my-id.top/ |
| 10th of July 2018 | id.warning.daemons[@]gmail.com | Прекращение предоставления доступа к сервису (Termination of access to the service) | hxxp://gmallls.con-537d7.my-id[.]top/ |
| 10th of July 2018 | id.warning.daemons[@]gmail.com | Прекращение предоставления доступа к сервису (Termination of access to the service) | |
Almost all these emails mimicked Gmail alerts to entice the user to click on the link. For instance, this email received on the 23rd of October 2018 pretends that the account will be closed soon, using images of the text hosted on imgur to bypass Gmail detection:
The only exception was an email received on the 16th of October 2018 pretending to give confidential information on the former Hokim (governor) of Tashkent :
The emails used simple tricks to bypass detection, at times the drw.sh URL shortener (a tool belonging to the Russian security company Doctor Web), at times open redirections offered in several Google tools.
Every email we have seen used a different sub-domain, including emails from the same Gmail account and with the same subject line. For instance, two different emails entitled "Прекращение предоставления доступа к сервису" and sent from the same address used hxxp://gmallls.con-537d7.my-id[.]top/ and http://gmallls.con-4f137.my-id[.]top/ as phishing domains. We think that the operators used a different sub-domain for every email sent in order to bypass Gmail's list of known malicious domains. This would explain the large number of sub-domains identified through passive DNS. We have identified 74 sub-domains for 26 second-level domains used in this campaign (see the appendix below for the full list of discovered domains).
We think that each phishing page stayed online only for a short time after the email was sent, in order to avoid detection. We got access to the phishing pages of a few emails, and could confirm that the phishing toolkit checked whether the password was correct (against the actual Gmail account). We suspect that it also handled two-factor authentication for text messages and 2FA applications, but could not confirm this.
Timeline for the campaign
We found the first evidence of activity in this operation with the registration of the domain auth-login[.]com on the 21st of February 2016. Because we discovered the campaign only recently, we have little information on attacks during 2016 and 2017, but domain registration dates show some activity in July and December 2016, and then again in August and October 2017. It is very likely that the campaign started in 2016 and continued through 2017 without any public reporting about it.
Here is a first timeline we obtained, based on domain registration dates and the dates of web attacks and phishing emails:
To confirm that this group had some activity during 2016 and 2017, we gathered encryption (TLS) certificates for these domains and sub-domains from the crt.sh Certificate Transparency database. We identified 230 certificates generated for these domains, most of them created by Cloudflare. Here is a new timeline integrating the creation of TLS certificates:
We see many certificates created from December 2016 and continuing over 2017, which shows that this group had some activity during that time. The large number of certificates over 2017 and 2018 comes from the campaign operators using Cloudflare for several domains: Cloudflare creates several short-lived certificates at the same time when protecting a website.
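For readers who want to reproduce this step: crt.sh exposes a JSON interface that can be queried per domain, roughly as sketched below (the example domain is one from this campaign; error handling is minimal):

```python
# Sketch: pulling Certificate Transparency records for a domain and its
# subdomains from crt.sh.
import requests

def certificates_for(domain):
    resp = requests.get(
        "https://crt.sh/",
        params={"q": f"%.{domain}", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for cert in certificates_for("my-id.top")[:5]:
    print(cert["not_before"], cert["name_value"])
```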
It is also interesting to note that the campaign started in February 2016, with some activity in the summer of 2016, which happens to be when the former Uzbek president Islam Karimov died, news first reported by Fergana News, one of the targets of this attack campaign.
Infrastructure Analysis
We identified the domains and subdomains of this campaign through analysis of passive DNS information, using mostly the Community access of PassiveTotal. Many domains in 2016/2017 reused the same registrant email address, b.adan1@walla.co.il, which helped us identify other domains related to this campaign:
Based on this list, we identified the subdomains and IP addresses associated with them, and discovered three IP addresses used in the operation. We used Shodan historical data and the dates of passive DNS data to estimate the timeline of the utilisation of the different servers:
46.45.137.74 was used in 2016 and 2017
139.60.163.29 was used between October 2017 and August 2018
51.15.94.245 was used between September 2018 and February 2019
We have identified 74 sub-domains for 26 second-level domains used in this campaign (see the appendix for a full list of IOCs). Most of these domains mimic Gmail, but there are also domains mimicking Yandex (auth.yandex.ru.my-id[.]top), mail.ru (mail.ru.my-id[.]top), qip.ru (account.qip.ru.mail-help-support[.]info), Yahoo (auth.yahoo.com.mail-help-support[.]info), Live (login.live.com.mail-help-support[.]info) and rambler.ru (mail.rambler.ru.mail-help-support[.]info). Most of these domains are sub-domains of a few generic second-level domains (like auth-mail.com), but a few specific second-level domains are interesting:
We have not found any information on vzlom[.]top and fixerman[.]top. Vzlom means 'break-in' (hacking) in Russian, so it could have hosted or mimicked a security website.
A Strange Cybercrime Nexus
It is quite unusual to see connections between targeted attacks and cyber-criminal enterprises; however, during this investigation we encountered two such links.
The first is with the domain msoffice365[.]win, which was registered by b.adan1@walla.co.il (as were many other domains from this campaign) on the 7th of December 2016. This domain was identified as a C2 server for a cryptocurrency theft tool called Quant, as described in a Forcepoint report released in December 2017. VirusTotal confirms that this domain hosted several samples of this malware in November 2017 (it was registered for a year). We have not seen any malicious activity from this domain related to our campaign, but as explained earlier, we have only marginal visibility into the group's activity in 2017.
The second link we found is between the domain auth-login[.]com and the groups behind the Bedep trojan and the Angler exploit kit. auth-login[.]com was linked to this operation through the subdomain login.yandex.ru.auth-login[.]com, which fits the pattern of long subdomains mimicking Yandex used in this campaign, and it was hosted on the same IP address, 46.45.137.74, in March and April 2016 according to RiskIQ. This domain was registered in February 2016 by yingw90@yahoo.com (David Bowers from Grovetown, GA in the US, according to whois information). This email address was also used to register hundreds of domains used in a Bedep campaign, as described by Talos in February 2016 (and confirmed by several other reports). Angler is one of the most notorious exploit kits, commonly used by cyber-criminals between 2013 and 2016. Bedep is a generic backdoor that was identified in 2015 and used almost exclusively with the Angler exploit kit. It should be noted that Trustwave documented the utilization of Bedep in 2015 to inflate the number of views of pro-Russian propaganda videos.
Even though we have not seen these two domains used in this campaign, the links seem too strong to be considered circumstantial. They could indicate a collaboration between cyber-criminal groups and state-sponsored groups or services. It is interesting to recall the potential involvement of Russian hacking groups in the 2014 attacks on the Uznews.net editor, as described by Amnesty International.
Taking Down Servers is Hard
When the attack was discovered, we decided to investigate without sending any abuse requests until a clearer picture of the campaign emerged. In January, we decided that we had enough knowledge of the campaign and started to send abuse requests: for the fake Gmail addresses to Google, and for the URL shorteners to Doctor Web. We did not receive any answer, but noticed that the Doctor Web URLs were taken down a few days later.
Regarding the Scaleway server, we entered into an unexpected loop with their abuse process. Scaleway operates by sending the abuse request directly to the customer and then asking them to confirm that the issue has been resolved. This process works fine in the case of a compromised server, but not when the server was rented intentionally for malicious activities. We did not want to send an abuse request, because it would have given information away to the operators. We contacted Scaleway directly, and it took some time to find the right person on the security team. They acknowledged the difficulty of running an efficient abuse process, and after we sent them an anonymized version of this report along with proof that phishing websites were hosted on the server, they took the server down around the 25th of January 2019.
Being an infrastructure provider ourselves, we understand the difficulty of dealing with abuse requests. For many hosting providers, the number of requests is what makes a case urgent or not. We encourage hosting providers to engage more closely with organisations working to protect civil society, and to establish trust relationships that help quickly mitigate the effects of malicious campaigns.
Conclusion
In this report, we have documented a prolonged phishing and web attack campaign focusing on media covering Uzbekistan and on Uzbek human rights activists. It shows that, once again, digital attacks are a threat to human rights activists and independent media. Several threat actors are known to combine phishing and web attacks (like the Vietnam-related group OceanLotus), but this campaign shows a dual strategy of targeting civil society websites and their editors at the same time.
We have no evidence of government involvement in this operation, but these attacks clearly target prominent voices of Uzbek civil society. They also share strong similarities with the hack of Uznews.net in 2014, where the editor's mailbox was compromised through a phishing email that appeared to be a notice from Google warning her that the account had been involved in distributing illegal pornography.
Over the past 10 years, several organisations, like the Citizen Lab or Amnesty International, have dedicated a lot of time and effort to documenting digital surveillance and targeted attacks against civil society. We hope that this report will contribute to these efforts, and show that today, more than ever, we need to continue supporting civil society against digital surveillance and intrusion.
Counter-Measures Against such Attacks
If you think you are targeted by similar campaigns, here is a list of recommendations to protect yourself.
Against phishing attacks, it is important to learn to recognize classic phishing emails. We give some examples in this report, and you can read other similar reports by the Citizen Lab. You can also read this nice explanation by NetAlert and practice with this Google Jigsaw quiz. The second important point is to make sure that you have configured two-factor authentication on your email and social media accounts. Two-factor authentication means using a second way to authenticate when you log in, besides your password. Common second factors include text messages, temporary password apps and hardware tokens. We recommend using either temporary password apps (like Google Authenticator or FreeOTP) or hardware keys (like YubiKeys). Hardware keys are known to be more secure and are strongly recommended if you are an at-risk activist or journalist.
Against web attacks, if you are using a CMS like WordPress or Drupal, it is very important to update both the CMS and its plugins very regularly, and to avoid using unmaintained plugins (it is very common for websites to be compromised because of outdated plugins). Civil society websites are welcome to apply to Deflect for free website protection.
Appendix
Acknowledgement
We would like to thank Front Line Defenders and Scaleway for their help. We would also like to thank ipinfo.io and RiskIQ for their tools that helped us in the investigation.
The Deflect platform is a free website security service defending civil society and human rights groups from digital attack. Currently, malicious traffic is identified on the Deflect network by Banjax, a system that uses handwritten rules to flag IPs that are behaving like attacking bots, so that they can be challenged or banned. While Banjax is successful at identifying the most common brute-force cyber attacks, the approach of using a static set of rules to protect against the constantly evolving tools available to attackers is fundamentally limited. Over the past year, the Deflect Labs team has been working to develop a machine learning module to automatically identify malicious traffic on the Deflect platform, so that our mitigation efforts can keep pace with methods of attack as they grow in complexity and sophistication.
In this report, we look at the performance of the Deflect Labs’ new anomaly detection tool, Baskerville, in identifying a selection of the attacks seen on the Deflect platform during the last year. Baskerville is designed to consume incoming batches of web logs (either live from a Kafka stream, or from Elasticsearch storage), group them into request sets by host website and IP, extract the browsing features of each request set, and make a prediction about whether the behaviour is normal or not. At its core, Baskerville currently uses the Scikit-Learn implementation of the Isolation Forest anomaly detection algorithm to conduct this classification, though the engine is agnostic to the choice of algorithm and any trained Scikit-Learn classifier can be used in its place. This model is trained on normal web traffic data from the Deflect platform, and evaluated using a suite of offline tools incorporated in the Baskerville module. Baskerville has been designed in such a way that once the performance of the model is sufficiently strong, it can be used for real-time attack alerting and mitigation on the Deflect platform.
To showcase the current capabilities of the Baskerville module, we have replayed the attacks covered in the 2018 Deflect Labs report Attacks Against Vietnamese Civil Society, passing the web logs from these incidents through the processing and prediction engine. This report was chosen for replay because of the variety of attacks seen across its constituent incidents. There were eight attacks in total considered in this report, detailed in the table below.
| Date | Start (approx.) | Stop (approx.) | Target |
|---|---|---|---|
| 2018/04/17 | 08:00 | 10:00 | viettan.org |
| 2018/04/17 | 08:00 | 10:00 | baotiengdan.com |
| 2018/05/04 | 00:00 | 23:59 | viettan.org |
| 2018/05/09 | 10:00 | 12:30 | viettan.org |
| 2018/05/09 | 08:00 | 12:00 | baotiengdan.com |
| 2018/06/07 | 01:00 | 05:00 | baotiengdan.com |
| 2018/06/13 | 03:00 | 08:00 | baotiengdan.com |
| 2018/06/15 | 13:00 | 23:30 | baotiengdan.com |
Table 1: Attack time periods covered in this report. The time period of each attack was determined by referencing the number of Deflect and Banjax logs recorded for each site, relative to the normal traffic volume.
How does it work?
Given one request from one IP, not much can be said about whether or not that user is acting suspiciously, and thus how likely it is that they are a malicious bot, as opposed to a genuine user. If we instead group together all the requests to a website made by one IP over time, we can begin to build up a more complete picture of the user’s browsing behaviour. We can then train an anomaly detection algorithm to identify any IPs that are behaving outside the scope of normal traffic.
The boxplots below illustrate how the behaviour during the Vietnamese attack time periods differs from that seen during an average fortnight of requests to the same sites. To describe the browsing behaviour, 17 features (detailed in the Baskerville documentation) are extracted from the request sets (note that the feature values are scaled relative to average distributions and do not have a physical interpretation). In particular, it can be seen that these attack time periods stand out by having far fewer unique paths requested (unique_path_to_request_ratio), a shorter average path depth (path_depth_average), a smaller variance in the depth of paths requested (path_depth_variance), and a lower payload size (payload_size_log_average). By 'path depth' we mean the number of slashes in the requested URL (so 'website.com' has a path depth of zero, and 'website.com/page1/page2' has a path depth of two), and by 'payload size' we mean the size of the request response in bytes.
Figure 1: The distributions of the 17 scaled feature values during attack time periods (red) and non-attack time periods (blue). It can be seen that the feature distributions are notably different during the attack and non-attack periods.
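As a toy illustration of how features of this kind can be computed, consider one IP's request set as a list of (path, payload size) pairs; the implementation below is a simplification, not Baskerville's actual feature extraction:

```python
# Sketch: computing a few of the 17 browsing features for one request set.
import math
from statistics import mean, pvariance

# One IP's request set: (requested path, response payload size in bytes).
requests = [("/", 512), ("/page1/page2", 2048), ("/page1", 1024)]

def path_depth(path):
    # "website.com" -> 0, "website.com/page1/page2" -> 2, as defined above
    return len([seg for seg in path.split("/") if seg])

depths = [path_depth(path) for path, _ in requests]
features = {
    "path_depth_average": mean(depths),
    "path_depth_variance": pvariance(depths),
    "payload_size_log_average": mean(math.log(size) for _, size in requests),
    "unique_path_to_request_ratio": len({p for p, _ in requests}) / len(requests),
}
print(features)
```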
The separation between the attack and non-attack request sets can be nicely visualised by projecting along the feature dimensions identified above. In the three-dimensional space defined by the average path depth, the average log of the payload size, and the unique path to request ratio, the request sets identified as malicious by Banjax (red) are clearly separated from those not identified as malicious (blue).
Figure 2: The distribution of request sets along three of the 17 feature dimensions, for IPs identified as malicious (red) or benign (blue) by the existing banning module, Banjax. The features shown are the average path depth, the average log of the request payload size, and the ratio of unique paths to total requests during each request set. The separation between the malicious (red) and benign (blue) IPs is evident along these dimensions.
Training a Model
A machine learning classifier enables us to define more precisely the differences between normal and abnormal behaviour, and to predict the probability that a new request set comes from a genuine user. For this report, we chose to train an Isolation Forest: an algorithm that performs well on novelty detection problems and scales to large datasets.
As an anomaly detection algorithm, the Isolation Forest took as training data all the traffic to the Vietnamese websites over a normal two-week period. To evaluate its performance, we created a testing dataset by partitioning out a selection of this data (assumed to represent benign traffic) and combining it with the set of all requests coming from IPs flagged by the Deflect platform's current banning tool, Banjax (assumed to represent malicious traffic). There are a number of tunable parameters in the Isolation Forest algorithm, such as the number of trees in the forest and the assumed contamination of the training data with anomalies. Using the testing data, we performed a grid search over these parameters to optimize the model's accuracy.
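A sketch of that grid search, using Banjax flags as the (imperfect) ground truth and F1 as the selection criterion; the parameter grid and data variables are placeholders:

```python
# Sketch: selecting Isolation Forest parameters against a labelled test set.
from itertools import product
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

def grid_search(X_train, X_test, y_test):
    """y_test uses the same convention as the model: -1 malicious, +1 benign."""
    best_model, best_f1 = None, -1.0
    for n_estimators, contamination in product([50, 100, 200],
                                               [0.01, 0.05, 0.1]):
        model = IsolationForest(n_estimators=n_estimators,
                                contamination=contamination,
                                random_state=0).fit(X_train)
        score = f1_score(y_test, model.predict(X_test), pos_label=-1)
        if score > best_f1:
            best_model, best_f1 = model, score
    return best_model, best_f1
```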
Replaying the Attacks
The model chosen for use in this report has a precision of 0.90, a recall of 0.86, and a resultant F1 score of 0.88, when evaluated on the testing dataset formulated from the Vietnamese website traffic described above. If we take the Banjax bans as absolute truth (which is almost certainly not the case), this means that 90% of the IPs predicted as anomalous by Baskerville were also flagged by Banjax as malicious, and that 86% of all the IPs flagged by Banjax as malicious were also identified as anomalous by Baskerville, across the attacks considered in the Vietnamese report. This is demonstrated visually in the graph below, which shows the overlap between the Banjax flag and the Baskerville prediction (-1 indicates malicious, and +1 indicates benign). It can be seen that Baskerville identifies almost all of the IPs picked up by Banjax, and additionally flags a fraction of the IPs not banned by Banjax.
Figure 3: The overlap between the Banjax results (x-axis) and the Baskerville prediction results (colouring). Where the Banjax flag is -1 and the prediction colour is red, both Banjax and Baskerville agree that the request set is malicious. Where the Banjax flag is +1 and the prediction colour is blue, both modules agree that the request set is benign. The small slice of blue where the Banjax flag is -1, and the larger red slice where the Banjax flag is +1, indicate request sets about which the modules do not agree.
The performance of the model can be broken down across the different attack time periods. The grouped bar chart below compares the number of Banjax bans (red) to the number of Baskerville anomalies (green). In general, Baskerville identifies a much greater number of request sets as malicious than Banjax does, with the exception of the 17th April attack, for which Banjax picked up slightly more IPs than Baskerville. The difference between the two mitigation systems is particularly pronounced for the 13th and 15th June attacks, for which Banjax scarcely identified any malicious IPs at all, while Baskerville identified a high proportion of malicious IPs.
Figure 4: The verdicts of Banjax (left columns) and Baskerville (right columns) across the 6 attack periods. The red/green components show the number of request sets that Banjax/Baskerville labelled as malicious, while the blue/purple components show the number that they labelled as benign. The fact that the green bars are almost everywhere higher than the red bars indicates that Baskerville picks up more traffic as malicious than does Banjax.
This analysis highlights the issue of model validation. It can be seen that Baskerville is picking up more request sets as being malicious than Banjax, but does this indicate that Baskerville is too sensitive to anomalous behaviour, or that Baskerville is outperforming Banjax? In order to say for sure, and properly evaluate Baskerville’s performance, a large testing set of labelled data is needed.
If we look at the mean feature values across the different attacks, it can be seen that the 13th and 15th June attacks (the red and blue dots, respectively, in the figure below) stand out from the normal traffic in that they have a much lower than normal average path depth (path_depth_average), and a much higher than normal 400-code response rate (response4xx_to_request_ratio), which may have contributed to Baskerville identifying a large proportion of their constituent request sets as malicious. Since a low average path depth (e.g. lots of requests made to ‘/’) and a high 400 response code rate (e.g. lots of requests to non-existent pages) are indicative of an IP behaving maliciously, this may suggest that Baskerville’s predictions were valid in these cases. But more labelled data is required for us to be certain about this evaluation.
Figure 5: Breakdown of the mean feature values during the two attack periods (red, blue) for which Baskerville identified a high proportion of malicious IPs, but Banjax did not. These are compared to the mean feature values during a normal two-week period (green).
Putting Baskerville into Action
Replaying the Vietnamese attacks demonstrates that it is possible for the Baskerville engine to identify cyber attacks on the Deflect platform in real time. While Banjax mitigates attacks using a set of static, human-written rules describing what abnormal traffic looks like, Baskerville instead comprehensively describes how normal traffic behaves, which allows it to identify new types of malicious behaviour that have never been seen before.
Although the performance of the Isolation Forest in identifying the Vietnamese attacks is promising, we would require a higher level of accuracy before the Baskerville engine is used to automatically ban IPs from accessing Deflect websites. The model's accuracy can be improved by increasing the amount of data it is trained on, and by performing additional feature engineering and parameter tuning. However, to accurately assess its skill, we require a large set of labelled testing data, more complete than what the Banjax logs offer. To this end, we propose to first deploy Baskerville in a developmental stage, during which IPs suspected to be malicious will be served a Captcha challenge rather than being banned outright. The results of these challenges can be added to the corpus of labelled data, providing feedback on Baskerville's performance.
In addition to the applications of Baskerville for attack mitigation on the Deflect platform, by grouping incoming logs by host and IP into request sets, and extracting features from these request sets, we have created a new way to visualise and analyse attacks after they occur. We can compare attacks not just by the IPs involved, but also by the type of behaviour displayed. This opens up new possibilities for connecting disparate attacks, and investigating the agents behind them.
Where Next?
The proposed future of Deflect monitoring is the Deflect Labs Information Sharing and Analysis Centre (DL-ISAC). The underlying idea behind this project, summarised in the schematic below, is to split the Baskerville engine into separate User Module and Clearinghouse components (dealing with log processing and model development, respectively), enabling a complete separation of personal data from the centralised modelling. Users would process their own web logs locally and send off feature vectors (devoid of IP and host site details) to receive a prediction. This allows threat sharing without compromising personally identifiable information (PII). In addition, this separation would enable the adoption of the DL-ISAC by a much broader range of clients than the Deflect-hosted websites currently being served. Increasing the user base of this software will also increase the amount of browsing data we are able to collect, and thus the strength of the models we are able to train.
Baskerville is an open-source project, with its first release scheduled for next quarter. We hope this will be the first step towards a new era of crowd-sourced threat information sharing and mitigation, empowering internet users to keep their content online in an increasingly hostile web environment.
Figure 6: A schematic of the proposed structure of the DL-ISAC. The infrastructure is split into a log-processing user endpoint, and a central clearinghouse for prediction, analysis, and model development.
A Final Word: Bias in AI
In all applications of machine learning and AI, it is important to consider sources of algorithmic bias, and how marginalised users could be unintentionally discriminated against by the system. In the context of web traffic, we must take into account variations in browsing behaviour across different subgroups of valid, non-bot internet users, and ensure that Baskerville does not penalise underrepresented populations. For instance, checks should be put in place to prevent disadvantaged users with slower internet connections from being banned because their request behaviour differs from those users that benefit from high-speed internet. The Deflect Labs team is committed to prioritising these considerations in the future development of the DL-ISAC.
In November and December 2018, we identified 3 DDoS attacks against independent media website Кавказский Узел (Caucasian Knot)
The first attack was by far the largest DDoS attack seen by the Deflect project in 2018, clocking over 7.7 million queries in 4 hours
The three attacks used different types of relays, including open proxies, botnets and WordPress pingbacks. We could not find any technical intersection between the incidents to point to their orchestration or provenance.
Context
Caucasian Knot is an online media outlet covering the Caucasus, spanning 20 regions of the North and South Caucasus. The publication has eleven thematic areas, with a focus on human rights issues. Several of its reporters have paid the ultimate price for their journalism, including Akhmednabi Akhmednabiev, killed in Dagestan in 2013. Another young Chechen journalist, Zhalaudi Geriev, was kidnapped and tortured in 2016, and is now in Chernokozovo prison. On several occasions, Chechen government officials have publicly called for violence against Caucasian Knot reporters and editors.
Caucasian Knot has received several journalism awards, including the Free Press of Eastern Europe award in 2007 and the Sakharov prize in 2017.
First attack : millions of requests from open proxies on October 19th
The Caucasian Knot website joined Deflect on the 19th of October, under the barrel of a massive DDoS attack that had knocked their servers offline. Deflect logged over 7,700,000 queries to / on www.kavkaz-uzel.eu between 11am and 3pm. This was by far the largest DDoS attack we saw on Deflect in 2018.
The attack came from 351 different IP addresses making requests to /, adding random HTTP queries to bypass any caching mechanism, with queries like GET /?tone=hot or GET /?act=ring, and often adding random referrers like http://www.google.com/translate?u=trade or http://www.comicgeekspeak.com/proxy.php?url=hot. Most of these IP addresses were open proxies used as relays, like the IP 94.16.116.191, which made more than 112,000 queries and is listed as an open proxy in different proxy databases.
Many open proxies are "transparent", which means that they do not add or remove any headers, but it is common for proxies to add an X-Forwarded-For header with the origin IP address. Among the long list of proxies used, several actually added this header, which revealed the IP addresses at the origin of the attack (an occurrence similar to what we previously documented in Deflect Report #4).
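A sketch of this kind of log analysis, assuming a plain-text log format in which the header was recorded; the top origin addresses we recovered are listed below the code:

```python
# Sketch: counting origin IPs revealed by X-Forwarded-For headers in logs.
from collections import Counter

def origin_ips(log_lines):
    counts = Counter()
    for line in log_lines:
        if "X-Forwarded-For:" in line:
            ip = line.split("X-Forwarded-For:")[1].split()[0].strip(',"')
            counts[ip] += 1
    return counts

with open("access.log") as f:
    for ip, n in origin_ips(f).most_common(5):
        print(ip, n)
```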
| IP address | Requests |
|---|---|
| 157.52.132.202 | 1,157,759 |
| 157.52.132.196 | 1,127,194 |
| 157.52.132.191 | 1,018,789 |
| 157.52.132.190 | 1,008,426 |
| 157.52.132.197 | 984,914 |
These IPs are servers hosted by a provider called Global Frag, which offers servers with DDoS protection (sic!). We sent an abuse request to this provider on the 19th of November, and the servers were shut down a few weeks later (we cannot be sure whether this was related to our abuse request). We have not recorded any other malicious traffic from these servers to the Deflect network.
Second attack: botnet attack on November 18th
On this day we identified a second, smaller attack targeting the same website.
The attack queried the / path more than 2 million times, this time without any cache-busting query string, but the source of the attack was quite different. Most of it came from a botnet, with 1,591 positively identified IP addresses (top 10 countries listed below):
| IPs | Country |
|---|---|
| 213 | India |
| 163 | Indonesia |
| 99 | Brazil |
| 63 | Egypt |
| 63 | Morocco |
| 59 | Romania |
| 58 | Philippines |
| 57 | United States |
| 46 | Poland |
| 44 | Vietnam |
A small subset of this attack actually used the WordPress pingback method, generating around 30,000 requests. WordPress pingback attacks are DDoS attacks that use WordPress websites with the pingback feature enabled as relays, generating traffic to the targeted website. A couple of years ago, the WordPress development team updated the user-agent used for pingbacks to include the IP address of the origin server. In our logs, we see two different types of user-agents for the pingbacks (a parsing sketch follows the list):
User agents before WordPress 3.8.2, having only the WordPress version and the website, like WordPress/3.3.2; http://equalit.ie
User agents from WordPress 3.8.2 onwards, having an extra field giving the IP address at the origin of the query, like WordPress/4.9.3; http://[REDACTED]; verifying pingback from 188.166.105.145
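Extracting the origin IP from these modern pingback user-agents is a simple pattern match, roughly as sketched here:

```python
# Sketch: recovering the attack origin IP from a WordPress pingback
# user-agent of the form
#   WordPress/4.9.3; http://relay.example; verifying pingback from 1.2.3.4
import re

UA_RE = re.compile(r"verifying pingback from (\d{1,3}(?:\.\d{1,3}){3})")

def pingback_origin(user_agent):
    match = UA_RE.search(user_agent)
    return match.group(1) if match else None

ua = "WordPress/4.9.3; http://example.org; verifying pingback from 188.166.105.145"
print(pingback_origin(ua))  # -> 188.166.105.145
```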
By analyzing the user-agents of modern WordPress websites, we were able to distinguish 10 attack origin IPs.
All these IPs were part of a booter service (professional DDoS-for-hire) that also targeted B'Tselem and that we described in detail in our Deflect Labs Report #4.
Third attack: WordPress PingBack and Botnets on the 3rd of December
On the 3rd of December, around 3pm UTC, we saw a new attack targeting www.kavkaz-uzel.eu, again with requests only to /. On the diagram below, filtered to requests to / at that time, we can see two peaks of traffic, around 2:20pm and 3pm:
Peak of traffic to / on www.kavkaz-uzel.eu on the 3rd of December
Looking at the first peak of traffic, we were able to establish another instance of a WordPress pingback attack, with user agents like WordPress/3.3.2; http://[REDACTED] or WordPress/4.1; http://[REDACTED]; verifying pingback from 185.180.198.124. We analyzed the user-agents from this attack and identified 135 different websites used as relays, making a total of more than 67,000 requests. Most of these websites were running recent WordPress versions, revealing the IP at the origin of this attack: 185.180.198.124, a server from king-servers.com. King Servers is a Russian server provider considered by some to be a bullet-proof provider. Machines from King Servers were also used in the hack of Arizona's and Illinois' state boards of elections in 2016. Upon closer inspection, we could not find any other interesting services running on this machine, or proof that it was linked to a broader campaign. Among the 135 websites used as relays here, only 25 were also used in the second attack described above, which suggests that these attacks came from an actor with a different list of WordPress relays.
Peak of traffic by user-agent type: the first peak's colour corresponds to WordPress user-agents, the second peak's to Chrome user-agents
The second peak of traffic came from a very different source: we identified 252 different IP addresses at the origin of this traffic, mostly home Internet access routers located in different countries. We think this second peak came from a small botnet of compromised end-systems, mostly located in Russia (32), Egypt (20), India (17), Turkey (14) and Thailand (10), as shown in the following map:
Conclusion
The first DDoS attack had a significant impact on the Caucasian Knot website, leading to their joining the Deflect service. It took us a few days to mitigate this attack, using specific filtering rules and JavaScript challenges to ban hosts. The second and third attacks were much smaller and were automatically mitigated by Deflect.
In our follow-up investigations we could not find a direct technical link to explain the attackers' motivation; however, in all cases the attacks were launched within 24 hours of a publication critical of the Chechen government and countering its official narratives. We did not find any similar correlation between publication and attack for other thematic or region-specific publications on this website.