
Introducing Deflect-next

After several years of laborious effort, we are proud to announce the public release of the new Deflect software ecosystem. In our GitHub organization https://github.com/deflect-ca you will find all of the official Deflect-related open source repositories, which allow you to stand up an entire website protection infrastructure or any of its individual components. Deflect offers a high-performance content caching and delivery network, cyber attack mitigation tools powered by Banjax and the Baskerville machine learning clearinghouse, user dashboards, APIs and much else. You are reading this post on a rebuilt and refactored Deflect infrastructure, and we’re very proud of that!

Below is the story of how the new Deflect came to be and the rationale behind our software choices.

Deflect-next

Created in 2011, Deflect was a relatively simple solution to the many-against-one problem of distributed denial of service (DDoS) attacks targeting civil society’s web servers. By running Deflect’s caching and mitigation software on constantly rotating edge servers strategically located in some of the world’s biggest data centers, we offered a many-against-many scenario, utilizing the distributed nature of the Internet in a similar manner to those organizing attacks against our clients – leveling the playing field and bringing a little more ‘equalitie’ to our clients’ rights to freedom of expression and association. Deflect edges memorized previous requests for website data (by virtue of reverse-proxy caching) and removed the load from our clients’ web servers. To block the onslaught of bot-driven attacks, we developed rule-based attack mitigation software – Banjax – identifying algorithmic behavior that we considered malicious and banning the IPs that exhibited it.

As we scaled our infrastructure and hundreds of independent media outlets, human rights groups and other non-profits joined the service, the Deflect network began receiving millions of daily requests from around the world. This growth was matched by a higher frequency and sophistication of attacks. The Deflect infrastructure was continuously and laboriously patched, improved and upgraded. As is often the case, the code underpinning the service became cumbersome and replete with complicated configurations and fixes. Moreover, we moved further and further away from our secondary project ambition – to see other technology groups run their own instances of Deflect using our code base. The software stack required a lot of manual configuration and ‘insider knowledge’. In 2019 we began to architect and develop a new version of Deflect, using an entirely new method of provisioning, managing and configuring network components, maintaining our agility and adopting reproducibility as a primary design philosophy.

From ATS to Nginx

The primary tool for serving Deflect clients’ websites is the caching software installed on every network edge. It does all the heavy lifting in our design – fielding requests from the Internet on behalf of our clients’ websites in a reverse proxy fashion. Initially we had opted for Apache Traffic Server (ATS), built by Yahoo and released as open source in 2009. The choice was made primarily for its performance under stress tests. The software was not yet widespread and was more difficult to set up and maintain, with the configuration for a single action often spread across multiple files. It required every Deflect network operator to dig deep into documentation and source code to figure out what would happen with every change.

Another caching and proxy solution – Nginx – was a more attractive choice for our new network design, with expressive configuration formats and a much larger technical userbase.

From Ansible and Bash to Python and Docker

Deflect clients customize their caching and attack mitigation settings via the Deflect Dashboard, which sends snapshots of these settings to the network controller. Previously, the configuration engine was a mix of Ansible and Bash, and it had outgrown what could sensibly be managed in those languages. We have now rebuilt the configuration module in Python, which supports better scaling and API communications in the future.

We rebuilt the orchestration module of Deflect – responsible for installing packages, starting and stopping processes, and sending new configuration to network edges – using Docker.

The benefits of containerization are well-known, but in short the advantages to us were: decoupling applications from the host OS (upgrading between Debian versions has historically been a pain point), making our various development and staging environments more reproducible, and letting us create and destroy multiple copies of a container on the same server easily. Once our applications were containerized, we needed a way to start and stop these containers, deciding to write it imperatively in our go-to general-purpose language, Python. There’s a library, docker-py, which connects to the Docker daemon on remote hosts over SSH and provides an API for building images, creating volumes, starting/stopping containers, and everything else we needed. The result is not as simple as Docker Compose or Swarm, but not as complicated as Kubernetes, and is written in a language everyone on our dev team already knows.
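As a concrete (and much simplified) illustration of this approach, the sketch below drives a remote Docker daemon over SSH with docker-py; the helper names, SSH user and container names are ours for illustration only, not Deflect’s actual orchestration code.

```python
# Hypothetical sketch of docker-py orchestration over SSH.
def edge_base_url(host: str, user: str = "deflect") -> str:
    """Build the ssh:// connection string docker-py accepts for a remote daemon."""
    return f"ssh://{user}@{host}"

def deploy_container(host: str, image: str, name: str):
    """Pull an image on a remote edge and (re)start its container."""
    import docker  # third-party: pip install docker

    client = docker.DockerClient(base_url=edge_base_url(host))
    client.images.pull(image)
    # Remove any stale copy of the container before starting the new one.
    for old in client.containers.list(all=True, filters={"name": name}):
        old.remove(force=True)
    return client.containers.run(image, name=name, detach=True)
```

The same `DockerClient` API also covers building images, creating volumes and streaming logs, which is what made a thin imperative Python layer sufficient.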

From Banjax to Banjax-go

Our primary method of mitigation – Banjax – was previously an ATS plugin, and as such was tightly coupled to the internal details of ATS’s request processing state machine and event loop. Written in C++ against ATS internals, it was often limited in functionality by the caching server itself. Answering a simple question like “does a request count against a rate limit even if the result has previously been cached, or only if the request goes through to the origin?” required a close reading of both Banjax and ATS source code. Porting our attack mitigation logic to another server like Nginx would have required a similarly detailed understanding of its internals. In addition, we wanted to share Banjax tooling with others but could not decouple it from ATS – and none of our partners and peers were running this caching software.

We explored an alternate architecture, where attack mitigation logic lives in a separate process (developed in any language and decoupled from the internals of any specific server) and talks to the server over an inter-process communication channel. We tried an Nginx plugin specialized for this purpose (and developed a proof-of-concept ATS plugin in Lua which did the same thing) but found that a combination of the very widely used `proxy_pass` directive and the `X-Accel-Redirect` header was more flexible (authentication responses can redirect the client to arbitrary locations) and probably more portable across servers. As for the choice of language for the authentication service, Python and the Flask framework would have been a natural fit because we use them elsewhere in our stack, but benchmarks showed Go and Gin to be much faster (we were aiming for a worst-case overhead of about 1ms on top of requests with a 50th percentile response time of 50ms, and Go/Gin achieves this).

The interface between Nginx and Banjax-go is a good demonstration of Nginx’s expressive configuration format. The first code block below matches every incoming request (`/`) and proxies it to `http://banjax-go/auth_request`.

location / {
    proxy_pass http://banjax-go/auth_request;
}

Banjax-go then checks the client IP and site name against its lists of IPs to block or challenge. It responds to Nginx with a header that looks like `X-Accel-Redirect: @access_granted` or `X-Accel-Redirect: @access_denied`. These are names of other location blocks in the Nginx configuration, and Nginx performs an internal redirect to one of them.

location @access_granted {
    proxy_pass https://origin-server;
}
location @access_denied {
    return 403 "access denied";
}

This is already much easier to understand than a plugin hooking into the internals of Nginx’s or ATS’s request processing logic (reading and writing a configuration file is easier than reading and writing code). Furthermore, it composes nicely with the other concepts that can be expressed in Nginx’s scoped configuration: you can control logging, caching, error handling and more in each location block, and it’s clear whether a setting applies to the request to banjax-go, the request to the origin server, or the static 403 access denied message.
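For illustration, the decision Banjax-go makes for each auth request can be sketched in Python (the IPs below are documentation addresses, and the `@challenge` location name is a hypothetical addition; the configuration shown here only defines `@access_granted` and `@access_denied`):

```python
# Hypothetical sketch of the per-request decision: map client IP + host to
# the named Nginx location block returned in the X-Accel-Redirect header.
BLOCKED = {("203.0.113.7", "example.com")}  # (ip, site) pairs to deny
CHALLENGED = {"198.51.100.9"}               # IPs to challenge first

def auth_decision(client_ip: str, host: str) -> dict:
    """Return the response headers for one auth_request call."""
    if (client_ip, host) in BLOCKED:
        return {"X-Accel-Redirect": "@access_denied"}
    if client_ip in CHALLENGED:
        return {"X-Accel-Redirect": "@challenge"}
    return {"X-Accel-Redirect": "@access_granted"}
```

Because the response is just a header naming a location block, the mitigation process needs no knowledge of how Nginx then handles the redirect.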

Here’s a diagram showing the proxy_pass + X-Accel-Redirect flow described above (follow the red numbers 1, 2 and 3), along with Banjax-go’s other interfaces: log tailing and the Kafka connection. The log tailing enforces the same regex + rate limit rules that the ATS plugin did, but asynchronously (outside of the client’s request and response) rather than synchronously. The Kafka channel is for receiving decisions from Baskerville (“challenge this IP”) and for reporting whether a challenged IP then passed or failed the challenge.
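A minimal sketch of one such regex + rate-limit rule, written in Python for brevity (the pattern, threshold and window below are illustrative, not Deflect’s actual rules):

```python
import re
from collections import defaultdict, deque

# Illustrative rule: flag an IP after 10 matching hits within 60 seconds.
RULE = {
    "regex": re.compile(r'"(GET|POST) /wp-login\.php'),
    "hits": 10,
    "interval": 60.0,
}

windows: dict = defaultdict(deque)  # per-IP hit timestamps

def process_log_line(ip: str, line: str, now: float) -> bool:
    """Return True if this log line pushes `ip` over the rate limit."""
    if not RULE["regex"].search(line):
        return False
    window = windows[ip]
    window.append(now)
    # Drop hits that have fallen outside the sliding window.
    while window and now - window[0] > RULE["interval"]:
        window.popleft()
    return len(window) >= RULE["hits"]
```

Because the rule runs while tailing the log, a slow decision here delays a ban rather than the client’s response.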

Baskerville client

The machine-led anomaly prediction clearinghouse – Baskerville – is an innovative infrastructure that has been working in production on Deflect for over a year. It is a complicated set-up: edge servers report logs to the clearinghouse, where pre-processing for feature extraction (looking for anomalous behavior in web logs) creates vectors which are then run through the learning model. An anomaly prediction is generated and communicated back to the network edge. The clearinghouse runs on a Kubernetes cluster and requires substantial processing resources.

Recently, we split the software base into two components – the clearinghouse and the client software (which can operate on any Linux + Nginx web server). The idea is to allow third parties not using Deflect to benefit from the clearinghouse’s predictions and the Banjax mitigation tool. In this new model, the Baskerville client is installed independently of Deflect and:

  • Processes nginx web server logs and calculates statistical features.
  • Sends features to a clearinghouse instance of Baskerville.
  • Receives predictions from the clearinghouse for every IP.
  • Issues challenge commands for every malicious IP on a separate Kafka topic.
  • Monitors attacks in Grafana dashboards.
Anyone can benefit from Baskerville’s anomaly predictions and Banjax’s mitigation tools
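As an illustration of the first two steps, a per-IP feature extractor over parsed log entries might look like the sketch below; the feature names and log-entry fields are our assumptions, not Baskerville’s published feature set.

```python
from collections import Counter

def extract_features(requests: list) -> dict:
    """Compute statistical features for one IP over one time window.

    `requests` are parsed log entries: dicts with "path", "ts" (seconds)
    and "status" keys (a hypothetical minimal schema).
    """
    paths = [r["path"] for r in requests]
    timestamps = sorted(r["ts"] for r in requests)
    duration = max(timestamps[-1] - timestamps[0], 1e-9)
    return {
        "request_rate": len(requests) / duration,
        "unique_path_ratio": len(set(paths)) / len(paths),
        "top_path_share": Counter(paths).most_common(1)[0][1] / len(paths),
        "error_ratio": sum(r["status"] >= 400 for r in requests) / len(requests),
    }
```

Vectors like this, rather than raw logs, are what the client sends to the clearinghouse for prediction.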

Deflect-next open source components

  • Deflect – all necessary components to set up your network controller and edge servers – essentially a reverse proxy to your or your clients’ origin web servers.
  • Deflect-API – an interface to Deflect components.
  • Edgemanage – a tool for managing the HTTP availability of a cluster of web servers via DNS. If a machine is found to be under-performing, it is replaced by a new host to ensure maximum network availability.
  • Banjax – basic rate-limiting of incoming requests according to a configurable set of regex patterns.
  • Baskerville – an analytics engine that leverages machine learning to distinguish between normal and abnormal web traffic behavior. Used in concert with Banjax to challenge and ban IPs that breach an operator-defined threshold.
  • Baskerville client – edge software for pre-processing behavioural features from web logs and communicating with the Baskerville clearinghouse for anomaly predictions.
  • Baskerville dashboard – a dashboard for users running the Baskerville client software, offering setup, behavior labeling and feedback to the clearinghouse.

Happy coding everyone!


Updates from Deflect – 3 – 2022

This was a busy month for Deflect’s mitigation tooling, with Banjax blocking almost 12 million malicious requests launched by 108,294 different bots. Due to the war in Ukraine, many people turned to Deflect-protected Ukrainian media sites for information. Throughout the month Deflect served 1,128,751,920 requests (almost double the previous month’s), of which 283,570,50 came from Ukraine – around 20% of our global traffic. 1,277,053 Ukrainians read Deflect-protected websites – also a testament to the stability of the Internet there.

Ukrainian readership in March, by city

The biggest attack recorded this month was against informator.ua – a pan-Ukrainian news website with a focus on the Donbas region.

On the 31st of March, between 07:45 and 08:50 GMT, about 1,300 unique IPs were blocked by Deflect as they attacked informator.ua with GET /ru?8943563843054274 and POST /ru?829986440416200 requests, utilizing cache-busting techniques. These bots were from Brazil, the USA, Indonesia, India, Bangladesh and many other countries; almost 1,000 of them appear to be infected MikroTik routers. Several hundred were compromised web servers and SOCKS proxies. The website experienced partial downtime for about an hour, as Deflect was not able to mitigate the attack fast enough to be sure no malicious requests were hitting the origin. The Baskerville system did not react as expected (this has since been fixed). We enabled Challenger for this domain to make sure we can mitigate future attacks without any impact on the origin. Our log aggregation and analysis system was affected by the overall volume of requests and was out of sync for a short period of time.

Over 300,000 requests per minute were generated by the attackers. As you can see, a significant number of bots originated from the United States. This is another important reminder to patch your computer systems and other Internet-connected devices – otherwise it could be your system attacking Ukrainian websites too!
Top banned unique IPs by vendor

    912 MikroTikRouter
    232 Unknown
     51 UbuntuServer
     44 Torrouter
     33 DebianServer
     16 WindowsServer
      6 WindowsSystem
      6 RedHatServer
      4 CentOSLinuxServer

Top banned unique IPs by service

    875 MikroTik
    232
     49 Ubuntu-ssh
     44 TorExitRouterHTTPheader
     33 Debiansshheader
     13 MikroTikSNMPinfo
     10 MikroTikFTPserver
      8 MikroTikPPTPserver
      7 WindowsRDPServer
      7 MSIISheader
      6 WindowsNetBIOS
      6 RedHatDNSheader
      5 MikrotikRouterOSconfigurationpage
      4 ApacheCentOS
      2 WindowswithMSHTTPAPIWebServer

by client_url:

199940     /ru
102142     /ru/category/biznes/login
37312      /ru/ukraino-rossiyskie-peregovory-v-stambule-itogi
3          /ru/post-prev/45573

Updates from Deflect – 2 – 2022

Since the beginning of this year, we have served over 1.5 billion website requests to approximately 13.5 million unique readers the world over! We mitigated over 17 distinct and significant attacks and kept our clients online 100% of the time! Our combined bot banning technology (machine-led predictions from Baskerville and confirmed anomalies from Banjax) blocked 5,794,533 malicious hits originating from 1,668,388 zombie bots. That’s quite a lot for this early in the season 🙂

Some of the biggest attacks were directed at the Colombian independent journalism website Los Danieles, the Filipino news media Verafiles, a Latin American information agency and an Indian feminist rights portal.

Attacks during January and February 2022. Colours represent different Deflect clients.

In what is a rather unusual occurrence, the deflect.ca website itself was attacked on February 7th. Between about 11:00 and 11:03 GMT, around 10,000 unique IPs sent GET / requests to deflect.ca. None of these requests were banned, as the attack window was too small, but Baskerville worked well, classifying about 5,700 of them as malicious. Some requests returned 502 codes, but these were generally malicious requests. eQPress behaved well, serving up to 2,620 requests per second on nginx, with 5,000 RPS to the database and 100 Mbps of outgoing traffic. No collateral damage was detected for other eQPress clients. We investigated and improved some of our caching logic as a result of this incident.


Deflect – a year in summary

Once again, the Deflect network grew in size and audience in 2021. Apart from the continuously stellar work of our clients, what stood out most for the Deflect team tasked with network monitoring and adversary mitigation was the increasing sophistication and ‘precision’ of Baskerville, outperforming the human-written rule sets of the Banjax rate-limiting mitigation toolkit. Yes, the machine is outperforming humans on Deflect. We won’t get into the philosophical nature of this reality; instead, we’ll share some statistics and interesting attacks we witnessed this year.

Year in stats

Legitimate no. of requests served: 10,152,911,060
Legitimate no. of unique readers (IPs): 77,011,728
Total requests banned by Banjax: 3,326,915
Total requests challenged by Baskerville: 2,606,927
% of Deflect clients also using eQPress hosting: 34%
Total complete Deflect outages: 0
Lowest up-time for any Deflect client: 99.8%
Client increase, year-on-year: 21.62%
Largest botnet, by number of bots: 19,333
Number of significant DDoS events: 103

Deflected attacks

What an attack looks like

On November 4th, 2021, a DDoS attack on a Vietnamese media site (also hosted on eQPress) began around 16:50 UTC. Between 2,000 and 2,500 unique IPs were blocked, originating from the United States, Canada, Germany, France and other countries. These bots issued about 825 thousand GET / and GET https://website.com// requests during the attack. Most IPs involved were detected as proxies, and many of them revealed an IP in the X-Forwarded-For header. The underlying WordPress instance received up to 5,000 requests per second, forcing the eQPress server to send up to 30 megabits per second of HTML responses. Thanks to the FastCGI cache and overall configuration hardening, the hosting network cluster had enough resources to serve requests until all bots were blocked, without any significant issues for the website itself or its neighbors.

Baskerville detected this attack and issued challenge commands to 2,200+ IPs.

Baskerville’s traffic light classification system

This attack targeted an independent investigative journalism website from the Philippines. It began on November 15th and continued over the next two weeks. Large portions of the attack traffic bypassed Deflect entirely, targeting the hosting data center with L3/L4 floods.

Almost 4,000 unique IP addresses issued more than 70 million “GET /” and “GET /?&148294400498e131004165713TT117859756720Q106417752262N” requests against the website, using cache-busting techniques with random query_string parameters. Attackers also used forged User-Agent request headers. This attack was clearly adapted to Deflect’s caching defenses. Many of the participating IPs were proxies, possibly revealing the original sender in the X-Forwarded-For header.
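Random query strings defeat a cache keyed on the full URL because every request looks unique and must be forwarded to the origin. One common counter-measure, sketched below, is to normalize the cache key with a parameter allow-list (the allowed parameters here are hypothetical):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allow-list: only these query parameters affect the cache key.
ALLOWED_PARAMS = {"page", "lang"}

def cache_key(url: str) -> str:
    """Normalize a request URL into a cache key, dropping unknown parameters."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS)
    return parts.path + ("?" + urlencode(kept) if kept else "")
```

With a key like this, all of the randomized `/?...` requests collapse into a single cached entry for `/`.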

Unfortunately, this attack was not fully mitigated quickly and caused several hours of downtime for real users. After manually enabling Deflect’s advanced protection mechanisms and adjusting the origin’s configuration, the website became stable again.

A Zambian democratic watchdog organization was attacked twice, between August 8-9 and 11-12. When the attackers came back the second time, they hadn’t learned their lesson: they tried a similar technique with an almost identical botnet.

Servers from different countries (mostly the United States, Germany, Russia and France) sent more than 16 million GET / and /s=87675957 requests (with random numbers to bypass caching) during the first round of attacks. During the following incident, over 137 million malicious requests were recorded and blocked.

Most of these IPs are known compromised servers that can be used as proxies, along with MikroTik routers. 383 unique User-Agent headers were used, all of them variations on Google Chrome. We also saw about 400 Tor exit nodes participating in this attack.

Millions of hits per minute

The first attack was not completely mitigated due to its profile, and some traffic was able to hit the origin server, resulting in several hours of partial downtime for real visitors during different phases of the attack. The second attack was completely mitigated, as we had already updated our mitigation profiles.


Deflect partners with technology and media groups

June 1, 2021

Since 2010, Deflect has specialized in protecting online platforms from cyber attacks. Today, our mission and time-tested tooling reaches further and wider than ever before! We are honoured to announce strategic partnerships with well-known Internet Service Providers and digital media entrepreneurs in the Americas and Europe. Our combined service offering includes all manner of web hosting and online collaboration platforms, technical consultancy and web security services. With over a hundred years of collective technology expertise and a dozen common languages between us, this is a partnership that will serve a global clientele and meet the challenges of shrinking online spaces for expression and self-determination.

Our mission is strengthened through this mutually beneficial partnership. We stand together, stronger and ever more resilient, to protect our clients’ platforms with ethical technology solutions, multilingual human resources and a common belief in principles before profits.

Dmitri Vitaliev, Founder deflect.ca

Find out more about our partners’ individual services and mission from the list below. Check out Deflect’s partnership opportunities and write to us!

@colnodo

Colnodo is a not-for-profit organization that has been providing Internet infrastructure services to activists and civil society organizations since 1994. Colnodo’s main objective is the access, use and appropriation of information and communication technologies (ICT) for social development, human development and the improvement of people’s living conditions through the strengthening of capacities and competencies, education for work, information and knowledge exchange, increased citizen participation, sustainable development and innovation.

@greenhost

Greenhost (Netherlands) is an established infrastructure provider focusing on digital human rights and sustainability. By providing infrastructure services to a wide range of organisations supporting human rights, free press and censorship circumvention, while preserving privacy guarantees, Greenhost helps keep the internet an open and innovative space.

@greennetisp

GreenNet (UK) have been networking people and activist groups for peace, the environment, equality and human rights since 1986 – providing internet services, web design and hosting. Our hardware and software choices are based on expert technical judgment, ecological sustainability and ethical business values.

@cloud68hq

(Tirana, Tallinn, Worldwide) Cloud68.co provides reliable open source digital infrastructure to for-purpose small and medium teams, organizations and individuals, with responsive and friendly support. As a team of long-time contributors to digital privacy and open knowledge projects, we are committed to helping you migrate away from big tech as easily as possible.

@sembramedia

SembraMedia is a nonprofit dedicated to empowering diverse voices in Spanish media to publish news and information with independence, journalistic integrity, and a positive impact on the communities they serve. They conduct research, provide training, consulting, and financial support to help media leaders develop more sustainable business models in Latin America, Spain, and the U.S. Hispanic market.

At MainMicro, our goal is to ensure customer satisfaction by providing ongoing support and cost effective solutions for our partners. We take great pride in having a customer retention rate that is among the highest in the industry. For us, when you become a customer you also become a friend, and we become the one-stop shop for all of your IT related needs.

At Black Crow Labs we construct your brand’s ecosystem and tell your story.  By engaging with prospective customers on targeted platforms we integrate your brand into their lives and conversations.


Updates from Deflect – 2 – 2021

Traffic & Attacks

Since the beginning of this year, we have served over 2 billion website hits to approximately 18 million unique readers the world over! We mitigated over 30 distinct attacks and kept our clients online 100% of the time! The Banjax bot banning technology blocked 291,898 malicious hits originating from 58,181 zombie bots. Our machine-led anomaly prediction system Baskerville further identified and challenged suspiciously behaving IPs 1,182,084 times, out of which only 16,755 proved to be legitimate readers and were allowed to access the requested website. This equates to 98.58% precision – which is pretty good for a machine!

Most popular countries reading Deflect protected websites

These attacks helped us confirm that our earlier implementation of Shapley value estimation in Baskerville had led to positive results. Shapley values are a general way to explain the output of a machine learning model by ranking feature importance – helping us decide which features work best. We used this algorithm to compare an older model with one that uses only the features Shapley values rank as important, on a data set containing the latest attacks. The model with only the most important features outperformed the older one.
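As a toy illustration of the idea (not Baskerville’s actual pipeline, which estimates these values for a trained model over many features), the exact Shapley attribution for a tiny value function can be computed directly:

```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley attribution of `value` (a set function) to each feature."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Weight of this coalition size in the Shapley average.
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += w * (value(s | {f}) - value(s))
        phi[f] = total
    return phi
```

Ranking features by these values gives the importance ordering used to decide which features to keep; for an additive value function, each feature’s Shapley value equals its standalone contribution.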

Deflect referral program

Financial survival and independence on today’s Internet is tough. Big Tech permeates and controls virtually every aspect of our digital experience. When it comes to Internet infrastructure and network services, corporate giants such as Akamai, AWS and Cloudflare dominate the space. This handful of companies has managed to create an ecosystem in which they profit from virtually every transaction or advertising campaign. While we choose our own destiny as consumers, the growing problem is a lack of choices. One way or another, we are being pushed towards a handful of companies.

We want to do things differently. Our goal is to succeed in lockstep with our clients, not simply profit from them. The Deflect referral program creates a mutually beneficial commercial opportunity – by registering for this program and installing a ‘Protected by Deflect’ badge on your website with a unique hyperlink, you will receive 50% of the first full month’s fees charged to every new client that subscribed from this link. Write to partner@deflect.network if you want to participate in this program or read more about this and other collaborative opportunities on the Partner Programs page.

New Website

You are reading this update on our freshly minted website – powered by WordPress and hosted on the secure eQPress platform. We decided to build it using the default 2020 theme. This code is supported by the WordPress team and built according to best practices. That’s important when it comes to running the popular (but often compromised) WordPress platform – the ease of installing new themes and plugins lowers the barrier to entry and makes it highly functional and customizable. At the same time, custom code becomes outdated and insecure, and often leads to website hacking and unintended DDoS attacks. Our set-up comes with the following:

  • Protection from DDoS attacks and password brute-force
  • Daily snapshots and differential backup
  • Long term theme support from WordPress
  • SEO management, chat support, Matomo Analytics, Polylang translations

Over 25% of Deflect clients also host their website on eQpress. The service is detailed on the eQpress page and you can request it from the Dashboard, or contact us with questions. 


Updates from Deflect – January 2021

Giant leaps in our machine-led mitigation tooling have removed some of the heavy load of mitigating attacks from our support team this month. We’re very pleased with the machine’s performance, but it won’t replace the humans! Below, we share some traffic highlights, Deflect-relevant events and stories from our clients.

January Traffic

Throughout the month of January, Deflect served over 884 million requests to more than 9 million unique readers around the world. Much of the traffic was bound for Los Danieles – a new and very popular independent media publication in Colombia. Every Sunday, their website attracts between 5 and 9 million legitimate hits! Thankfully, Deflect was able to serve over 94% of these requests directly from its edge cache.

Content served from Deflect cache

Another notable traffic event this month coincided with the release of an online report, investigating torture of thousands of Belarus protesters at the hands of the incumbent government forces. Published by the renowned Committee Against Torture, this is a visual investigation, suitable for mature audiences only.

Notable Attacks in January

Sixteen distinct attacks were recorded against Deflect-protected websites this month. Of these, five were notable for their strength and persistence, with two attacks continuing over a four-day period. The largest attack, with over 5,000 bots participating, targeted the Vietnamese independent news site Tiếng Dân. This is not the first time their website has been targeted. Approximately half of the attacking bots were discovered and challenged by Baskerville, whilst the other half were blocked by our manual rule sets. Overall, Deflect maintained 100% network up-time in January.

Kandinsky theme options for eQpress

Deflect clients who run or would like to migrate to our secure WordPress hosting platform can now request the installation of a new theme called Kandinsky. Developed by our friends at «Теплица социальных технологий» (Greenhouse for Social Technology) in response to needs expressed by civil society groups wishing to have an effective and well-designed online presence, Kandinsky offers three templates and guides website creators with helpful tips and checklists. You can read our full interview with Kandinsky here.

Deflect and the World Social Forum

On January 29th, Deflect staff participated in a live panel during the World Social Forum together with our partners Colnodo and the Foundation for Freedom of the Press (FLIP). Julian Casasbuenas from Colnodo presented the use case of a Colombian independent media site losdanieles.com as an example of Deflect protection and its importance to the development of free journalistic expression in Colombia. The losdanieles.com project was launched by a group of highly reputable journalists who had been implicated in the Colombian parapolitics scandal.

To finish this newsletter, we wanted to share a lovely thank-you video sent to us by LosDanieles.com columnist Daniel Samper Ospina.

Find the Daniels on Twitter @DanielSamperO, @DanielsamperP and @DCoronell


Introducing Baskerville (waf!)

The more outré and grotesque an incident is the more carefully it deserves to be examined.
― Arthur Conan Doyle, The Hound of the Baskervilles

Chapter 1 – Baskerville

Baskerville is a machine operating on the Deflect network that protects sites from hounding, malicious bots. It’s also an open source project that, in time, will be able to reduce bad behaviour on your networks too. Baskerville responds to web traffic, analyzing requests in real time and challenging those acting suspiciously. A few months ago, Baskerville passed an important milestone – making its own decisions on traffic deemed anomalous. The quality of these decisions (recall) is high, and Baskerville has already successfully mitigated many sophisticated real-life attacks.

We’ve trained Baskerville to recognize what legitimate traffic on our network looks like, and how to distinguish it from malicious requests attempting to disrupt our clients’ websites. Baskerville has turned out to be very handy for mitigating DDoS attacks, and for correctly classifying other types of malicious behaviour.

Baskerville is an important contribution to the world of online security – where solid web defences are usually the domain of proprietary software companies or complicated manual rule-sets. The ever-changing nature and patterns of attacks makes their mitigation a continuous process of adaptation. This is why we’ve trained a machine how to recognize and respond to anomalous traffic. Our plans for Baskerville’s future will enable plug-and-play installation in most web environments and privacy-respecting exchange of threat intelligence data between your server and the Baskerville clearinghouse.

Chapter 2 – Background 

Web attacks are a threat to democratic voices on the Internet. Botnets deploy an arsenal of methods, including brute force password login, vulnerability scanning, and DDoS attacks, to overwhelm a platform’s hosting resources and defences, or to wreak financial damage on the website’s owners. Attacks become a form of punishment, intimidation, and most importantly, censorship, whether through direct denial of access to an Internet resource or by instilling fear among the publishers. Much of the development to date in anomaly detection and mitigation of malicious network traffic has been closed source and proprietary. These siloed approaches are limiting when dealing with constantly changing variables. They are also quite expensive to set up, with a company’s costs often offset by the sale or trade of threat intelligence gathered on the client’s network – something Deflect does not do or encourage.

Since 2010, the Deflect project has protected hundreds of civil society and independent media websites from web attacks, processing over a billion monthly website requests from humans and bots. We are now bringing internally developed mitigation tooling to a wider audience, improving network defences for freedom of expression and association on the internet.

Baskerville was developed over three years by eQualitie’s dedicated team of machine learning experts. The team set out to meet several challenges and ambitions. To be an effective alternative to constant human network monitoring, and to the never-ending creation of rules banning newly discovered malicious network behaviour, Baskerville had to:

  • Be fast enough to make it count
  • Be able to adapt to changing traffic patterns
  • Provide actionable intelligence (a prediction and a score for every IP)
  • Provide reliable predictions (probation period & feedback)

Baskerville works by analyzing HTTP traffic bound for your website, monitoring the proportion of legitimate vs anomalous traffic. On the Deflect network, it will trigger a Turing challenge to an IP address behaving suspiciously, thereafter confirming whether a real person or a bot is sending us requests.

Chapter 3 – Baskerville Learns

To detect new, evolving threats, Baskerville uses the unsupervised anomaly detection algorithm Isolation Forest. The majority of anomaly detection algorithms construct a profile of normal instances, then classify instances that do not conform to the normal profile as anomalies. The main problem with this approach is that the model is optimized to profile normal instances, not to detect anomalies, causing either too many false alarms or too few detected anomalies. In contrast, Isolation Forest explicitly isolates anomalies rather than profiling normal instances. This method is based on a simple assumption: ‘Anomalies are few, and they are different’. In addition, the Isolation Forest algorithm does not require a training set to contain normal instances only. Moreover, the algorithm performs even better if the training set contains some anomalies, or attack incidents in our case. This enables us to re-train the model regularly on all the recent traffic, without any labeling procedure, in order to adapt to changing patterns.
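To make the isolation idea concrete, here is a minimal pure-Python sketch of the mechanism (an illustration only, not Baskerville’s actual Spark-based implementation). Random splits isolate outliers in far fewer steps than points inside dense clusters, so a shorter average path length across many random trees signals an anomaly:

```python
import random

def build_tree(points, depth, max_depth):
    # Stop at a leaf when the point set cannot be split further
    if depth >= max_depth or len(points) <= 1:
        return ("leaf",)
    dim = random.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf",)
    split = random.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def path_length(tree, p, depth=0):
    # Number of random splits needed before p lands in a leaf
    if tree[0] == "leaf":
        return depth
    _, dim, split, left, right = tree
    return path_length(left if p[dim] < split else right, p, depth + 1)

def anomaly_scores(points, n_trees=50, max_depth=8):
    """Average isolation depth per point: LOWER means MORE anomalous."""
    trees = [build_tree(points, 0, max_depth) for _ in range(n_trees)]
    return [sum(path_length(t, p) for t in trees) / n_trees for p in points]
```

An outlier far from the main cluster is typically separated after only a few random cuts, while points inside the cluster need many, which is exactly why the algorithm needs no profile of “normal” traffic.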

Labelling

Despite the fact that we don’t need labels to train a model, we still need a labelled dataset of historical attacks for parameter tuning. Traditionally, labelling is a challenging procedure since it requires a lot of manual work. Every new attack must be reported and investigated, and every IP should be labelled either malicious or benign.

Our production environment reports several incidents a week, so we designed an automated labelling procedure using a machine learning model trained on the same features we use for the Isolation Forest anomaly detection model.

We reasoned that if an attack incident has a clearly visible traffic spike, we can assume that the vast majority of the IPs during this period are malicious, and we can train a classifier like Random Forest specifically for this incident. The only user input would be the precise time period of the incident and a time period of ordinary traffic for that host. Such a classifier would not be perfect, but it would be good enough to separate some regular IPs from the majority of malicious IPs during the incident. In addition, we assume that attacker IPs are most likely not active immediately before the attack, so we do not label an IP as malicious if it was seen in the regular traffic period.

This labelling procedure is not perfect, but it allows us to label new incidents with very little time or human interaction.
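The core assumption behind the procedure can be sketched in a few lines of Python. This is a deliberate simplification (the real pipeline trains a per-incident classifier); the function name and label strings are illustrative:

```python
def label_ips(attack_window_ips, regular_window_ips):
    """Label IPs seen during an attack spike.

    Assumes, as described above, that attacker IPs are rarely active in
    the regular-traffic period: an IP seen only during the spike is
    presumed malicious, while one also seen in ordinary traffic is not.
    """
    regular = set(regular_window_ips)
    return {ip: ("benign" if ip in regular else "malicious")
            for ip in set(attack_window_ips)}
```

The output of such a heuristic then serves as the (imperfect but cheap) labelled dataset used for parameter tuning.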

An example of the labelling procedure output

Performance Metrics

We use the Precision-Recall AUC metric for model performance evaluation. The main reason for using the Precision-Recall metric is that it is more sensitive to the improvements for the positive class than the ROC (receiver operating characteristic) curve. We are less concerned about the false positive rate since, in the event that we falsely predict that an IP is doing something malicious, we won’t ban it, but only notify the rule-based attack mitigation system to challenge that specific IP. The IP will only be banned if the challenge fails.
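For reference, average precision (the step-wise area under the Precision-Recall curve) can be computed in a few lines. This is a generic sketch of the metric itself, not Baskerville’s evaluation code:

```python
def average_precision(labels, scores):
    """Step-wise average precision: walk the ranking by descending score,
    accumulating precision at every true-positive hit."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    total_pos = sum(labels)
    ap = 0.0
    for _score, y in ranked:
        if y:
            tp += 1
            ap += tp / (tp + fp)   # precision at this rank
        else:
            fp += 1
    return ap / total_pos
```

A perfect ranking (all positives scored above all negatives) yields 1.0, and every misranked negative above a positive pulls the score down, which is what makes the metric sensitive to improvements on the positive (malicious) class.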

The performance of two different models on two different attacks

Categorical Features

After two months of validating our approach in the production environment, we started to realize that the model was not sophisticated enough to distinguish anomalies specific only to particular clients.

The main reason for this is that the originally published Isolation Forest algorithm supports only numerical features and cannot work with so-called categorical string values, such as hostname. First, we decided to train a separate model per target host and create an ensemble of models for the final prediction. This approach overcomplicated the whole process and did not scale well. Additionally, we had to take care of adjusting the weights in the model ensemble, and in doing so we jeopardized the original idea of knowledge sharing through a single model for all clients. Then we tried the classical way of dealing with this problem: one-hot encoding. However, the deployed solution did not work well, since the model became overfitted to the new hostname feature and performance decreased.

In the next iteration, we found another way of encoding categorical features, based on a peer-reviewed paper published in 2018. The main idea was not to use one-hot encoding, but rather to modify the tree-building algorithm itself. We could not find an implementation of the idea, so we modified the source code of the IForest library in Scala. We introduced a new string feature, ‘hostname’, and this time the model showed a notable performance improvement in production. Moreover, our final implementation was generic and allowed us to experiment with other categorical features like country, user agent and operating system.

Stratified Sampling

Baskerville uses a single machine learning model trained on data received from hundreds of clients. This allows us to share knowledge and benefit from a model trained on a global dataset of recorded incidents. However, when we first deployed Baskerville, we realized that the model was biased towards high-traffic clients.

We had to find a balance in the amount of data we feed to the training pipeline from each client. On the one hand, we wanted to equalize the number of records from each client, but on the other hand, high traffic clients provided much more valuable incident information. We decided to use stratified sampling of training datasets with a single parameter: the maximum number of samples per host.
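The sampling step can be sketched as follows (structure and names are illustrative, not Baskerville’s actual Spark code): each host contributes at most `max_per_host` records to the training set, so no single high-traffic client can dominate the global model.

```python
import random

def stratified_sample(records, max_per_host, seed=0):
    """Cap the number of training records contributed by each host."""
    rng = random.Random(seed)
    by_host = {}
    for rec in records:
        by_host.setdefault(rec["host"], []).append(rec)
    sample = []
    for host, recs in by_host.items():
        if len(recs) > max_per_host:
            recs = rng.sample(recs, max_per_host)  # downsample big hosts
        sample.extend(recs)                        # keep small hosts whole
    return sample
```

The single `max_per_host` parameter is the trade-off knob: low values equalize clients, high values preserve more of the incident-rich data from busy hosts.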

Storage

Baskerville uses Postgres to store the processed results. The request sets table holds the results of the real-time weblogs pre-processed by our analytics engine, with an estimated input of ~30 GB per week. Within a year, we would have a ~1.5 TB table. Even though this is within Postgres limits, running queries on it would not be very efficient. That’s where the data partitioning feature of Postgres came in: we used it to split the request sets table into smaller tables, each holding one week’s data. This allowed for better data management and faster query execution.
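As an illustration of the one-week-per-partition layout, the child table a row belongs to can be derived from its timestamp via the ISO week number (the naming scheme here is hypothetical, not the actual table names):

```python
from datetime import datetime

def weekly_partition(ts: datetime) -> str:
    """Name of the weekly child table a row would land in,
    e.g. request_sets_2020_w03 for any timestamp in ISO week 3 of 2020."""
    year, week, _ = ts.isocalendar()
    return f"request_sets_{year}_w{week:02d}"
```

Routing rows this way keeps each partition small enough to query and drop efficiently.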

However, even with data partitioning, we needed to be able to scale the database out. Since we already had the Timescale extension for the Prometheus database, we decided to use it for Baskerville too. We followed Timescale’s tutorial for data migration within the same database: we created a temp table, moved the data from each partition into it, ran the command to create a hypertable on the temp table, deleted the initial request sets table and its partitions, and finally renamed the temp table ‘request sets’. The process was not very straightforward, unfortunately, and we did run into some problems. But in the end we were able to scale the database, and we are currently running Timescale in production.

We also explored other options, like TileDb, Apache Hive, and Apache HBase, but for the time being, Timescale is enough for our needs. We will surely revisit this in the future, though.

Architecture

The initial design assumed that Baskerville would run inside Deflect as an analytics engine, aiding the rule-based attack detection and mitigation mechanism already in place. However, the needs changed as it became necessary to open up Baskerville’s predictions to other users and make our insights available to them.

In order to allow other users to take advantage of our model, we had to redesign the pipelines to be more modular. We also needed to take into account the kind of data to be exchanged; more specifically, we wanted to avoid any exchange involving sensitive data, such as IP addresses. The idea was that pre-processing would happen on the client’s end, and only the resulting feature vectors would be sent, via Kafka, to the Prediction centre. The Prediction centre continuously listens for incoming feature vectors; once a request arrives, it uses the pre-trained model to predict and sends the results back to the user. The whole process happens without the exchange of any sensitive information, as only the feature vectors go back and forth.

On the client side, we had to implement a caching mechanism with a TTL, so that request sets wait for their matching predictions. If the Prediction centre takes more than 10 minutes, the request sets expire. Ten minutes is not an acceptable response time, of course – just a safeguard so that we do not keep request sets forever, which could cause out-of-memory errors; the TTL is configurable. We used Redis for this mechanism, as it has TTL support built in and there is a spark-redis connector we could easily use, though we are still tuning the performance and considering alternatives. We also needed a separate Spark application to match predictions to request sets once the response from the Prediction centre is received. This application listens to the client-specific Kafka topic and, once a prediction arrives, looks into Redis, fetches the matching request set, and saves everything to the database.
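The caching behaviour described above can be sketched with an in-memory stand-in for Redis (Baskerville itself relies on Redis’s built-in key expiry; the class and method names here are illustrative):

```python
import time

class RequestSetCache:
    """In-memory stand-in for the Redis TTL cache: request sets wait
    here until their prediction arrives or the TTL expires."""

    def __init__(self, ttl_seconds=600):   # 10 minutes, as in the text
        self.ttl = ttl_seconds
        self.store = {}

    def put(self, request_id, request_set, now=None):
        now = time.time() if now is None else now
        self.store[request_id] = (request_set, now + self.ttl)

    def match_prediction(self, request_id, prediction, now=None):
        """Join an incoming prediction with its cached request set.
        Returns None if the request set is missing or has expired."""
        now = time.time() if now is None else now
        entry = self.store.pop(request_id, None)
        if entry is None:
            return None
        request_set, expires_at = entry
        if now > expires_at:        # expired: drop to bound memory use
            return None
        return {**request_set, "prediction": prediction}
```

The consumer job would call `match_prediction` for each message from the client-specific Kafka topic and persist the merged record to the database.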

To sum up, in the new architecture the pre-processing happens on the client’s side; the feature vectors are sent via Kafka to the Prediction centre (no sensitive data is exchanged); a prediction and a score for each request set are sent back as a reply to each feature vector (via Kafka); and on the client side another Spark job consumes the prediction message, matches it with the respective request set, and saves it to the database.

Read more about the project and download the source to try it for yourself. Contact us for more information or to get help setting up Baskerville in your web environment.

Categories
Advocacy DDoS Deflect Deflect Labs News from Deflect Labs Threat Intel

Deflect Labs Report #6: Phishing and Web Attacks Targeting Uzbek Human Right Activists and Independent Media

Key Findings

  • We’ve discovered infrastructure used to launch and coordinate attacks targeting independent media and human rights activists from Uzbekistan
  • The campaign has been active since early 2016, using web and phishing attacks to suppress and exploit their targets
  • We have no evidence of who is behind this campaign, but the target list points to a new threat actor targeting Uzbek activists and media

Introduction

The Deflect project was created to protect civil society websites from web attacks, following the publication of the “Distributed Denial of Service Attacks Against Independent Media and Human Rights Sites” report by the Berkman Center for Internet & Society. Since then, we have investigated many DDoS attacks, leading to the publication of several reports.

The attacks leading to the publication of this report quickly stood out from the daily onslaught of malicious traffic on Deflect, at first because they were using professional vulnerability-scanning tools like Acunetix. The moment we discovered that the origin server of these scans was also hosting fake Gmail domains, it became evident that something bigger was going on. In this report, we describe the pieces we have put together about this campaign, in the hope of contributing to public knowledge about the methods and impact of such attacks against civil society.


Context: Human Rights and Surveillance in Uzbekistan

Emblem of Uzbekistan (wikipedia)

Uzbekistan is described by many human rights organizations as an authoritarian state with a long record of repressing civil society. Since the collapse of the Soviet Union, two presidents have presided over a system that institutionalized torture and repressed freedom of expression, as documented over the years by Human Rights Watch, Amnesty International and Front Line Defenders, among many others. Repression extended to media and human rights activists in particular, many of whom had to leave the country and continue their work in diaspora.

Uzbekistan was one of the first countries to establish a pervasive Internet censorship infrastructure, blocking access to media and human rights websites. Hacking Team servers in Uzbekistan were identified as early as 2014 by the Citizen Lab, and leaked Hacking Team emails later confirmed that the Uzbek National Security Service (SNB) was among the customers of Hacking Team solutions. A Privacy International report from 2015 describes the installation in Uzbekistan of several monitoring centers with mass surveillance capabilities, provided by the Israeli branch of the US-based company Verint Systems and by the Israel-based company NICE Systems. A 2017 Amnesty International report entitled ‘We will find you anywhere’ gives more context on the use of these capabilities, describing digital surveillance and targeted attacks against Uzbek journalists and human rights activists. Among other cases, it describes the unfortunate events behind the closure of uznews.net – an independent media website established by Galima Bukharbaeva in 2005 following the Andijan massacre. In 2014, she discovered that her email account had been hacked, and information about the organization, including names and personal details of journalists in Uzbekistan, was published online. Galima is now the editor of Centre1, a Deflect client and one of the targets of this investigation.

A New Phishing and Web Attack Campaign

On the 16th of November 2018, we identified a large attack against several websites protected by Deflect. This attack used several professional security audit tools like NetSparker and WPScan to scan the websites eltuz.com and centre1.com.


Peak of traffic during the attack (16th of November 2018)

This attack came from the IP address 51.15.94.245 (AS12876 – Online S.A.S., in an IP range dedicated to Scaleway servers). Looking at older traffic from this same IP address, we found several attacks on other Deflect protected websites, but we also found domains mimicking Google and Gmail domains hosted on this IP address, such as auth.login.google.email-service[.]host or auth.login.googlemail.com.mail-auth[.]top. We looked into passive DNS databases (using the PassiveTotal Community Edition and other tools like RobTex) and crossed that information with attacks seen on Deflect protected websites with logging enabled. We uncovered a large campaign combining web and phishing attacks against media and activists: the first evidence of activity from this group dates to February 2016, and the first evidence of attacks to December 2017.

The list of Deflect protected websites targeted by this campaign may give some context to the motivation behind it. Four websites were targeted:

  • Fergana News is a leading independent Russian & Uzbek language news website covering Central Asian countries
  • Eltuz is an independent Uzbek online media
  • Centre1 is an independent media organization covering news in Central Asia
  • Palestine Chronicle is a non-profit organization working on human-rights issues in Palestine

Three of these targets are prominent media focusing on Uzbekistan. We have been in contact with their editors and several other Uzbek activists to see if they had received phishing emails as part of this campaign. Some of them were able to confirm receiving such messages and forwarded them to us. Reaching out further afield, we were able to get confirmations of phishing attacks from other prominent Uzbek activists who were not linked to websites protected by Deflect.

Palestine Chronicle seems to be an outlier in this group of media websites focusing on Uzbekistan. We don’t have a clear hypothesis about why this website was targeted.

A year of web attacks against civil society

Through passive DNS, we identified three IPs used by the attackers in this operation:

  • 46.45.137.74 was used in 2016 and 2017 (timeline is not clear, Istanbul DC, AS197328)
  • 139.60.163.29 was used between October 2017 and August 2018 (HostKey, AS395839)
  • 51.15.94.245 was used between September 2018 and February 2019 (Scaleway, AS12876)

We have identified 15 attacks from the IPs 139.60.163.29 and 51.15.94.245 since December 2017 on Deflect protected websites:

Date | IP | Target | Tools used
2017/12/17 | 139.60.163.29 | eltuz.com | WPScan
2018/04/12 | 139.60.163.29 | eltuz.com | Acunetix
2018/09/15 | 51.15.94.245 | www.palestinechronicle.com, eltuz.com, www.fergana.info and uzbek.fergananews.com | Acunetix and WebCruiser
2018/09/16 | 51.15.94.245 | www.fergana.info | Acunetix
2018/09/17 | 51.15.94.245 | www.fergana.info | Acunetix
2018/09/18 | 51.15.94.245 | www.fergana.info | NetSparker and Acunetix
2018/09/19 | 51.15.94.245 | eltuz.com | NetSparker
2018/09/20 | 51.15.94.245 | www.fergana.info | Acunetix
2018/09/21 | 51.15.94.245 | www.fergana.info | Acunetix
2018/10/08 | 51.15.94.245 | eltuz.com, www.fergananews.com and news.fergananews.com | Unknown
2018/11/16 | 51.15.94.245 | eltuz.com, centre1.com and en.eltuz.com | NetSparker and WPScan
2019/01/18 | 51.15.94.245 | eltuz.com | WPScan
2019/01/19 | 51.15.94.245 | fergana.info, www.fergana.info and fergana.agency | Unknown
2019/01/30 | 51.15.94.245 | eltuz.com and en.eltuz.com | Unknown
2019/02/05 | 51.15.94.245 | fergana.info | Acunetix

Besides classic open source tools like WPScan, these attacks show the use of a wide range of commercial security audit tools, like NetSparker and Acunetix. Acunetix offers a trial version that may have been used here; NetSparker does not, suggesting that the operators may have a sizeable budget (the standard offer is $4,995/year), unless a cracked version was used.

It is also surprising to see so many different tools coming from a single server, as many of them require a graphical user interface. When we scanned the IP 51.15.94.245, we discovered that it hosted a Squid proxy on port 3128; we think this proxy was used to relay traffic from the operator’s own computer.

Extract of an nmap scan of 51.15.94.245 in December 2018:

3128/tcp  open     http-proxy Squid http proxy 3.5.23
|_http-server-header: squid/3.5.23
|_http-title: ERROR: The requested URL could not be retrieved

A large phishing campaign

After discovering a long list of domains made to resemble popular email providers, we suspected that the operators were also involved in a phishing campaign. We contacted the owners of targeted websites along with several Uzbek human rights activists, and gathered 14 different phishing emails targeting two activists between March 2018 and February 2019:

Date | Sender | Subject | Link
12th of March 2018 | g.corp.sender[@]gmail.com | У Вас 2 недоставленное сообщение (You have 2 undelivered messages) | http://mail.gmal.con.my-id[.]top/
13th of June 2018 | service.deamon2018[@]gmail.com | Прекращение предоставления доступа к сервису (Termination of access to the service) | http://e.mail.gmall.con.my-id[.]top/
18th of June 2018 | id.warning.users[@]gmail.com | Ваш новый адрес в Gmail: alexis.usa@gmail.com (Your new email address in Gmail: alexis.usa@gmail.com) | http://e.mail.users.emall.com[.]my-id.top/
10th of July 2018 | id.warning.daemons[@]gmail.com | Прекращение предоставления доступа к сервису (Termination of access to the service) | hxxp://gmallls.con-537d7.my-id[.]top/
10th of July 2018 | id.warning.daemons[@]gmail.com | Прекращение предоставления доступа к сервису (Termination of access to the service) | http://gmallls.con-4f137.my-id[.]top/
18th of July 2018 | service.deamon2018[@]gmail.com | [Ticket#2011031810000512] – 3 undelivered messages | http://login-auth-goglemail-com-7c94e3a1597325b849e26a0b45f0f068.my-id[.]top/
2nd of August 2018 | id.warning.daemon.service[@]gmail.com | [Important Reminder] Review your data retention settings | None
16th of October 2018 | lolapup.75[@]gmail.com | Экс-хоким Ташкента (Ex-hokim of Tashkent) | http://office-online-sessions-3959c138e8b8078e683849795e156f98.email-service[.]host/
23rd of October 2018 | noreply.user.info.id[@]gmail.com | Ваш аккаунт будет заблокировано (Your account will be blocked) | http://gmail-accounts-cb66d53c8c9c1b7c622d915322804cdf.email-service[.]host/
25th of October 2018 | warning.service.suspended[@]gmail.com | Ваш аккаунт будет заблокировано. (Your account will be blocked.) | http://gmail-accounts-bb6f2dfcec87551e99f9cf331c990617.email-service[.]host/
18th of February 2019 | service.users.blocked[@]gmail.com | Важное оповещение системы безопасности (Important Security Alert) | http://id-accounts-blocked-ac5a75e4c0a77cc16fe90cddc01c2499.myconnection[.]website/
18th of February 2019 | mail.suspend.service[@]gmail.com | Оповещения системы безопасности (Security Alerts) | http://id-accounts-blocked-326e88561ded6371be008af61bf9594d.myconnection[.]website/
21st of February 2019 | service.users.blocked[@]gmail.com | Ваш аккаунт будет заблокирован. (Your account will be blocked.) | http://id-accounts-blocked-ffb67f7dd7427b9e4fc4e5571247e812.myconnection[.]website/
22nd of February 2019 | service.users.blocked[@]gmail.com | Прекращение предоставления доступа к сервису (Termination of access to the service) | http://id-accounts-blocked-c23102b28e1ae0f24c9614024628e650.myconnection[.]website/

Almost all of these emails mimicked Gmail alerts to entice the user to click on the link. For instance, this email received on the 23rd of October 2018 pretends that the account will be closed soon, using images of text hosted on imgur to bypass Gmail detection:

The only exception was an email received on the 16th of October 2018, pretending to give confidential information on the former Hokim (governor) of Tashkent:

The emails used simple tricks to bypass detection, at times the drw.sh URL shortener (a tool belonging to the Russian security company Doctor Web) or open redirections offered in several Google tools.

Every email we have seen used a different sub-domain, including emails from the same Gmail account and with the same subject line. For instance, two different emails entitled “Прекращение предоставления доступа к сервису” and sent from the same address used hxxp://gmallls.con-537d7.my-id[.]top/ and http://gmallls.con-4f137.my-id[.]top/ as phishing domains. We think that the operators used a different sub-domain for every email sent in order to bypass Gmail’s list of known malicious domains. This would explain the large number of sub-domains identified through passive DNS: we have identified 74 sub-domains across 26 second-level domains used in this campaign (see the appendix below for the full list of discovered domains).

We think that each phishing page stayed online only for a short time after the corresponding email was sent, in order to avoid detection. We got access to the phishing pages of a few emails. We could confirm that the phishing toolkit checked submitted passwords against the actual Gmail account in real time, and we suspect that it also handled two-factor authentication codes (both text-message and app-based), but could not confirm this.

Timeline for the campaign

We found the first evidence of activity in this operation with the registration of the domain auth-login[.]com on the 21st of February 2016. Because we discovered the campaign recently, we have little information on attacks during 2016 and 2017, but the domain registration dates show some activity in July and December 2016, and then again in August and October 2017. It is very likely that the campaign started in 2016 and continued in 2017 without any public reporting about it.

Here is a first timeline we obtained based on domain registration dates and the dates of web attacks and phishing emails:

To confirm that this group had some activity during 2016 and 2017, we gathered encryption (TLS) certificates for these domains and sub-domains from the crt.sh Certificate Transparency database. We identified 230 certificates generated for these domains, most of them created by Cloudflare. Here is a new timeline integrating the creation of TLS certificates:

We see many certificates created from December 2016 onward and continuing through 2017, which shows that this group had some activity during that time. The large number of certificates over 2017 and 2018 comes from the campaign operators using Cloudflare for several domains: Cloudflare creates several short-lived certificates at the same time when protecting a website.

It is also interesting to note that the campaign started in February 2016, with some activity in the summer of 2016, which happens to be when the former Uzbek president Islam Karimov died – news first reported by Fergana News, one of the targets of this attack campaign.

Infrastructure Analysis

We identified the domains and subdomains of this campaign through analysis of passive DNS information, mostly using the Community access of PassiveTotal. Many domains in 2016/2017 reused the same registrant email address, b.adan1@walla.co.il, which helped us identify other domains related to this campaign:

Based on this list, we identified subdomains and the IP addresses associated with them, and discovered three IP addresses used in the operation. We used Shodan historical data and the dates of passive DNS data to estimate the timeline of the utilisation of the different servers:

  • 46.45.137.74 was used in 2016 and 2017
  • 139.60.163.29 was used between October 2017 and August 2018
  • 51.15.94.245 was used between September 2018 and February 2019

We have identified 74 sub-domains across 26 second-level domains used in this campaign (see the appendix for a full list of IOCs). Most of these domains mimic Gmail, but there are also domains mimicking Yandex (auth.yandex.ru.my-id[.]top), mail.ru (mail.ru.my-id[.]top), qip.ru (account.qip.ru.mail-help-support[.]info), Yahoo (auth.yahoo.com.mail-help-support[.]info), Live (login.live.com.mail-help-support[.]info) or rambler.ru (mail.rambler.ru.mail-help-support[.]info). Most of these are sub-domains of a few generic second-level domains (like auth-mail.com), but a few specific second-level domains are interesting:

  • bit-ly[.]host mimicking bit.ly
  • m-youtube[.]top and m-youtube[.]org for Youtube
  • ecoit[.]email, which could mimic https://www.ecoi.net
  • pochta[.]top, likely mimicking https://www.pochta.ru/, the Russian Post website
  • We have not found any information on vzlom[.]top and fixerman[.]top. Vzlom means “break into” in Russian, so it could have hosted or mimicked a security website

A weird Cyber-criminality Nexus

It is quite unusual to see connections between targeted attacks and cyber-criminal enterprises; however, during this investigation we encountered two such links.

The first one is with the domain msoffice365[.]win, which was registered by b.adan1@walla.co.il (as were many other domains in this campaign) on the 7th of December 2016. This domain was identified as a C2 server for a cryptocurrency theft tool called Quant, as described in a Forcepoint report released in December 2017. VirusTotal confirms that this domain hosted several samples of this malware in November 2017 (it was registered for a year). We have not seen any malicious activity from this domain related to our campaign, but, as explained earlier, we have only marginal visibility into the group’s activity in 2017.

The second link we found is between the domain auth-login[.]com and the groups behind the Bedep trojan and the Angler exploit kit. auth-login[.]com was linked to this operation through the subdomain login.yandex.ru.auth-login[.]com, which fits this campaign’s pattern of long subdomains mimicking Yandex and was hosted on the same IP address, 46.45.137.74, in March and April 2016 according to RiskIQ. This domain was registered in February 2016 by yingw90@yahoo.com (David Bowers from Grovetown, GA in the US, according to whois information). This email address was also used to register hundreds of domains used in a Bedep campaign, as described by Talos in February 2016 (and confirmed by several other reports). Angler is one of the most notorious exploit kits, commonly used by cyber-criminals between 2013 and 2016. Bedep is a generic backdoor, identified in 2015 and used almost exclusively with the Angler exploit kit. It should be noted that Trustwave documented the use of Bedep in 2015 to inflate the number of views of pro-Russian propaganda videos.

Even though we have not seen either of these two domains used in this campaign, the links seem too strong to be considered circumstantial. They could indicate a collaboration between cyber-criminal groups and state-sponsored groups or services. It is interesting to recall the potential involvement of Russian hacking groups in the 2014 attacks on the Uznews.net editor, as described by Amnesty International.

Taking Down Servers is Hard

When the attack was discovered, we decided to investigate without sending any abuse requests until a clearer picture of the campaign emerged. In January, deciding that we had gathered enough knowledge of the campaign, we started to send abuse requests – for the fake Gmail addresses to Google, and for the URL shorteners to Doctor Web. We did not receive any answer, but noticed that the Doctor Web URLs were taken down a few days later.

Regarding the Scaleway server, we entered an unexpected loop with their abuse process. Scaleway operates by forwarding the abuse request directly to the customer and then asking them to confirm that the issue has been resolved. This process works fine in the case of a compromised server, but not when the server was rented intentionally for malicious activities. We did not want to file a request through this process because it would have given away information to the operators. We contacted Scaleway directly, and it took some time to find the right person on the security team. They acknowledged the difficulty of running an efficient abuse process, and after we sent them an anonymized version of this report along with proof that phishing websites were hosted on the server, they took down the server around the 25th of January 2019.

Being an infrastructure provider, we understand the difficulty of dealing with abuse requests. For a lot of hosting providers, the number of requests is what makes a case urgent or not. We encourage hosting providers to better engage with organisations working to protect Civil Society and establish trust relationships that help quickly mitigate the effects of malicious campaigns.

Conclusion

In this report, we have documented a prolonged phishing and web attack campaign targeting media covering Uzbekistan and Uzbek human rights activists. It shows that, once again, digital attacks are a threat to human rights activists and independent media. Several threat actors are known to combine phishing and web attacks (like the Vietnam-related group Ocean Lotus), but this campaign shows a dual strategy of targeting civil society websites and their editors at the same time.

We have no evidence of government involvement in this operation, but these attacks are clearly aimed at prominent voices of Uzbek civil society. They also share strong similarities with the 2014 hack of Uznews.net, in which the editor’s mailbox was compromised through a phishing email posing as a notice from Google warning her that the account had been involved in distributing illegal pornography.

Over the past 10 years, organisations like the Citizen Lab and Amnesty International have dedicated a great deal of time and effort to documenting digital surveillance and targeted attacks against civil society. We hope that this report will contribute to these efforts and show that, today more than ever, we need to continue supporting civil society against digital surveillance and intrusion.

Counter-Measures Against such Attacks

If you think you are targeted by similar campaigns, here is a list of recommendations to protect yourself.

Against phishing attacks, it is important to learn to recognize classic phishing emails. We give some examples in this report, but you can read other similar reports by the Citizen Lab. You can also read this nice explanation by NetAlert and practice with this Google Jigsaw quiz. The second important point is to make sure that you have configured two-factor authentication on your email and social media accounts. Two-factor authentication means using a second way to authenticate when you log in, besides your password. Common second factors include text messages, temporary password apps and hardware tokens. We recommend using either temporary password apps (like Google Authenticator or FreeOTP) or hardware keys (like YubiKeys). Hardware keys are known to be more secure, and are strongly recommended if you are an at-risk activist or journalist.

Against web attacks, if you are using a CMS like WordPress or Drupal, it is very important to update both the CMS and its plugins regularly, and to avoid un-maintained plugins (outdated plugins are a very common cause of website compromise). Civil society websites are welcome to apply to Deflect for free website protection.

Appendix

Acknowledgement

We would like to thank Front Line Defenders and Scaleway for their help. We would also like to thank ipinfo.io and RiskIQ for their tools that helped us in the investigation.

Indicators of Compromise

Top-level domains:

email-service.host
email-session.host
support-email.site
support-email.host
email-support.host
myconnection.website
ecoit.email
my-cabinet.com
my-id.top
msoffice365-online.org
secretonline.top
m-youtube.top
auth-mail.com
mail-help-support.info
mail-support.info
auth-mail.me
auth-login.com
email-x.com
auth-mail.ru
mail-auth.top
msoffice365.win
bit-ly.host
m-youtube.org
vzlom.top
pochta.top
fixerman.top

You can find a full list of indicators on GitHub: https://github.com/equalitie/deflect_labs_6_indicators


Deflect Labs Report #5 – Baskerville

Using Machine Learning to Identify Cyber Attacks

The Deflect platform is a free website security service defending civil society and human rights groups from digital attack. Currently, malicious traffic is identified on the Deflect network by Banjax, a system that uses handwritten rules to flag IPs that are behaving like attacking bots, so that they can be challenged or banned. While Banjax is successful at identifying the most common brute-force cyber attacks, the approach of using a static set of rules to protect against the constantly evolving tools available to attackers is fundamentally limited. Over the past year, the Deflect Labs team has been working to develop a machine learning module to automatically identify malicious traffic on the Deflect platform, so that our mitigation efforts can keep pace with the methods of attack as these grow in complexity and sophistication.

In this report, we look at the performance of the Deflect Labs’ new anomaly detection tool, Baskerville, in identifying a selection of the attacks seen on the Deflect platform during the last year. Baskerville is designed to consume incoming batches of web logs (either live from a Kafka stream, or from Elasticsearch storage), group them into request sets by host website and IP, extract the browsing features of each request set, and make a prediction about whether the behaviour is normal or not. At its core, Baskerville currently uses the Scikit-Learn implementation of the Isolation Forest anomaly detection algorithm to conduct this classification, though the engine is agnostic to the choice of algorithm and any trained Scikit-Learn classifier can be used in its place. This model is trained on normal web traffic data from the Deflect platform, and evaluated using a suite of offline tools incorporated in the Baskerville module. Baskerville has been designed in such a way that once the performance of the model is sufficiently strong, it can be used for real-time attack alerting and mitigation on the Deflect platform.

To showcase the current capabilities of the Baskerville module, we have replayed the attacks covered in the 2018 Deflect Labs report: Attacks Against Vietnamese Civil Society, passing the web logs from these incidents through the processing and prediction engine. This report was chosen for replay because of the variety of attacks seen across its constituent incidents. There were eight attacks in total considered in this report, detailed in the table below.

Date       Start (approx.) Stop (approx.) Target
2018/04/17 08:00           10:00          viettan.org
2018/04/17 08:00           10:00          baotiengdan.com
2018/05/04 00:00           23:59          viettan.org
2018/05/09 10:00           12:30          viettan.org
2018/05/09 08:00           12:00          baotiengdan.com
2018/06/07 01:00           05:00          baotiengdan.com
2018/06/13 03:00           08:00          baotiengdan.com
2018/06/15 13:00           23:30          baotiengdan.com

Table 1: Attack time periods covered in this report. The time period of each attack was determined by referencing the number of Deflect and Banjax logs recorded for each site, relative to the normal traffic volume.

How does it work?

Given one request from one IP, not much can be said about whether or not that user is acting suspiciously, and thus how likely it is that they are a malicious bot, as opposed to a genuine user. If we instead group together all the requests to a website made by one IP over time, we can begin to build up a more complete picture of the user’s browsing behaviour. We can then train an anomaly detection algorithm to identify any IPs that are behaving outside the scope of normal traffic.

The boxplots below illustrate how the behaviour during the Vietnamese attack time periods differs from that seen during an average fortnight of requests to the same sites. To describe the browsing behaviour, 17 features (detailed in the Baskerville documentation) have been extracted based on the request sets (note that the feature values are scaled relative to average distributions, and do not have a physical interpretation). In particular, it can be seen that these attack time periods stand out by having far fewer unique paths requested (unique_path_to_request_ratio), a shorter average path depth (path_depth_average), a smaller variance in the depth of paths requested (path_depth_variance), and a lower payload size (payload_size_log_average). By the ‘path depth’, we mean the number of slashes in the requested URL (so ‘website.com’ has a path depth of zero, and ‘website.com/page1/page2’ has a path depth of two), and by ‘payload size’ we mean the size of the request response in bytes.
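To make the path-depth and unique-path definitions above concrete, here is a minimal, illustrative sketch (not Baskerville's actual code; the function names and the simplified log format are our own) that computes three of the features named above for a single request set:

```python
# Illustrative sketch: computing 'path depth' (number of path segments after
# the host, i.e. 'website.com' -> 0, 'website.com/page1/page2' -> 2) and two
# of the request-set features discussed in the text.
from urllib.parse import urlparse

def path_depth(url):
    # Prepend '//' so urlparse treats the first component as the host.
    path = urlparse("//" + url).path
    return len([seg for seg in path.split("/") if seg])

def request_set_features(urls):
    """Feature values for one (host, IP) request set: a list of requested URLs."""
    depths = [path_depth(u) for u in urls]
    mean = sum(depths) / len(depths)
    variance = sum((d - mean) ** 2 for d in depths) / len(depths)
    unique_paths = {urlparse("//" + u).path for u in urls}
    return {
        "path_depth_average": mean,
        "path_depth_variance": variance,
        "unique_path_to_request_ratio": len(unique_paths) / len(urls),
    }
```

A bot hammering the front page (`/` over and over) would score a path-depth average near zero and a very low unique-path ratio, which is exactly the attack signature described above.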

Figure 1: The distributions of the 17 scaled feature values during attack time periods (red) and non-attack time periods (blue). It can be seen that the feature distributions are notably different during the attack and non-attack periods.

The separation between the attack and non-attack request sets can be nicely visualised by projecting along the feature dimensions identified above. In the three-dimensional space defined by the average path depth, the average log of the payload size, and the unique path to request ratio, the request sets identified as malicious by Banjax (red) are clearly separated from those not identified as malicious (blue).

Figure 2: The distribution of request sets along three of the 17 feature dimensions for IPs identified as malicious (red) or benign (blue) by the existing banning module, Banjax. The features shown are the average path depth, the average log of the request payload size, and the ratio of unique paths to total requests, during each request set. The separation between the malicious (red) and benign (blue) IPs is evident along these dimensions.

Training a Model

A machine learning classifier enables us to more precisely define the differences between normal and abnormal behaviour, and to predict the probability that a new request set comes from a genuine user. For this report, we chose to train an Isolation Forest, an algorithm that performs well on novelty detection problems and scales well to large datasets.

As an anomaly detection algorithm, the Isolation Forest took as training data all the traffic to the Vietnamese websites over a normal two-week period. To evaluate its performance, we created a testing dataset by partitioning out a selection of this data (assumed to represent benign traffic), and combining this with the set of all requests coming from IPs flagged by the Deflect platform’s current banning tool, Banjax (assumed to represent malicious traffic). There are a number of tunable parameters in the Isolation Forest algorithm, such as the number of trees in the forest, and the assumed contamination with anomalies of the training data. Using the testing data, we performed a grid search over these parameters to optimize the model’s accuracy.
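The train-on-normal, score-on-labelled-data loop described above can be sketched as follows. This is a minimal illustration, not Baskerville's actual training code: the parameter grid and the synthetic data are our own assumptions, though `n_estimators` and `contamination` are real scikit-learn `IsolationForest` arguments.

```python
# Sketch of grid-searching an Isolation Forest trained on normal traffic and
# evaluated on a labelled test set, using scikit-learn's convention of
# -1 = anomalous/malicious and +1 = normal.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 17))              # normal request sets only
X_test = np.vstack([rng.normal(0, 1, size=(100, 17)),   # held-out normal traffic
                    rng.normal(5, 1, size=(100, 17))])  # synthetic anomalies
y_test = np.array([1] * 100 + [-1] * 100)

best_params, best_f1 = None, -1.0
for n_estimators in (50, 100, 200):
    for contamination in (0.01, 0.05, 0.1):
        model = IsolationForest(n_estimators=n_estimators,
                                contamination=contamination,
                                random_state=42).fit(X_train)
        score = f1_score(y_test, model.predict(X_test), pos_label=-1)
        if score > best_f1:
            best_params, best_f1 = (n_estimators, contamination), score

print("best params:", best_params, "f1:", round(best_f1, 3))
```

In practice the "malicious" test labels would come from Banjax bans rather than synthetic data, which is exactly the caveat discussed in the next section.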

Replaying the Attacks

The model chosen for use in this report has a precision of 0.90, a recall of 0.86, and a resultant f1 score of 0.88, when evaluated on the testing dataset formulated from the Vietnamese website traffic, described above. If we take the Banjax bans as absolute truth (which is almost certainly not the case), this means that 90% of the IPs predicted as anomalous by Baskerville were also flagged by Banjax as malicious, and that 86% of all the IPs flagged by Banjax as malicious were also identified as anomalous by Baskerville, across the attacks considered in the Vietnamese report. This is demonstrated visually in the graph below, which shows the overlap between the Banjax flag and the Baskerville prediction (-1 indicates malicious, and +1 indicates benign). It can be seen that Baskerville identifies almost all of the IPs picked up by Banjax, and additionally flags a fraction of the IPs not banned by Banjax.
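As a quick sanity check on the quoted figures, the f1 score is the harmonic mean of precision and recall, and with precision 0.90 and recall 0.86 it indeed rounds to 0.88:

```python
# f1 = harmonic mean of precision and recall.
precision, recall = 0.90, 0.86
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # -> 0.88
```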

Figure 3: The overlap between the Banjax results (x-axis) and the Baskerville prediction results (colouring). Where the Banjax flag is -1 and the prediction colour is red, both Banjax and Baskerville agree that the request set is malicious. Where the Banjax flag is +1 and the prediction colour is blue, both modules agree that the request set is benign. The small slice of blue where the Banjax flag is -1, and the larger red slice where the Banjax flag is +1, indicate request sets about which the modules do not agree.

The performance of the model can be broken down across the different attack time periods. The grouped bar chart below compares the number of Banjax bans (red) to the number of Baskerville anomalies (green). In general, Baskerville identifies a much greater number of request sets as being malicious than Banjax does, with the exception of the 17th April attack, for which Banjax picked up slightly more IPs than Baskerville. The difference between the two mitigation systems is particularly pronounced on the 13th and 15th June attacks, for which Banjax scarcely identified any malicious IPs at all, but Baskerville identified a high proportion of malicious IPs.

Figure 4: The verdicts of Banjax (left columns) and Baskerville (right columns) across the 6 attack periods. The red/green components show the number of request sets that Banjax/Baskerville labelled as malicious, while the blue/purple components show the number that they labelled as benign. The fact that the green bars are almost everywhere higher than the red bars indicates that Baskerville picks up more traffic as malicious than does Banjax.

This analysis highlights the issue of model validation. It can be seen that Baskerville is picking up more request sets as being malicious than Banjax, but does this indicate that Baskerville is too sensitive to anomalous behaviour, or that Baskerville is outperforming Banjax? In order to say for sure, and properly evaluate Baskerville’s performance, a large testing set of labelled data is needed.

If we look at the mean feature values across the different attacks, it can be seen that the 13th and 15th June attacks (the red and blue dots, respectively, in the figure below) stand out from the normal traffic in that they have a much lower than normal average path depth (path_depth_average), and a much higher than normal 400-code response rate (response4xx_to_request_ratio), which may have contributed to Baskerville identifying a large proportion of their constituent request sets as malicious. Since a low average path depth (e.g. lots of requests made to ‘/’) and a high 400 response code rate (e.g. lots of requests to non-existent pages) are indicative of an IP behaving maliciously, this may suggest that Baskerville’s predictions were valid in these cases. But more labelled data is required for us to be certain about this evaluation.

Figure 5: Breakdown of the mean feature values during the two attack periods (red, blue) for which Baskerville identified a high proportion of malicious IPs, but Banjax did not. These are compared to the mean feature values during a normal two-week period (green).

Putting Baskerville into Action

Replaying the Vietnamese attacks demonstrates that it is possible for the Baskerville engine to identify cyber attacks on the Deflect platform in real time. While Banjax mitigates attacks using a set of static human-written rules describing what abnormal traffic looks like, by comprehensively describing how normal traffic behaves, the Baskerville classifier is able to identify new types of malicious behaviour that have never been seen before.

Although the performance of the Isolation Forest in identifying the Vietnamese attacks is promising, we would require a higher level of accuracy before the Baskerville engine is used to automatically ban IPs from accessing Deflect websites. The model’s accuracy can be improved by increasing the amount of data it is trained on, and by performing additional feature engineering and parameter tuning. However, to accurately assess its skill, we require a large set of labelled testing data, more complete than what is offered by Banjax logs. To this end, we propose to first deploy Baskerville in a developmental stage, during which IPs that are suspected to be malicious will be served a Captcha challenge rather than being absolutely banned. The results of these challenges can be added to the corpus of labelled data, providing feedback on Baskerville’s performance.

In addition to the applications of Baskerville for attack mitigation on the Deflect platform, by grouping incoming logs by host and IP into request sets, and extracting features from these request sets, we have created a new way to visualise and analyse attacks after they occur. We can compare attacks not just by the IPs involved, but also by the type of behaviour displayed. This opens up new possibilities for connecting disparate attacks, and investigating the agents behind them.

Where Next?

The proposed future of Deflect monitoring is the Deflect Labs Information Sharing and Analysis Centre (DL-ISAC). The underlying idea behind this project, summarised in the schematic below, is to split the Baskerville engine into separate User Module and Clearinghouse components (dealing with log processing and model development, respectively), to enable a complete separation of personal data from the centralised modelling. Users would process their own web logs locally, and send off feature vectors (devoid of IP and host site details) to receive a prediction. This allows threat-sharing without compromising personally identifiable information (PII). In addition, this separation would enable the adoption of the DL-ISAC by a much broader range of clients than the Deflect-hosted websites currently being served. Increasing the user base of this software will also increase the amount of browsing data we are able to collect, and thus the strength of the models we are able to train.

Baskerville is an open-source project, with its first release scheduled for next quarter. We hope this will represent the first step towards enabling a new era of crowd-sourced threat information sharing and mitigation, empowering internet users to keep their content online in an increasingly hostile web environment.

Figure 6: A schematic of the proposed structure of the DL-ISAC. The infrastructure is split into a log-processing user endpoint, and a central clearinghouse for prediction, analysis, and model development.

A Final Word: Bias in AI

In all applications of machine learning and AI, it is important to consider sources of algorithmic bias, and how marginalised users could be unintentionally discriminated against by the system. In the context of web traffic, we must take into account variations in browsing behaviour across different subgroups of valid, non-bot internet users, and ensure that Baskerville does not penalise underrepresented populations. For instance, checks should be put in place to prevent disadvantaged users with slower internet connections from being banned because their request behaviour differs from those users that benefit from high-speed internet. The Deflect Labs team is committed to prioritising these considerations in the future development of the DL-ISAC.