Introducing OpenStack Event Listener

March 31, 2018

Published on Superuser.


I wanted to write a little about a project that I enjoyed working on called the OpenStack Event Listener, or OSEL for short. This project bridges the OpenStack control plane and an external scanning facility provided by Qualys to initiate batched on-demand scanning of OpenStack instances when security group changes happen. There were a number of interesting challenges. I was never able to really concentrate on it - this project took about 20 percent of my time for a period of three months. I started based on an initial proof-of-concept written by Charles Bitter, and also solicited contributions from Olivier Gagnon and Joseph Sleiman. I offer this partially as catharsis, to allow my brain to mark this part of my mental inventory as ripe for reclamation. I’m also writing on the off chance that someone might find this useful.

The setting

Let me paint a picture of the environment in which this development occurred. The Comcast OpenStack environment was transitioning from the OpenStack Icehouse release (very old) to the Newton release (much more current). This development occurred within the context of the Icehouse environment. Comcast’s security team uses S3 RiskFabric to manage auditing and tracking security vulnerabilities across the board. They also engage the services of Qualys to perform network scanning (in a manner very similar to Nessus) once a day against all the CIDR blocks that comprise Comcast’s Internet-routable IP addresses. Qualys scanning could also be triggered on-demand.

Technical requirements

First, let me describe the technical requirements for OSEL:

  • OSEL would connect to the OpenStack RabbitMQ message bus and register as a listener for “notification” events. This would allow OSEL to inspect all events, including security group changes.
  • When a security group change occurred, OSEL would ensure that it had the details of the change (ports permitted or blocked) as well as a list of all affected IP addresses.
  • OSEL would initiate a Qualys scan using the Qualys API. This would return a scan ID.
  • OSEL would log the change as well as the Qualys scan ID to the Security instance of Splunk to create an audit trail.
  • Qualys scan results would be imported into S3 RiskFabric for security audit management.

Implementation approach

My group does most of its development in Go, and it was a good fit for this project by virtue of its ability to handle the stream of messages from RabbitMQ. This is what the data I was getting back from an AMQP message looked like. All identifiers have been scrambled.

{
    "_context_roles":[
        "Member"
    ],
    "_context_request_id":"req-f96ea9a5-435e-4177-8e51-bfe60d0fae2a",
    "event_type":"security_group_rule.create.end",
    "timestamp":"2016-10-03 18:10:59.112712",
    "_context_tenant_id":"ada3b9b06482909f9361e803b54f5f32",
    "_unique_id":"eafc9362327442b49d8c03b0e88d0216",
    "_context_tenant_name":"EXAMPLEPROJECT",
    "_context_user":"bca89c1b248e4a78282899ece9e744cc54",
    "_context_user_id":"bca89c1b248e4a78282899ece9e744cc54",
    "payload":{
        "security_group_rule_id":"bf8318fc-f9cb-446b-ffae-a8de016c562"
    },
    "_context_project_name":"EXAMPLEPROJECT",
    "_context_read_deleted":"no",
    "_context_tenant":"ada3b9b06482909f9361e803b54f5f32",
    "priority":"INFO",
    "_context_is_admin":false,
    "_context_project_id":"ada3b9b06482909f9361e803b54f5f32",
    "_context_timestamp":"2016-10-03 18:10:59.079179",
    "_context_user_name":"admin",
    "publisher_id":"network.osctrl1",
    "message_id":"e75fb2ee-85bf-44ba-a083-2445eca2ae10"
}
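A message like the one above decodes into a small Go struct. The following is a minimal sketch rather than OSEL's actual code: the `Notification` type and the `ParseNotification` and `IsSecurityGroupRuleChange` helpers are names invented for illustration, and the sketch assumes the message body arrives as the flat JSON object shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Notification keeps only the fields this sketch needs.
// The type name and field selection are assumptions, not OSEL's own.
type Notification struct {
	EventType   string `json:"event_type"`
	TenantID    string `json:"_context_tenant_id"`
	ProjectName string `json:"_context_project_name"`
	Payload     struct {
		SecurityGroupRuleID string `json:"security_group_rule_id"`
	} `json:"payload"`
}

// ParseNotification decodes a raw message body into a Notification.
func ParseNotification(body []byte) (Notification, error) {
	var n Notification
	err := json.Unmarshal(body, &n)
	return n, err
}

// IsSecurityGroupRuleChange reports whether the event marks the end of a
// security group rule operation, e.g. "security_group_rule.create.end".
func IsSecurityGroupRuleChange(n Notification) bool {
	return strings.HasPrefix(n.EventType, "security_group_rule.") &&
		strings.HasSuffix(n.EventType, ".end")
}

func main() {
	// A shortened copy of the scrambled message shown in the article.
	body := []byte(`{
		"event_type": "security_group_rule.create.end",
		"_context_tenant_id": "ada3b9b06482909f9361e803b54f5f32",
		"_context_project_name": "EXAMPLEPROJECT",
		"payload": {"security_group_rule_id": "bf8318fc-f9cb-446b-ffae-a8de016c562"}
	}`)

	n, err := ParseNotification(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	if IsSecurityGroupRuleChange(n) {
		fmt.Println("rule", n.Payload.SecurityGroupRuleID, "changed in", n.ProjectName)
	}
}
```

Decoding only the fields of interest keeps the listener indifferent to the many `_context_*` keys, which `encoding/json` ignores when they have no matching struct field.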

You can see that this notification records the creation of a security group rule (“event_type”:“security_group_rule.create.end”): rule “bf8318fc-f9cb-446b-ffae-a8de016c562” in project “EXAMPLEPROJECT”. That does not tell us much, sadly. In order to resolve which IP addresses were affected when this security group rule was created, OSEL queries neutron for all ports in that tenant, determines the IP address and associated security groups for each, and returns the list of IP addresses associated with the security group for which the rule was created.

Qualys is a service where you pay a certain amount of money and get a given number of API requests per time period. I did not find a maximum size for a single API request, so OSEL implements a batching system: all requests that come in during a given time period are queued until a configurable interval elapses, then they are discharged in a single API request. Batching ensures that you can set a pace that does not exceed the number of API requests for which you have paid. This somewhat negates the real-time-scanning aspect of OSEL, but it is necessitated by fiscal responsibility.
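The batching idea can be sketched in a few lines of Go. This is an illustration under assumptions, not the OSEL implementation: the `Batcher` type and its methods are hypothetical names, the `submit` callback stands in for the real Qualys API call, and collapsing duplicate addresses within a batch is my own addition.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Batcher accumulates IP addresses between flushes (hypothetical type).
type Batcher struct {
	mu      sync.Mutex
	pending map[string]struct{}
}

// NewBatcher returns an empty Batcher.
func NewBatcher() *Batcher {
	return &Batcher{pending: make(map[string]struct{})}
}

// Add queues addresses for the next batch; duplicates collapse into one entry.
func (b *Batcher) Add(ips ...string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ip := range ips {
		b.pending[ip] = struct{}{}
	}
}

// Flush returns the queued addresses in sorted order and empties the queue.
func (b *Batcher) Flush() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	ips := make([]string, 0, len(b.pending))
	for ip := range b.pending {
		ips = append(ips, ip)
	}
	sort.Strings(ips)
	b.pending = make(map[string]struct{})
	return ips
}

// Run flushes once per interval until done is closed, handing each
// non-empty batch to submit, which stands in for a single scan request.
func (b *Batcher) Run(interval time.Duration, done <-chan struct{}, submit func([]string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if ips := b.Flush(); len(ips) > 0 {
				submit(ips)
			}
		case <-done:
			return
		}
	}
}

func main() {
	b := NewBatcher()
	b.Add("10.0.0.5", "10.0.0.7") // from one security group rule change
	b.Add("10.0.0.5")             // from another change in the same interval

	done := make(chan struct{})
	go b.Run(50*time.Millisecond, done, func(ips []string) {
		fmt.Println("one scan request for:", ips)
	})
	time.Sleep(120 * time.Millisecond)
	close(done)
}
```

With this shape, the interval passed to `Run` is the single knob that bounds how many API requests are made per time period, regardless of how many security group changes arrive.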

Testing pattern

I leaned heavily on dependency injection to make this code as testable as possible. For example, I needed an object that would contain the persistent `syslog.Writer`. I created a `SyslogActioner` interface to represent all interactions with syslog. When the code is operating normally, interactions with syslog occur through methods of the `SyslogActions` struct, but in unit testing mode the `SyslogTestActions` struct is used instead. `SyslogTestActions` only saves copies of the messages that would have been sent, so that tests can compare them against the intended messages.
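Here is a rough sketch of that pattern. The three type names come from the description above, but the `Info` method, the `Messages` field, and the `LogScan` caller are assumptions made for illustration; the real interface may well look different.

```go
package main

import (
	"fmt"
	"log/syslog"
)

// SyslogActioner represents every interaction the program has with syslog.
// The single Info method is an assumed, simplified method set.
type SyslogActioner interface {
	Info(msg string) error
}

// SyslogActions is the production implementation and wraps the
// persistent syslog.Writer.
type SyslogActions struct {
	writer *syslog.Writer
}

// Info forwards the message to the real syslog writer.
func (s *SyslogActions) Info(msg string) error {
	return s.writer.Info(msg)
}

// SyslogTestActions is the test double: it records messages instead of
// sending them anywhere.
type SyslogTestActions struct {
	Messages []string
}

// Info saves a copy of the message for later comparison.
func (s *SyslogTestActions) Info(msg string) error {
	s.Messages = append(s.Messages, msg)
	return nil
}

// Compile-time checks that both structs satisfy the interface.
var (
	_ SyslogActioner = (*SyslogActions)(nil)
	_ SyslogActioner = (*SyslogTestActions)(nil)
)

// LogScan is a hypothetical caller that only knows about the interface.
func LogScan(logger SyslogActioner, ruleID, scanID string) error {
	return logger.Info(fmt.Sprintf("security_group_rule=%s qualys_scan=%s", ruleID, scanID))
}

func main() {
	// In a unit test the fake is injected; no syslog daemon is needed.
	fake := &SyslogTestActions{}
	if err := LogScan(fake, "rule-1", "scan-42"); err != nil {
		fmt.Println("logging failed:", err)
		return
	}
	fmt.Println(fake.Messages)
}
```

Because `LogScan` depends only on `SyslogActioner`, production code can pass a `SyslogActions` built around a `syslog.Writer` while tests pass the recording struct and inspect `Messages` afterwards.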

Fate of the project

The OSEL project was implemented and installed into production. There were two problems with it.

The first problem to become visible was the lack of an exponential backoff for the AMQP connection to the OpenStack control plane’s RabbitMQ. When RabbitMQ had issues - which was surprisingly often - OSEL would hammer away, trying to reconnect. That alone would not have been much of an issue; despite what was effectively an infinite loop, CPU usage was not extreme. The real problem was that every connection failure was logged, and the logs could grow to several gigabytes in a matter of hours. This was mitigated by the OpenStack operations team rotating the logs hourly, and alerting if an hour’s worth of logs exceeded a set size.

The second - and fatal - issue was that S3 RiskFabric was not configured to ingest Qualys scan results more than once a day. Since Qualys was already scanning the CIDR block that corresponded to our OpenStack instances once a day, we were essentially just adding noise to the system. The frequency of the S3-Qualys imports could not be easily altered, and as a result the project was shelved.

Remaining work

If OSEL were ever to be un-shelved, here are a few things that I wish I had time to implement:

  • Neutron Port Events: The initial release of OSEL processed only security group rule additions, modifications, or deletions. That covered the base case for when a security group was already associated with a set of OpenStack Networking (neutron) ports. A scan should be similarly launched when a new port is created and associated to a security group. This is what happens when a new host is created.
  • Integrate with the really awesome Firewall as a Service project.
  • Modern OpenStack: In order to make this work with a more modern OpenStack, it would be best to integrate with events generated through Aodh. Aodh is built for this kind of reporting.
  • Implement exponential backoff for AMQP connections as mentioned earlier.

Get involved

If you are interested in contributing to the project, clone the project source from the OpenStack git repository and submit changes using the standard Gerrit-based OpenStack review process!

© 2018 Nate Johnston