Posts for Saturday, May 18, 2013

Commandline SELinux policy helper functions

To work on SELinux policies, I use a couple of shell functions that I can call from the command line: seshowif, sefindif, seshowdef and sefinddef. The idea behind them is that I want to search for (find) an interface (if) or definition (def) that contains a particular method or call. Or, if I know the name of the interface or definition, I want to see its content (show).

For instance, to find the name of the interface that allows us to define file transitions from the postfix_etc_t label:

$ sefindif filetrans.*postfix_etc
contrib/postfix.if: interface(`postfix_config_filetrans',`
contrib/postfix.if:     filetrans_pattern($1, postfix_etc_t, $2, $3, $4)

Or to show the content of the corenet_tcp_bind_http_port interface:

$ seshowif corenet_tcp_bind_http_port
interface(`corenet_tcp_bind_http_port',`
        gen_require(`
                type http_port_t;
        ')

        allow $1 http_port_t:tcp_socket name_bind;
        allow $1 self:capability net_bind_service;
')

For the definitions, this is quite similar:

$ sefinddef socket.*create
obj_perm_sets.spt:define(`create_socket_perms', `{ create rw_socket_perms }')
obj_perm_sets.spt:define(`create_stream_socket_perms', `{ create_socket_perms listen accept }')
obj_perm_sets.spt:define(`connected_socket_perms', `{ create ioctl read getattr write setattr append bind getopt setopt shutdown }')
obj_perm_sets.spt:define(`create_netlink_socket_perms', `{ create_socket_perms nlmsg_read nlmsg_write }')
obj_perm_sets.spt:define(`rw_netlink_socket_perms', `{ create_socket_perms nlmsg_read nlmsg_write }')
obj_perm_sets.spt:define(`r_netlink_socket_perms', `{ create_socket_perms nlmsg_read }')
obj_perm_sets.spt:define(`client_stream_socket_perms', `{ create ioctl read getattr write setattr append bind getopt setopt shutdown }')

$ seshowdef manage_files_pattern
define(`manage_files_pattern',`
        allow $1 $2:dir rw_dir_perms;
        allow $1 $3:file manage_file_perms;
')

I have these defined in my ~/.bashrc (they are simple functions) and use them on a daily basis ;-) If you want to learn a bit more about developing SELinux policies for Gentoo, make sure you read the Gentoo Hardened SELinux Development guide.
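As a rough idea of what such helpers can look like, here is a minimal sketch of the two interface functions. It is not the exact code from my ~/.bashrc: the POLICY_MODULES variable and its default path are assumptions, so point it at the policy/modules directory of your own policy checkout.

```shell
# Assumed location of the policy modules; POLICY_MODULES is a hypothetical
# variable name, adjust it to your own policy checkout.
POLICY_MODULES="${POLICY_MODULES:-/usr/share/selinux/refpolicy/policy/modules}"

# sefindif REGEX: print every line in the *.if files that matches REGEX,
# preceded by the header of the interface or template it belongs to.
sefindif() (
    cd "$POLICY_MODULES" || exit 1
    for file in */*.if; do
        [ -f "$file" ] || continue
        awk -v re="$1" -v f="$file" -v q="'" '
            /^(interface|template)\(/ { header = $0; shown = 0 }
            $0 ~ re {
                if (!shown && header != "" && $0 != header) print f ": " header
                shown = 1
                print f ": " $0
            }
            # A line starting with the closing quote and paren ends the block.
            substr($0, 1, 2) == q ")" { header = ""; shown = 0 }
        ' "$file"
    done
)

# seshowif NAME: print the full definition of the interface or template NAME.
seshowif() (
    cd "$POLICY_MODULES" || exit 1
    for file in */*.if; do
        [ -f "$file" ] || continue
        awk -v n="$1" -v q="'" '
            /^(interface|template)\(/ && index($0, "(`" n q ",") { p = 1 }
            p { print }
            p && substr($0, 1, 2) == q ")" { p = 0 }
        ' "$file"
    done
)
```

The definition variants (sefinddef, seshowdef) would follow the same pattern, but look for define( lines in the support files (*.spt) instead.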

Posts for Friday, May 17, 2013

Looking at the local Linux kernel privilege escalation

There have been a few posts already on the local Linux kernel privilege escalation, which has received the ID CVE-2013-2094. Ars Technica has a write-up with links to good resources on the Internet, but I definitely want to point readers to the explanation that Brad Spengler made of the vulnerability.

In short, the vulnerability is an out-of-bounds access to an array within the Linux perf code (a performance measuring subsystem that is built when CONFIG_PERF_EVENTS is set). This subsystem is often enabled as it offers a wide range of performance measurement techniques (see its wiki for more information). You can check on your own system through the kernel configuration (zgrep CONFIG_PERF_EVENTS /proc/config.gz if you have the latter pseudo-file available; it is made available through CONFIG_IKCONFIG_PROC).
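If /proc/config.gz is not available, the same check can be run against any saved kernel configuration file. The following is a small sketch; the function name kconfig_has is made up for this example, and where your distribution stores a copy of the kernel configuration is something you will need to look up yourself.

```shell
# kconfig_has OPTION [FILE]: succeed if OPTION is set to y or m in FILE.
# FILE defaults to /proc/config.gz; pass another path if your kernel
# was built without CONFIG_IKCONFIG_PROC.
kconfig_has() {
    opt="$1"
    cfg="${2:-/proc/config.gz}"
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 2; }
    # Decompress only when the file name says it is gzip-compressed.
    case "$cfg" in
        *.gz) gzip -dc "$cfg" ;;
        *)    cat "$cfg" ;;
    esac | grep -Eq "^${opt}=(y|m)\$"
}
```

Usage would be along the lines of: kconfig_has CONFIG_PERF_EVENTS && echo "perf events are enabled".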

The public exploit maps memory in userland, fills it with known data, then triggers an out-of-bounds decrement that tricks the kernel into decrementing this data (mapped in userland). By looking at where the decrement occurred, the exploit now knows the base address of the array. Next, it targets (through the same vulnerability) the IDT (Interrupt Descriptor Table) and, within it, the overflow interrupt vector. It increments the top part of the address that the vector points to (which is 0xffffffff, becoming 0x00000000 and thus pointing into userland), maps this memory region itself with shellcode, and then triggers the overflow. The shellcode used in the public exploit modifies the credentials of the current task, sets the uid/gid to root and gives full capabilities, and then executes a shell.
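The wrap-around of the upper address half is plain 32-bit arithmetic. The snippet below only illustrates that one step; the lower half (0x81234567) is a made-up value for the example and does not come from a real system.

```shell
# Hypothetical 64-bit handler address, split into two 32-bit halves.
hi=0xffffffff   # upper half: kernel space
lo=0x81234567   # lower half: made-up value for illustration

# A single increment of the upper half wraps around to zero.
new_hi=$(( (hi + 1) & 0xffffffff ))

printf 'before: 0x%08x%08x\n' "$hi" "$lo"
printf 'after:  0x%08x%08x\n' "$new_hi" "$lo"
```

The resulting address has all upper bits cleared, which places it in the range that a userland process can map.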

As Brad mentions, UDEREF (an option in a grSecurity enabled kernel) should mitigate the attempt to get to the userland. On my system, the exploit fails with the following (start of) oops (without affecting the system further) when it tries to close the file descriptor returned from the syscall that invokes the decrement:

[ 1926.226678] PAX: please report this to pageexec@freemail.hu
[ 1926.227019] BUG: unable to handle kernel paging request at 0000000381f5815c
[ 1926.227019] IP: [<ffffffff811016ba>] sw_perf_event_destroy+0x1a/0xa0
[ 1926.227019] PGD 58a7c000 
[ 1926.227019] Thread overran stack, or stack corrupted
[ 1926.227019] Oops: 0002 [#4] PREEMPT SMP 
[ 1926.227019] Modules linked in: libcrc32c
[ 1926.227019] CPU 0 
[ 1926.227019] Pid: 4267, comm: test Tainted: G      D      3.8.7-hardened #1 Bochs Bochs
[ 1926.227019] RIP: 0010:[<ffffffff811016ba>]  [<ffffffff811016ba>] sw_perf_event_destroy+0x1a/0xa0
[ 1926.227019] RSP: 0018:ffff880058a03e08  EFLAGS: 00010246
...

The exploit also finds that the decrement didn’t succeed:

test: semtex.c:76: main: Assertion 'i<0x0100000000/4' failed.

A second mitigation is KERNEXEC (also offered through grSecurity), which prevents the kernel from executing data that is writable (including userland data). So modifying the IDT would be mitigated as well.

Another important mitigation is TPE (Trusted Path Execution). This feature prevents users who are not in a trusted group (which on my system is 10 = wheel) from executing binaries that are not located in a root-owned directory. Such users attempting to execute the exploit code will fail with a Permission denied error, and the following is shown in the logs:

[ 3152.165780] grsec: denied untrusted exec (due to not being in trusted group and file in non-root-owned directory) of /home/user/test by /home/user/test[bash:4382] uid/euid:1000/1000 gid/egid:100/100, parent /bin/bash[bash:4352] uid/euid:1000/1000 gid/egid:100/100

However, even though a nicely hardened system should be fairly immune against the currently circulating public exploit, it should be noted that it is not immune against the vulnerability itself. The methods mentioned above make that particular way of gaining root access impossible, but an attacker can still decrement and increment memory in specific locations, so other exploits might be found to modify the system.

Now, out-of-bounds vulnerabilities are not new. Recently (February this year), a vulnerability in the networking code also provided an attack vector to get a local privilege escalation. A mandatory access control system like SELinux has little impact on such vulnerabilities if you allow users to execute their own code. Even confined users can modify the exploit to disable SELinux (since the shellcode is run with ring0 privileges, it can access and modify the SELinux state information in the kernel).

Many thanks to Brad for the excellent write-up, and to the Gentoo Hardened team for providing the grSecurity PaX/TPE protections in its hardened-sources kernel.

Posts for Thursday, May 16, 2013

Gentoo Hardened spring notes

We got back together on the #gentoo-hardened chat channel to discuss the progress of Gentoo Hardened, so it’s time for another write-up of what was said.

Toolchain

GCC 4.8.1 will be out soon, although nothing major has occurred with it since the last meeting. There is a plugin header install problem in 4.8 and it’s not certain that the (trivial) fix is in 4.8.1, but it certainly is inside Gentoo’s release.

Blueness is also (still, and hopefully for a long time ;-) maintaining the uclibc hardened related toolchain aspects.

Kernel and grSecurity/PaX

Further progress on the XATTR_PAX migration was given a lower priority the past few weeks due to busy, busy… very busy weeks (but this was announced and known in advance). We still need to do XATTR copying in install for packages that do pax markings before src_install() and include the user.pax XATTR patch in the gentoo-sources kernel. This will silence the errors for non-hardened users and fix the loss of XATTR markings for those packages that do pax-mark before install.

The set then needs to be documented further and tested on vanilla and hardened systems.

Zorry asked if a separate script can be provided for those ebuilds that directly call paxctl. These ebuilds might want to switch to the eclass, but if they need to call paxctl or similar directly (for instance because the result is immediately used for further building), a separate script or tool should be made available. Blueness will look into this.

On hardened-sources, the stable versions are now 2.6.32-r160, 3.2.42-r1 and 3.8.6 due to some vulnerabilities in earlier versions (in networking code). There is still an NFS-related bug that is fixed in 3.2.44, so that branch might need a bump soon as well.

SELinux

The selocal command is now available for Gentoo SELinux users, allowing them to easily enhance the policy without having to maintain their own SELinux policy modules (the script is a wrapper that does all that).

The setools package now also uses the SLOT’ed swig, so no more dependency breakage.

On SELinux userspace and policy, both have seen a new release last month, and both are already in the Gentoo portage tree.

Finally, the SELinux policy ebuilds now also call epatch_user so users can customize the policies even further without having to copy ebuilds to their overlay.

Now that tar supports XATTR well, we might want to look into SELinux stages again. Jmbsvicetto did some work on that, but the builds failed during stage1. We’ll look into that later.

Integrity

Nothing much to say, we’re waiting a bit until the patches proposed by the IMA team are merged in the main kernel.

Profiles

Two no-multilib fixes have been applied to the hardened/amd64/no-multilib profiles. One was a QA issue and quickly resolved, the other is due to the profile stacking within Gentoo profiles, where we missed a profile and thus were missing a few masks defined in that (missed) profile. But including the profile creates a lot of duplicates again, so we are going to copy the masks across until the duplicates are resolved in the other profiles.

Blueness will also clean up the experimental 13.0 directory since all hardened profiles now follow 13.0.

Docs

The latest changes on SELinux have been added to the Gentoo SELinux handbook. Also, I’ve been slowly (but surely) adding topics to the SELinux tutorials listing on the Gentoo wiki.

The grSecurity 2 document is very much out of date; blueness hopes to put some time into fixing that soon.

So that’s about it for the short write-up. Zorry will surely post the log later on the appropriate channels. Good work done (again) by all team members!

Programming is Terrible

Video: http://www.youtube.com/embed/csyL9EC0S0c


Video on GObject bindings and Vala

Video: http://www.youtube.com/embed/6QrGmA_RR4E

Tal Liron explains how to generate GObject Introspection bindings for Vala.


Paludis 1.4.0 Released

Paludis 1.4.0 has been released:

  • Tweaked ‘cave resolve’ output to add blank lines.
  • Support for libarchive 3.1.2.
  • Compatibility fixes for GCC 4.8.


Public support channels: irc

I’ve said it before – support channels for free software are often (imo) superior to the commercial support that you might get with vendors. And although those vendors often try to use “modern” techniques, I fail to see why the old, but proven/stable methods would be wrong.

Consider the “Chat with Support” feature that many vendors have on their site. Often, these services use a web browser based, AJAX-driven method for talking with support engineers. The problem I see with this is that it is difficult to keep track of the feedback you got over time (unless you manually copy/paste the information), and again that it isn’t public. With free software communities, we still often redirect such “online” support requests to IRC.

Internet Relay Chat has been around for ages (since 1988 according to Wikipedia) and is still quite active. Gentoo has all of its support channels on the freenode IRC network: a community-driven, active #gentoo channel which often crosses the 1000 users mark, a #gentoo-dev development-related channel where many developers communicate, the #gentoo-hardened channel for all questions and support regarding Gentoo Hardened specifics, etc.

Using IRC has many advantages. One is that logs can be kept (either individually or by the project itself) that can be queried later by the people who want to provide support (to see if questions have already been popping up, see what the common questions are for the last few days, etc.) or get support (to see if their question was already answered in the past). Of course, these logs can be made public through web interfaces quite easily. For users, such log functionality is offered through the IRC client. Another very simple, yet interesting feature is highlighting: give the set of terms for which you want to be notified (usually through a highlight and a specific notification in the client), making it easier to be on multiple channels without having to constantly follow up on all discussions.

Another advantage is that there is such a thing as “bots”. Most Gentoo related channels do not allow active bots on the channels except for the project-approved ones (such as willikens). These bots can provide project-specific help to users and developers alike:

  • Give one-line information about bugs reported on bugzilla (id, assignee, status, but also the URL where the user/developer can view the bug etc.)
  • Give meta information about a package (maintainer, herd, etc.), herd (members), GLSA details, dependency information, etc.
  • Allow users to query if a developer is away or not
  • Create notes (messages) for users that are not online yet but for which you know they come online later (and know their nickname or registered username)
  • Notify when commits are made, or when tweets are sent that match a particular expression, etc.

Furthermore, the IRC protocol has many features that are very interesting to use in free software communities as well. You can still do private chats (when potentially confidential data is exchanged) for instance, or even exchange files (although that is less common to use in free software communities). There is also still some hierarchy in case of abuse (channel operators can remove users from the chat or even ban them for a while) and one can even quiet a channel when for instance online team meetings are held (although using a different channel for that might be an alternative).

IRC also has the advantage that connecting to the IRC channels has a very low requirement (software-wise): one can use console-only chat clients (in case users cannot get their graphical environment to work – example is irssi) or even webbrowser based ones (if one wants to chat from other systems). Even smartphones have good IRC applications, like AndChat for Android.

IRC is also distributed: an IRC network consists of many interconnected servers that pass on all IRC traffic. If one node goes down, users can access a different node and continue. That makes IRC highly available. IRC network operators do need to try and keep the network from splitting (“netsplit”), which occurs when one part of the distributed network gets segregated from the other part and thus two “independent” IRC networks are formed. When that occurs, IRC operators will try to join them back as fast as possible. I’m not going to explain the details on this; it suffices to understand that IRC is distributed and thus often much more available than the “support chat” sites that vendors provide.

So although IRC looks archaic, it is a very good match for support channel requirements.

Posts for Wednesday, May 15, 2013

Overriding the default SELinux policies

Extending SELinux policies with additional rules is easy. As SELinux uses a deny by default approach, all you need to do is to create a policy module that contains the additional (allow) rules, load that and you’re all set. But what if you want to remove some rules?

Well, sadly, SELinux does not support deny rules. Once an allow rule is loaded in memory, it cannot be overturned anymore. Yes, you can disable the module itself that provides the rules, but you cannot selectively disable rules. So what to do?

Generally, you can disable the module that contains the rules you want to disable, and load a custom module that defines everything the original module did, except for those rules you don’t like. For instance, if you do not want the skype_t domain to be able to read/write to the video device, create your own skype-providing module (myskype) with the exact same content (except for the module name at the first line) as the original skype module, except for the video device:

dev_read_sound(skype_t)
# dev_read_video_dev(skype_t)
dev_write_sound(skype_t)
# dev_write_video_dev(skype_t)

Load in this policy, and you now have the skype_t domain without the video access. You will get post-install failures when Gentoo pushes out an update to the policy though, since it will attempt to reload the skype.pp file (through the selinux-skype package) and fail because it declares types and attributes already provided (by myskype). You can exclude the package from being updated, which works as long as no packages depend on it. Or live with the post-install failure ;-) But there might be a simpler approach: epatch_user.
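The module swap itself comes down to three commands. The sketch below only prints them by default (set RUN=1 to execute them, as root). The function name swap_module is made up, and the default Makefile path is an assumption that differs between distributions and policy types.

```shell
# swap_module ORIG CUSTOM [MAKEFILE]
# Print (or, with RUN=1, execute) the steps for replacing module ORIG
# with a locally maintained copy named CUSTOM (built from CUSTOM.te).
# The default Makefile path is an assumption; adjust it to your system.
swap_module() {
    orig="$1"; custom="$2"
    makefile="${3:-/usr/share/selinux/strict/include/Makefile}"
    for cmd in \
        "make -f $makefile $custom.pp" \
        "semodule -d $orig" \
        "semodule -i $custom.pp"
    do
        if [ "${RUN:-0}" = 1 ]; then
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}
```

Calling swap_module skype myskype from the directory that holds myskype.te shows the build, disable and install steps in that order.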

Recently, I added in support for epatch_user in the policy ebuilds. This allows users to create patches against the policy source code that we use and put them in /etc/portage/patches in the directory of the right category/package. For module patches, the working directory used is within the policy/modules directory of the policy checkout. For base, it is below the policy checkout (in other words, the patch will need to use the refpolicy/ directory base). But because of how epatch_user works, any patch taken from the base will work as it will start stripping directories up to the fourth one.

This approach is also needed if you want to exclude rules from interfaces rather than from the .te file: create a small patch and put it in /etc/portage/patches for the sec-policy/selinux-base package (as this provides the interfaces).
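Putting a patch in place is a matter of copying it to the right directory. A small sketch, assuming the default /etc/portage/patches location; the function name and the patch file name in the usage example are hypothetical.

```shell
# install_policy_patch PACKAGE PATCHFILE [ROOT]
# Copy PATCHFILE into ROOT/sec-policy/PACKAGE/, the directory where
# epatch_user looks for user patches of that package.
# ROOT defaults to /etc/portage/patches.
install_policy_patch() {
    pkg="$1"; patch="$2"
    root="${3:-/etc/portage/patches}"
    dest="$root/sec-policy/$pkg"
    [ -f "$patch" ] || { echo "no such patch: $patch" >&2; return 1; }
    mkdir -p "$dest" && cp "$patch" "$dest/"
}
```

For the skype example this would be install_policy_patch selinux-skype skype-no-video.patch, and for interface changes install_policy_patch selinux-base followed by the patch file. The patch is then applied the next time the package is (re)built.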

Posts for Tuesday, May 14, 2013

Highlevel assessment of Cdorked and Gentoo Hardened/SELinux

With all the reports surrounding Cdorked, I took a look at whether SELinux and/or other Gentoo Hardened technologies could reduce the likelihood that this infection occurs on your system.

First of all, we don’t know yet how the malware gets installed on the server. We do know that the Apache binaries themselves are modified, so the first thing to look at is to see if this risk can be reduced. Of course, using an intrusion detection system like AIDE helps, but even with Gentoo’s qcheck command you can test the integrity of the files:

# qcheck www-servers/apache
Checking www-servers/apache-2.2.24 ...
  * 424 out of 424 files are good

If the binary is modified, this would result in something equivalent to:

Checking www-servers/apache-2.2.24 ...
 MD5-DIGEST: /usr/sbin/apache2
  * 423 out of 424 files are good

I don’t know if the modified binary would otherwise work just fine, as I have not been able to find exact details on the infected binary to analyze this further (in a sandbox environment of course). Also, because we don’t know how the modified binaries get installed, it is not easy to know if binaries that you built yourself are equally likely to be modified/substituted or if the attack checks checksums of the binaries against a known list.

Assuming that it would run, then the infecting malware would need to set the proper SELinux context on the file (if it overwrites the existing binary, then the context is retained, otherwise it gets the default context of bin_t). If the context is wrong, then starting Apache results in:

apache2: Syntax error on line 61 of /etc/apache2/httpd.conf: Cannot load /usr/lib64/apache2/modules/mod_actions.so into server: /usr/lib64/apache2/modules/mod_actions.so: cannot open shared object file: Permission denied

This is because the modified binary stays in the calling domain context (initrc_t). If you use a targeted policy, then this will not present itself as initrc_t is an unconfined domain. But with strict policies, initrc_t is not allowed to read httpd_modules_t. Even worse, the remainder of SELinux protections don’t apply anymore, since with unconfined domains, all bets are off. That is why Gentoo focuses this hard on using a strict policy.

So, what if the binary runs in the proper domain? Well then, from the articles I read, the malware can do a reverse connect. That means that the domain will attempt to connect to an IP address provided by the attacker (in a specifically crafted URL). For SELinux, this means that the name_connect permission is checked:

# sesearch -s httpd_t -c tcp_socket -p name_connect -ACTS
Found 20 semantic av rules:
   allow nsswitch_domain dns_port_t : tcp_socket { name_connect } ; 
DT allow httpd_t port_type : tcp_socket { name_connect } ; [ httpd_can_network_connect ]
DT allow httpd_t ftp_port_t : tcp_socket { name_connect } ; [ httpd_can_network_relay ]
DT allow httpd_t smtp_port_t : tcp_socket { name_connect } ; [ httpd_can_sendmail ]
DT allow httpd_t postgresql_port_t : tcp_socket { name_connect } ; [ httpd_can_network_connect_db ]
DT allow httpd_t oracledb_port_t : tcp_socket { name_connect } ; [ httpd_can_network_connect_db ]
DT allow httpd_t squid_port_t : tcp_socket { name_connect } ; [ httpd_can_network_relay ]
DT allow httpd_t mssql_port_t : tcp_socket { name_connect } ; [ httpd_can_network_connect_db ]
DT allow httpd_t kerberos_port_t : tcp_socket { name_connect } ; [ allow_kerberos ]
DT allow nsswitch_domain ldap_port_t : tcp_socket { name_connect } ; [ authlogin_nsswitch_use_ldap ]
DT allow httpd_t http_cache_port_t : tcp_socket { name_connect } ; [ httpd_can_network_relay ]
DT allow httpd_t http_port_t : tcp_socket { name_connect } ; [ httpd_can_network_relay ]
DT allow httpd_t http_port_t : tcp_socket { name_connect } ; [ httpd_graceful_shutdown ]
DT allow httpd_t mysqld_port_t : tcp_socket { name_connect } ; [ httpd_can_network_connect_db ]
DT allow httpd_t ocsp_port_t : tcp_socket { name_connect } ; [ allow_kerberos ]
DT allow nsswitch_domain kerberos_port_t : tcp_socket { name_connect } ; [ allow_kerberos ]
DT allow httpd_t pop_port_t : tcp_socket { name_connect } ; [ httpd_can_sendmail ]
DT allow nsswitch_domain ocsp_port_t : tcp_socket { name_connect } ; [ allow_kerberos ]
DT allow httpd_t gds_db_port_t : tcp_socket { name_connect } ; [ httpd_can_network_connect_db ]
DT allow httpd_t gopher_port_t : tcp_socket { name_connect } ; [ httpd_can_network_relay ]

So by default, the Apache (httpd_t) domain is allowed to connect to DNS port (to resolve hostnames). All other name_connect calls depend on SELinux booleans (mentioned after it) that are by default disabled (at least on Gentoo). Disabling hostname resolving is not really feasible, so if the attacker uses a DNS port as port that the malware needs to connect to, SELinux will not deny it (unless you use additional networking constraints).
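To get a quick overview of which booleans gate these rules, the sesearch output can be summarized. This is a sketch that only parses text in the format shown above; the function name list_gating_booleans is made up for the example.

```shell
# list_gating_booleans: read sesearch output on stdin and print each
# boolean named in a trailing "[ name ]" once, with the number of rules
# that depend on it. Unconditional rules are skipped.
list_gating_booleans() {
    awk '
        {
            sub(/[ \t]+$/, "")                      # drop trailing blanks
            if (match($0, /\[ [A-Za-z0-9_]+ \]$/)) {
                name = substr($0, RSTART + 2, RLENGTH - 4)
                count[name]++
            }
        }
        END { for (name in count) print count[name], name }
    ' | sort -k2
}
```

It can be fed directly from the query, as in sesearch -s httpd_t -c tcp_socket -p name_connect -ACTS | list_gating_booleans, after which getsebool with the boolean name shows whether it is currently on or off.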

Now, the reverse connect is an interesting feature of the malware, but not the main one. The main focus of the malware is to redirect customers to particular sites that can trick the user into downloading additional (client) malware. Because this is done internally within Apache, SELinux cannot deal with this. As a user, make sure you configure your browser not to trust non-local iframes and such (always do this, not just because there is a possible threat right now). The configuration of Cdorked is kept in a shared memory segment of Apache itself. Of course, since Apache uses shared memory, the malware embedded within will also have access to the shared memory. However, if this shared memory needs to be accessed by third party applications (the malware seems to grant read/write rights to everybody on this segment), SELinux will prevent this:

# sesearch -t httpd_t -c shm -ACTS
Found 2 semantic av rules:
   allow unconfined_domain_type domain : shm { create destroy getattr setattr read write associate unix_read unix_write lock } ; 
   allow httpd_t httpd_t : shm { create destroy getattr setattr read write associate unix_read unix_write lock } ; 

Only unconfined domains and the httpd_t domain itself have access to httpd_t labeled shared memory.

So what about IMA/EVM? Well, those will not help here since IMA checks for integrity of files that were modified offline. As the modification of the Apache binaries is most likely done online, IMA would just accept this.

For now, it seems that a good system integrity approach is the most effective until we know more about how the malware-infected binary is written to the system in the first place (as this is better protected by MAC controls like SELinux).

Posts for Monday, May 13, 2013

SECMARK and SELinux

When using SECMARK, the administrator configures the iptables or netfilter rules to add a label to the packet data structure (on the host itself) that can be governed through SELinux policies. Unlike peer labeling, here the labels assigned to the network traffic are completely locally defined. Consider the following command:

# iptables -t mangle -A INPUT -p tcp --src 192.168.1.2 --dport 443 \
  -j SECMARK --selctx system_u:object_r:myauth_packet_t

With this command, packets that originate from the 192.168.1.2 host and arrive on port 443 (typically used for HTTPS traffic) are marked as myauth_packet_t. SELinux policy writers can then allow domains to receive (or send) this type of packet through the packet class:

# Allow sockets with mydomain_t context to receive packets labeled myauth_packet_t
allow mydomain_t myauth_packet_t:packet recv;

The SELinux policy modules enable this through the corenet_sendrecv_<type>_{client,server}_packets interfaces:

corenet_sendrecv_http_client_packets(mybrowser_t)
# allow mybrowser_t http_client_packet_t:packet { send recv };

As a common rule, packets are marked as client packets or server packets, depending on the role of the domain. In the above example, the domain is a browser, so acts as a web client. So, it needs to send and receive http_client_packet_t. A web server on the other hand would need to send and receive http_server_packet_t. Note that the packets that are sent over the wire do not have any labels assigned to them – this is all local to the system. So even when the source and destination use SELinux with SECMARK, on the source server the packets might be labeled as http_client_packet_t whereas on the target they are seen as http_server_packet_t.

As far as I know, when you want to use SECMARK, you will need to set the contexts with iptables yourself (there is no default labeling), so knowing about the above convention is important.
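To illustrate the client/server convention, the sketch below prints a pair of iptables commands for one service. It only generates text and does not touch the firewall. The function name is made up, the mangle table is taken from the earlier example, and on MLS/MCS-enabled policies the context will likely need a level (such as :s0) appended.

```shell
# secmark_rules NAME PORT ROLE
# Print iptables commands that label TCP traffic for PORT as
# NAME_client_packet_t or NAME_server_packet_t, depending on ROLE.
secmark_rules() {
    name="$1"; port="$2"; role="$3"
    ctx="system_u:object_r:${name}_${role}_packet_t"
    case "$role" in
        # A client talks to a remote PORT: replies arrive from that port.
        client) in_match="--sport"; out_match="--dport" ;;
        # A server listens on the local PORT: requests arrive on that port.
        server) in_match="--dport"; out_match="--sport" ;;
        *) echo "role must be client or server" >&2; return 1 ;;
    esac
    echo "iptables -t mangle -A INPUT -p tcp $in_match $port -j SECMARK --selctx $ctx"
    echo "iptables -t mangle -A OUTPUT -p tcp $out_match $port -j SECMARK --selctx $ctx"
}
```

For a web server, secmark_rules http 80 server yields rules that mark traffic to and from local port 80 as http_server_packet_t, while secmark_rules http 80 client gives the browser-side equivalent with http_client_packet_t.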

Again, Paul Moore has more information about this.

Posts for Sunday, May 12, 2013

Peer labeling in SELinux policy

Allow me to start with an important warning: I don’t have much hands-on experience with the remainder of this post. It’s based on the few resources I found on the Internet and a few tests done locally which I’ve investigated in my attempt to understand SELinux policy writing for networking stuff.

So, with that out of the way, let’s look into peer labeling. As mentioned in my previous post, SELinux supports some more advanced networking security features than the default socket restrictions. I mentioned SECMARK and NetLabel before, but NetLabel is actually part of the family of peer labeling technologies.

With this approach, all participating systems in the network must support the same labeling method. NetLabel supports CIPSO (Commercial IP Security Option), where hosts label their network traffic to be part of a particular “Domain of Interpretation”. The labels are used by the hosts to identify what a packet is meant for. NetLabel, within Linux, is then used to translate those CIPSO labels. SELinux itself labels the incoming sockets based on the NetLabel information and the context of the listening socket, resulting in a context that is governed policy-wise through the peer class. Since this is based on the information in the packet instead of defined on the system itself, this allows remote systems to have a say in how the packets are labeled.

Another peer technology is the Labeled IPSec one. In this case the labels are fully provided by the remote system. I think they are based on the security association within the IPSec setup.

In both cases, in the SELinux policies, three definitions are important to keep an eye out on: interface definitions, node definitions and peer definitions.

Interface definitions allow users to (mainly) set the sensitivity that is allowed to pass the interface. Using semanage interface this can be controlled by the user. One can also assign a different context to the interface – by default, this is netif_t. The permissions that are checked on the traffic are ingress (incoming) and egress (outgoing), and most policies set this through the following call (comment shows the underlying SELinux rules, where tcp_send and tcp_recv are – I think – obsolete):

corenet_tcp_sendrecv_generic_if(something_t)
# allow something_t netif_t:netif { tcp_send tcp_recv egress ingress };

Node definitions define which targets (nodes, which can be IP addresses or subnets) traffic meant for a particular socket is allowed to originate from (recvfrom) or be sent to (sendto). Again, users can define their own node types and manage them using semanage node. The default node type (node_t) I already covered in the previous post; it is allowed by most policies by default through the following call (where tcp_send and tcp_recv are probably deprecated as well):

corenet_tcp_sendrecv_generic_node(something_t)
# allow something_t node_t:node { tcp_send tcp_recv sendto recvfrom };

Finally, peer definitions are based on the labels from the traffic. If the system uses NetLabel, then the target label will always be netlabel_peer_t since the workings of CIPSO are mainly (only?) mapped towards sensitivity labels (in MLS policy). As a result, SELinux always displays the peer as being netlabel_peer_t. In case of Labeled IPSec, this isn’t the case as the peer label is transmitted by the peer itself.

For NetLabel support, policies generally include two methods – one is to support unlabeled traffic (only needed the moment you have support for labeled traffic) and one is to allow the NetLabel’ed traffic:

corenet_all_recvfrom_unlabeled(something_t)
# allow something_t unlabeled_t:peer recv;
corenet_all_recvfrom_netlabel(something_t)
# allow something_t netlabel_peer_t:peer recv;

In case of IPSec for instance, the peer will have a provided label, as is shown by the call for accepting hadoop traffic:

hadoop_recvfrom(something_t)
# allow something_t hadoop_t:peer recv;

However, this alone is not sufficient for labeled IPSec. We also need to allow the domain to send anything towards an IPSec security association. There is an interface called corenet_tcp_recvfrom_labeled that takes two arguments and which, amongst other things, enables sendto towards its association.

corenet_tcp_recvfrom_labeled(some_t, thing_t)
# allow { some_t thing_t} self:association sendto;
# allow some_t thing_t:peer recv;
# allow thing_t some_t:peer recv;
# corenet_tcp_recvfrom_netlabel(some_t)
# corenet_tcp_recvfrom_netlabel(thing_t)

This interface is usually called within a *_tcp_connect() interface for a particular domain, like with the mysql_tcp_connect example:

interface(`mysql_tcp_connect',`
        gen_require(`
                type mysqld_t;
        ')

        corenet_tcp_recvfrom_labeled($1, mysqld_t)
        corenet_tcp_sendrecv_mysqld_port($1) # deprecated
        corenet_tcp_connect_mysqld_port($1)
        corenet_sendrecv_mysqld_client_packets($1)
')

When using peer labeling, the domain that is allowed something is based on the socket context of the application. Also, the rules when using peer labeling are in addition to the rules mentioned before (“standard” networking control): name_bind and name_connect are always checked.

For more information, make sure you check Paul Moore’s blog, such as the egress/ingress information. And if you know of resources that show this in a more practical setting (above is mainly to work with the SELinux policy) I’m all ears.

Posts for Saturday, May 11, 2013

SELinux policy and network controls

Let’s talk about how SELinux governs network streams (and how it reflects this into the policy).

When you don’t do fancy stuff like SECMARK or netlabeling, then the classes that you should keep an eye on are tcp_socket and udp_socket (depending on the protocol). There used to be node and netif as well, but the support (enforcement) for these was removed a while ago for the “old style” network control enforcement. The concepts are still available though, and I believe they take effect when netlabeling is used. But let’s first look at the regular networking aspects.

The idea behind the regular network-related permissions is that you define either daemon-like behavior (which “binds” to a port) or client-like behavior (which “connects” to a port). Consider an FTP daemon (domain ftpd_t) versus an FTP client (example domain ncftp_t).

In case of a daemon, the policy would contain the following (necessary) rules:

corenet_tcp_bind_generic_node(ftpd_t) # Somewhat legacy but still needed
corenet_tcp_bind_ftp_port(ftpd_t)
corenet_tcp_bind_ftp_data_port(ftpd_t)
corenet_tcp_bind_all_unreserved_ports(ftpd_t) # In case of passive mode

This gets translated to the following “real” SELinux statements:

allow ftpd_t node_t:tcp_socket node_bind;
allow ftpd_t ftp_port_t:tcp_socket name_bind;
allow ftpd_t ftp_data_port_t:tcp_socket name_bind;
allow ftpd_t unreserved_port_type:tcp_socket name_bind;
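
The translation above can be mimicked with a small lookup table. This is only an illustration in Python: the table is written by hand from the four rules just shown, the function name expand is made up for this sketch, and the real expansion is performed by the macros in the policy sources.

```python
# Hypothetical lookup table, copied by hand from the rules listed above.
# The real policy build expands these interfaces through its own macros.
EXPANSIONS = {
    "corenet_tcp_bind_generic_node":
        "allow {domain} node_t:tcp_socket node_bind;",
    "corenet_tcp_bind_ftp_port":
        "allow {domain} ftp_port_t:tcp_socket name_bind;",
    "corenet_tcp_bind_ftp_data_port":
        "allow {domain} ftp_data_port_t:tcp_socket name_bind;",
    "corenet_tcp_bind_all_unreserved_ports":
        "allow {domain} unreserved_port_type:tcp_socket name_bind;",
}


def expand(call):
    """Turn 'interface(domain)' into the allow rule listed for it."""
    name, _, rest = call.partition("(")
    domain = rest.rstrip(")").strip()
    return EXPANSIONS[name.strip()].format(domain=domain)


if __name__ == "__main__":
    for interface in EXPANSIONS:
        print(expand(interface + "(ftpd_t)"))
```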

I mentioned corenet_tcp_bind_generic_node as being somewhat legacy. When you use netlabeling, you can define different nodes (a “node” in that case is a label assigned to an IP address or IP subnet) and as such define policy-wise where daemons can bind on (or clients can connect to). However, without netlabel, the only node that you get to work with is node_t which represents any possible node. Also, the use of passive mode within the ftp policy is governed through the ftpd_use_passive_mode boolean.

For a client, the following policy line would suffice:

corenet_tcp_connect_ftp_port(ncftp_t)
# allow ncftp_t ftp_port_t:tcp_socket name_connect;

Well, I lied. Because of how FTP works, if you use active connections, you need to allow the client to bind on an unreserved port, and allow the server to connect to unreserved ports (see the code snippet below), but you get the idea.

corenet_tcp_connect_all_unreserved_ports(ftpd_t)

corenet_tcp_bind_generic_node(ncftp_t)
corenet_tcp_bind_all_unreserved_ports(ncftp_t)

In the past, policy developers also had to include other lines, but these have over time become obsolete (corenet_tcp_sendrecv_ftp_port for instance). These methods defined the ability to send and receive messages on the port, but this is no longer controlled this way. If you need such controls, you will need to look at SELinux and SECMARK (which uses packets with the packet class) or netlabel (which uses the peer class and peer types to send or receive messages from).

And that’ll be for a different post.

Posts for Friday, May 10, 2013

Gentoo metadata support for CPE

Recently, the metadata.xml file syntax definition (the DTD for those that know a bit of XML) has been updated to support CPE definitions. A CPE (Common Platform Enumeration) is an identifier that describes an application, operating system or hardware device using its vendor, product name, version, update, edition and language. This CPE information is used in the CVE releases (Common Vulnerabilities and Exposures) – announcements about vulnerabilities in applications, operating systems or hardware. Not all security vulnerabilities are assigned a CVE number, but this is as close as you get towards a (public) elaborate dictionary of vulnerabilities.
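
To make the structure of such an identifier concrete, here is a minimal Python sketch that splits a CPE in its URI form (cpe:/part:vendor:product:version:update:edition:language) into named fields. The sample identifier and the names Cpe and parse_cpe are assumptions for illustration; the sketch does not handle percent-encoded characters and is not the code used by cvechecker.

```python
from typing import NamedTuple


class Cpe(NamedTuple):
    part: str      # "a" (application), "o" (operating system) or "h" (hardware)
    vendor: str
    product: str
    version: str
    update: str
    edition: str
    language: str


def parse_cpe(uri):
    """Split a CPE URI into its seven fields; missing trailing fields become ''."""
    prefix = "cpe:/"
    if not uri.startswith(prefix):
        raise ValueError("not a CPE URI: %r" % uri)
    fields = uri[len(prefix):].split(":")
    if len(fields) > 7:
        raise ValueError("too many fields in %r" % uri)
    fields += [""] * (7 - len(fields))
    return Cpe(*fields)


if __name__ == "__main__":
    # Made-up sample value, only to show the field layout
    print(parse_cpe("cpe:/a:apache:http_server:2.4.4"))
```

Matching an installed package against a CVE then comes down to comparing the vendor, product and version fields.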

By allowing Gentoo package maintainers to enter (part of) the CPE information in the metadata.xml file, applications that parse the CVE information can now more easily match if software installed on Gentoo is related to a CVE. I had a related post to this not that long ago on my blog and I’m glad this change has been made. With this information at hand, we can start feeding CPE information to the packages and then easily match this with CVEs.

I had a request to “provide” the scripts I used for the previous post. Mind you, these make too many assumptions (and probably wrong ones) for now (and I’m not really planning on updating them as I have different methods for getting information related to CVEs), but I’m planning on integrating CPE data in Gentoo’s packages more and then create a small script that generates a “watchlist” that I can feed to cvechecker. But anyway, here are the scripts.

First, I took all CVE information and put it in a simple CSV file. The CSV is the same one used by cvechecker, so check out the application to see where it fetches the data from (there is a CVE RSS feed and a simple XSL transformation). Second, I create a “hitlist” which generates the CPEs. With the recent change to metadata.xml this step can be simplified a lot. Third, I try to match the CPE data with the CVE data, depending on a given time delay of commits. In other words, you can ask for possible CVE fixes for commits made in the last XXX days.

Posts for Thursday, May 9, 2013

Enabling Kernel Samepage Merging (KSM)

When using virtualization extensively, you will pretty soon hit the limits of your system (at least, the resources on it). When the virtualization is used primarily for testing (such as in my case), the limit is memory. So it makes sense to seek memory optimization strategies on such systems. The first thing to enable is KSM or Kernel Samepage Merging.

This Linux feature looks for memory pages that the applications have marked as being a possible candidate for optimization (sharing) which are then reused across multiple processes. The idea is that, especially for virtualized environments (but KSM is not limited to that), some processes will have the same contents in memory. Without any sharing abilities, these memory pages will be unique (meaning at different locations in your system’s memory). With KSM, such memory pages are consolidated to a single page which is then referred to by the various processes. When one process wants to modify the page, it is “unshared” so that there is no corruption or unwanted modification of data for the other processes.

Such features are not new – VMware has it named TPS (Transparent Page Sharing) and Xen calls it “Memory CoW” (Copy-on-Write). One advantage of KSM is that it is simple to set up and advantageous for other processes as well. For instance, if you host multiple instances of the same service (web service, database, tomcat, whatever) there is a high chance that several of its memory pages are prime candidates for sharing.

I should mention, though, that this sharing is only enabled when the application has marked its memory as such. This is done through the madvise() call, where applications mark the memory with MADV_MERGEABLE, meaning that applications explicitly need to support KSM in order for it to be successful. There is work under way to support transparent KSM (such as UKSM and PKSM) where no madvise calls would be needed anymore. But beyond quickly reading the home pages (or translated home pages in case of UKSM ;-) I have no experience with those projects.
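
As an illustration of what such a marking looks like from the application side, here is a minimal Python sketch. It assumes Python 3.8 or later on Linux (where the mmap module offers madvise() and the MADV_MERGEABLE constant); the function name mergeable_region is made up for this example. The mapping is anonymous and private, since that is the kind of memory KSM works on.

```python
import mmap


def mergeable_region(size=16 * mmap.PAGESIZE):
    """Return an anonymous private mapping flagged MADV_MERGEABLE, or None."""
    flag = getattr(mmap, "MADV_MERGEABLE", None)
    if flag is None:
        return None  # platform or Python build without the constant
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        # Ask the kernel to consider these pages for merging
        region.madvise(flag)
    except (AttributeError, OSError):
        # Older Python, or a kernel built without CONFIG_KSM
        region.close()
        return None
    return region


if __name__ == "__main__":
    region = mergeable_region()
    print("marked as mergeable" if region is not None else "KSM marking unavailable")
```

The caller has to keep the returned mapping around: once it is closed, there is nothing left to merge.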

So let’s get back to KSM. I am currently running three virtual machines (all configured to take at most 1.5 Gb of memory). Together, they take just a little over 1 Gb of memory (sum of their resident set sizes). When I consult KSM, I get the following information:

 # grep -H '' /sys/kernel/mm/ksm/pages_*
/sys/kernel/mm/ksm/pages_shared:48911
/sys/kernel/mm/ksm/pages_sharing:90090
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:123002
/sys/kernel/mm/ksm/pages_volatile:1035

The pages_shared value tells me that 48911 physical pages are in use as shared pages (about 191 Mb), and pages_sharing tells me that 90090 additional references point to them (the kernel documentation describes this counter as how many more sites are sharing the pages, in other words how much is saved). Without KSM each of those references would need its own page, so the gain is 90090 pages (about 352 Mb). Note that the resident set sizes do not take shared pages into account, so this gain has to be subtracted from the sum of the RSS to find the “real” memory consumption. The pages_unshared value tells me that 123002 pages are marked with the MADV_MERGEABLE advice flag but currently have no identical counterpart to merge with.
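
The conversion from page counts to sizes is simple arithmetic. A minimal Python sketch, assuming 4 KiB pages (common on x86, but an assumption here) and using made-up helper names, applied to the counters shown above:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a number of memory pages to MiB."""
    return pages * page_size / (1024 * 1024)


def summarize(stats):
    """Express the KSM page counters in MiB and add the sharing ratio."""
    summary = {name: pages_to_mib(count) for name, count in stats.items()}
    # References per shared page; a higher ratio means better sharing
    summary["sharing_ratio"] = stats["pages_sharing"] / stats["pages_shared"]
    return summary


if __name__ == "__main__":
    # Values taken from the /sys/kernel/mm/ksm output above
    stats = {"pages_shared": 48911, "pages_sharing": 90090, "pages_unshared": 123002}
    for name, value in summarize(stats).items():
        print("%-15s %.2f" % (name, value))
```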

If you want to use KSM yourself, configure your kernel with CONFIG_KSM and start KSM by echo’ing the value “1” into /sys/kernel/mm/ksm/run. That’s all there is to it.

Posts for Wednesday, May 8, 2013

The Linux “.d” approach

Many services on a Linux system use a *.d directory approach to make their configuration easily configurable by other services. This is a remarkably simple yet efficient method for exposing services towards other applications. Let’s look into how this .d approach works.

Take a look at the /etc/pam.d structure: services that are PAM-aware can place their PAM configuration files in this location, without needing any additional configuration steps or registration. Same with /etc/cron.d: applications that need specific cronjobs do not need to edit /etc/crontab directly (with the problem of concurrent access, overwriting changes, etc.) but instead can place their definitions in the cron.d directory.

This approach is getting more traction, as can be seen from the available “dot-d” directories on a system:

$ ls -d /etc/*.d
/etc/bash_completion.d  /etc/ld.so.conf.d  /etc/pam.d          /etc/sysctl.d
/etc/conf.d             /etc/local.d       /etc/profile.d      /etc/wgetpaste.d
/etc/dracut.conf.d      /etc/logrotate.d   /etc/request-key.d  /etc/xinetd.d
/etc/env.d              /etc/makedev.d     /etc/sandbox.d      /etc/cron.d
/etc/init.d             /etc/modprobe.d    /etc/sudoers.d

An application can place its configuration files in these directories, automatically “plugging” it into the operating system and the services that it provides. And the more services adopt this approach, the easier it is for applications to be pluggable within the operating system. Even complex systems such as database systems can easily configure themselves this way. And for larger organizations, this is a very interesting approach.

Consider the need to deploy a database server on a Linux system in a larger organization. Each organization has its standards for file system locations, policies for log file management, etc. With the *.d approach, these organizations only need to put files on the file system (a rather primitive feature that every organization supports) and manage these files instead of using specific, proprietary interfaces to configure the environment. But to properly control this flexibility, a few attention points need to be taken into account.

The first is to use a proper naming convention. If the organization has a data management structure, it might have specific names for services. These names are then used throughout the organization to properly identify owners or responsibilities. When using the *.d directories, these naming conventions also allow administrators to easily know who to contact if a malfunctioning definition is placed. For instance, if a log rotation definition has a wrong entry, a file called mylogrotation does not reveal much information. However, CDBM-postgres-querylogs might reveal that the file is placed there by the customer database management team for a postgresql database. And it isn’t only about knowing who to contact (because that could easily be done by comments as well), but also to ensure no conflicts occur. On a shared database system, it is much more likely that two different teams place a postgresql file (which would overwrite the file already there) unless they use a proper naming convention.

The second is to use something identifying where the file comes from. A best practice when using Puppet for instance is to add in a comment to the file such as the following:

# This file is managed by Puppet through the org-pgsl-def module
# Please do not modify manually

This informs the administrator how the file is put there; you might even want to include version information.

A third one is when the order of configuration entries is important. Most *.d supporting tools do not really care about ordering, but some, like udev, do. When that is the case, the common consensus is to use numbers in the beginning of the file name. The numbers then provide a good ordering of the files.

Not all services already offer *.d functionality, although it isn’t that difficult to provide it as well. Consider the Linux audit daemon, whose rules are managed in the /etc/audit/audit.rules file. Not that flexible, is it? But one can create a /etc/audit/audit.rules.d location and have the audit init script read these files (in alphanumeric order), creating the same functionality.

Given enough service adoption, software distribution can be sufficient to configure an application completely and integrate it with all services used by the operating system. And even services that do not support *.d directories can still be easily wrapped around so that their configuration file itself is generated based on the information in such directories. Consider a hypothetical AIDE configuration, where the aide.conf is generated based on the aide.conf.head, aide.d/* and aide.conf.tail files (similar to how resolv.conf is sometimes managed). The generation is triggered right before aide itself is called (perhaps all in a single script).
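
The head, snippets and tail generation described above can be sketched in a few lines of Python. This is a minimal sketch, not how any existing AIDE package does it: the function name assemble and the snippet file names are made up, and snippets are simply concatenated in alphanumeric order of their file names (which is where numeric prefixes come in handy).

```python
import tempfile
from pathlib import Path


def assemble(head, snippet_dir, tail):
    """Concatenate head, all snippet files (sorted by name) and tail."""
    parts = [Path(head).read_text()]
    for snippet in sorted(Path(snippet_dir).iterdir(), key=lambda p: p.name):
        if snippet.is_file():
            parts.append(snippet.read_text())
    parts.append(Path(tail).read_text())
    return "".join(parts)


if __name__ == "__main__":
    # Throw-away demo layout with hypothetical file names and contents
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "aide.conf.head").write_text("# head\n")
        (base / "aide.conf.tail").write_text("# tail\n")
        (base / "aide.d").mkdir()
        (base / "aide.d" / "20-webserver").write_text("# web server rules\n")
        (base / "aide.d" / "10-database").write_text("# database rules\n")
        print(assemble(base / "aide.conf.head", base / "aide.d",
                       base / "aide.conf.tail"), end="")
```

A wrapper script would write the result to aide.conf right before calling aide itself.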

Such an approach allows full integration:

  • A PAM configuration file is placed, allowing the service authentication to be easily managed by administrators. Changes to the authentication (for instance, switching to LDAP authentication or introducing some trust relation) are done by placing an updated file.
  • A log rotation configuration file is placed, making sure that the log files for the service do not eventually fill the partitions
  • A syslog configuration is provided, allowing for some events to be sent to a different server instead of keeping it local – or perhaps both
  • A cron configuration is stored so that statistics and other house-cleaning jobs for the service can run at night
  • An audit configuration snippet is added to ensure critical commands and configuration files are properly checked
  • Intrusion detection rules are added when needed
  • Monitoring information is placed on the file system, causing additional monitoring metrics to be automatically picked up
  • Firewall definitions are extended based on the snippets placed on the system

etc. And all this by only placing files on the file system. Keep It Simple, and efficient ;-)

Posts for Tuesday, May 7, 2013

Added “predictable network interface” info into the handbook

Being long overdue – like many of our documentation-reported bugs :-( I worked on bug 466262 to update the Gentoo Handbook with information about Network Interface Naming. Of course, the installation instructions have also seen the necessary updates to refer to this change.

With some luck (read: time) I might be able to fix various other documentation-related ones soon. I had some problems with the new SELinux userspace that I wanted to get fixed before, and then I worked on the new SELinux policies as well as trying to figure out how SELinux deals with network related aspects. Hence I saw time fly by at the speed of a neutrino…

BTW, the 20130424 policies are in the tree.

Posts for Monday, May 6, 2013

Overview of Linux capabilities, part 3

In previous posts I talked about capabilities and gave an introduction to how this powerful security feature within Linux can be used (and also exploited). I also covered a few capabilities, so let’s wrap this up with the remainder of them.

CAP_AUDIT_CONTROL
Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filtering rules
CAP_AUDIT_WRITE
Write records to kernel auditing log
CAP_BLOCK_SUSPEND
Employ features that can block system suspend
CAP_MAC_ADMIN
Allow MAC configuration or state changes (implemented for the SMACK LSM)
CAP_MAC_OVERRIDE
Override Mandatory Access Control (implemented for the SMACK LSM)
CAP_NET_ADMIN
Perform various network-related operations:

  • interface configuration
  • administration of IP firewall, masquerading and accounting
  • modify routing tables
  • bind to any address for transparent proxying
  • set type-of-service (TOS)
  • clear driver statistics
  • set promiscuous mode
  • enabling multicasting
  • use setsockopt() for privileged socket operations
CAP_NET_BIND_SERVICE
Bind a socket to Internet domain privileged ports (less than 1024)
CAP_NET_RAW
Use RAW and PACKET sockets, and bind to any address for transparent proxying
CAP_SETPCAP
Allow the process to add any capability from the calling thread’s bounding set to its inheritable set, and drop capabilities from the bounding set (using prctl()) and make changes to the securebits flags.
CAP_SYS_ADMIN
Very powerful capability, includes:

  • Running quota control, mount, swap management, set hostname, …
  • Perform VM86_REQUEST_IRQ vm86 command
  • Perform IPC_SET and IPC_RMID operations on arbitrary System V IPC objects
  • Perform operations on trusted.* and security.* extended attributes
  • Use lookup_dcookie

and many, many more. man capabilities gives a good overview of them.

CAP_SYS_BOOT
Use reboot() and kexec_load()
CAP_SYS_CHROOT
Use chroot()
CAP_SYS_MODULE
Load and unload kernel modules
CAP_SYS_RESOURCE
Another capability with many consequences, including:

  • Use reserved space on ext2 file systems
  • Make ioctl() calls controlling ext3 journaling
  • Override disk quota limits
  • Increase resource limits
  • Override RLIMIT_NPROC resource limits

and many more.

CAP_SYS_TIME
Set system clock and real-time hardware clock
CAP_SYS_TTY_CONFIG
Use vhangup() and employ various privileged ioctl() operations on virtual terminals
CAP_SYSLOG
Perform privileged syslog() operations and view kernel addresses exposed with /proc and other interfaces (if kptr_restrict is set)
CAP_WAKE_ALARM
Trigger something that will wake up the system

Now when you look through the capabilities manual page, you’ll notice it talks about securebits as well. This is an additional set of flags that govern how capabilities are used, inherited etc. System administrators don’t set these flags – they are governed by the applications themselves (when creating threads, forking, etc.). These flags are set on a per-thread level, and govern the following behavior:

SECBIT_KEEP_CAPS
Allow a thread with UID 0 to retain its capabilities when it switches its UIDs to a nonzero (non-root) value. By default, this flag is not set, and even if it is set, it is cleared on an execve call, reducing the likelihood that capabilities are “leaked”.
SECBIT_NO_SETUID_FIXUP
When set, the kernel will not adjust the capability sets when the thread’s effective and file system UIDs are switched between zero (root) and non-zero values.
SECBIT_NOROOT
If set, the kernel does not grant capabilities when a setuid-root program is executed, or when a process with an effective or real UID of 0 (root) calls execve.

Manipulating these bits requires the CAP_SETPCAP capability. Except for the SECBIT_KEEP_CAPS security bit, the others are preserved on an execve() call, and all bits are inherited by child processes (such as when fork() is used).

As a user or admin, you can also see capability-related information through the /proc file system:

 # grep ^Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff

$ grep ^Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff

The values listed therein are bitmasks for the various capabilities. The mask 1FFFFFFFFF holds 37 positions, which match the 37 capabilities known (again, see uapi/linux/capability.h in the kernel sources to see the values of each of the capabilities). Again, pscap can be used to get information about the enabled capabilities of running processes in a more human-readable format. But another tool provided by sys-libs/libcap is interesting to look at as well: capsh. The tool offers many capability-related features, including decoding the status fields:

$ capsh --decode=0000001fffffffff
0x0000001fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,
cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,
cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,
cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,
cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,
cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,
cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,
cap_syslog,35,36
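
The decoding itself is nothing more than walking over the bits of the mask. A minimal Python sketch of the idea (not how capsh is implemented): the name table follows the numbering in the kernel's capability.h, and unlike the capsh output above it also names bits 35 and 36 (CAP_WAKE_ALARM and CAP_BLOCK_SUSPEND). Bits beyond the table are printed as numbers.

```python
# Capability names indexed by their number in include/uapi/linux/capability.h
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend",
]


def decode_caps(mask_hex):
    """Return the names of all capabilities set in a hexadecimal mask."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Unknown (newer) capabilities are reported by number
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else str(bit))
        bit += 1
    return names


if __name__ == "__main__":
    print(",".join(decode_caps("0000001fffffffff")))
```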

Next to fancy decoding, capsh can also launch a shell with reduced capabilities. This makes it a good utility for jailing chroots even more.

Posts for Sunday, May 5, 2013

Overview of Linux capabilities, part 2

As I’ve described capabilities (at a very high level) and talked a bit about how to work with them, I started with a small overview of file-related capabilities. So next up are process-related capabilities (note, this isn’t standard terminology, more a categorization of my own).

CAP_IPC_LOCK
Allow the process to lock memory
CAP_IPC_OWNER
Bypass the permission checks for operations on System V IPC objects (similar to the CAP_DAC_OVERRIDE for files)
CAP_KILL
Bypass permission checks for sending signals
CAP_SETUID
Allow the process to make arbitrary manipulations of process UIDs and create forged UID when passing socket credentials via UNIX domain sockets
CAP_SETGID
Same, but then for GIDs
CAP_SYS_NICE
This capability governs several permissions/abilities, namely to allow the process to

  • change the nice value of itself and other processes
  • set real-time scheduling priorities for itself, and set scheduling policies and priorities for arbitrary processes
  • set the CPU affinity for arbitrary processes
  • apply migrate_pages to arbitrary processes and allow processes to be migrated to arbitrary nodes
  • apply move_pages to arbitrary processes
  • use the MPOL_MF_MOVE_ALL flag with mbind() and move_pages()

The abilities related to page moving, migration and nodes are of importance for NUMA systems, not something most workstations have or need.

CAP_SYS_PACCT
Use acct(), to enable or disable system resource accounting for the process
CAP_SYS_PTRACE
Allow the process to trace arbitrary processes using ptrace(), apply get_robust_list() against arbitrary processes and inspect processes using kcmp().
CAP_SYS_RAWIO
Allow the process to perform I/O port operations, access /proc/kcore and employ the FIBMAP ioctl() operation.

Capabilities such as CAP_KILL and CAP_SETUID are very important to govern correctly, but this post would be rather dull (given that the definitions of the above capabilities can be found in the manual page) if I didn’t talk a bit more about their feasibility. Take a look at the following C application code:

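#define _GNU_SOURCE /* needed for the setresuid() declaration */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/capability.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char ** argv) {
  char *sh_argv[] = { "/bin/sh", NULL };

  /* Capability constants are indexes, not bit flags: check each one separately */
  printf("cap_setuid and cap_setgid: %d\n",
         prctl(PR_CAPBSET_READ, CAP_SETUID, 0, 0, 0) == 1 &&
         prctl(PR_CAPBSET_READ, CAP_SETGID, 0, 0, 0) == 1);
  /* File capabilities of the binary, then those of the running process */
  printf(" %s\n", cap_to_text(cap_get_file(argv[0]), NULL));
  printf(" %s\n", cap_to_text(cap_get_proc(), NULL));
  /* The outcome is printed unconditionally, so a successful call shows "Success" */
  errno = 0;
  setresuid(0, 0, 0);
  printf("setresuid() failed: %s\n", strerror(errno));
  execve("/bin/sh", sh_argv, NULL);
  return 1; /* only reached if execve() fails */
}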

At first sight, it looks like an application to get root privileges (setresuid()) and then spawn a shell. If that application were given CAP_SETUID and CAP_SETGID effectively, it would allow anyone who executes it to automatically get a root shell, wouldn’t it?

$ gcc -o test test.c -lcap
# setcap cap_setuid,cap_setgid+ep test
$ ./test
cap_setuid and cap_setgid: 1
 = cap_setgid,cap_setuid+ep
 =
setresuid() failed: Operation not permitted

So what happened? After all, the two capabilities are set with the +ep flags given. Then why aren’t these capabilities enabled? Well, this binary was stored on a file system that is mounted with the nosuid option. As a result, this capability is not enabled and the application didn’t work. If I move the file to another file system that doesn’t have the nosuid option:

$ /usr/local/bin/test
cap_setuid and cap_setgid: 1
 = cap_setgid,cap_setuid+ep
 = cap_setgid,cap_setuid+ep
setresuid() failed: Operation not permitted

So the capabilities now do get enabled, so why does this still fail? This now is due to SELinux:

type=AVC msg=audit(1367393377.342:4778): avc:  denied  { setuid } for  pid=21418 comm="test" capability=7  scontext=staff_u:staff_r:staff_t tcontext=staff_u:staff_r:staff_t tclass=capability

And if you enable grsecurity’s TPE, we can’t even start the binary to begin with:

$ ./test
-bash: ./test: Permission denied
$ /lib/ld-linux-x86-64.so.2 /home/test/test
/home/test/test: error while loading shared libraries: /home/test/test: failed to map segment from shared object: Permission denied

# dmesg
...
[ 5579.567842] grsec: From 192.168.100.1: denied untrusted exec (due to not being in trusted group and file in non-root-owned directory) of /home/test/test by /home/test/test[bash:4221] uid/euid:1002/1002 gid/egid:100/100, parent /bin/bash[bash:4195] uid/euid:1002/1002 gid/egid:100/100

When all these “security obstacles” are not enabled, then the call succeeds:

$ /usr/local/bin/test
cap_setuid and cap_setgid: 1
 = cap_setgid,cap_setuid+ep
 = cap_setgid,cap_setuid+ep
setresuid() failed: Success
root@hpl tmp # 

This again shows how important it is to regularly review capability-enabled files on the file system, as this is a major security problem that cannot be detected by only looking for setuid binaries, but also that securing a system is not limited to one or a few settings: one always has to take the entire setup into consideration, hardening the system so it becomes more difficult for malicious users to abuse it.

# filecap -a
file                 capabilities
/usr/local/bin/test     setgid, setuid

Posts for Saturday, May 4, 2013

Overview of Linux capabilities, part 1

In the previous posts, I talked about capabilities and how they can be used to allow processes to run in a privileged fashion without granting them full root access to the system. An example given was how capabilities can be leveraged to run ping without granting it setuid root rights. But what are the various capabilities that Linux is, well, capable of?

There are many, and as time goes by, more capabilities are added to the set. The last capability added to the main Linux kernel tree was the CAP_BLOCK_SUSPEND in the 3.5 series. An overview of all capabilities can be seen with man capabilities or by looking at the Linux kernel source code, include/uapi/linux/capability.h. But because you are all lazy, and because it is a good exercise for myself, I’ll go through many of them in this and the next few posts.

For now, let’s look at file related capabilities. As a reminder, if you want to know which SELinux domains are “granted” a particular capability, you can look this up using sesearch. The capability is either in the capability or capability2 class, and is named after the capability itself, without the CAP_ prefix:

$ sesearch -c capability -p chown -A
CAP_CHOWN
Allow making changes to the file UIDs and GIDs.
CAP_DAC_OVERRIDE
Bypass file read, write and execute permission checks. I came across a reddit post that was about this capability not that long ago.
CAP_DAC_READ_SEARCH
Bypass file read permission and directory read/search permission checks.
CAP_FOWNER
This capability governs 5 abilities in one:

  • Bypass permission checks on operations that normally require the file system UID of the process to match the UID of the file (unless already granted through CAP_DAC_READ_SEARCH and/or CAP_DAC_OVERRIDE)
  • Allow to set extended file attributes
  • Allow to set access control lists
  • Ignore directory sticky bit on file deletion
  • Allow specifying O_NOATIME for files in open() and fcntl() calls
CAP_FSETID
Do not clear the setuid/setgid permission bits when a file is modified
CAP_LEASE
Allow establishing leases on files
CAP_LINUX_IMMUTABLE
Allow setting FS_APPEND_FL and FS_IMMUTABLE_FL inode flags
CAP_MKNOD
Allow creating special files with mknod
CAP_SETFCAP
Allow setting file capabilities (what I did with the anotherping binary in the previous post)

When working with SELinux (especially when writing applications), you’ll find that the CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE capabilities come up often. This is the case when applications are written to run as root yet want to scan through, read or even execute non-root owned files. Without SELinux, because these run as root, this is all granted. However, when you start confining those applications, it becomes apparent that they require these capabilities. Another example is when you run user applications as root, like when trying to play a movie or music file with mplayer when this file is owned by a regular user:

type=AVC msg=audit(1367145131.860:18785): avc:  denied  { dac_read_search } for
pid=8153 comm="mplayer" capability=2  scontext=staff_u:sysadm_r:mplayer_t
tcontext=staff_u:sysadm_r:mplayer_t tclass=capability

type=AVC msg=audit(1367145131.860:18785): avc:  denied  { dac_override } for
pid=8153 comm="mplayer" capability=1  scontext=staff_u:sysadm_r:mplayer_t
tcontext=staff_u:sysadm_r:mplayer_t tclass=capability

Notice the time stamp: both checks are triggered at the same time. What happens is that the Linux security hooks first check for DAC_READ_SEARCH (the “lesser” grant of the two) and then for DAC_OVERRIDE (which contains DAC_READ_SEARCH and more). In both cases, the check failed in the above example.
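
When you have to go through many of these denials, it helps to pull out the interesting fields. A minimal Python sketch, assuming the denial has been joined onto a single line and follows the layout shown above; the regular expression and the function name parse_avc are made up for this example and will not cover every AVC message format.

```python
import re

# Matches only the fields used in the denials shown above
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}.*?'
    r'comm="(?P<comm>[^"]+)".*?'
    r'scontext=(?P<scontext>\S+)\s+'
    r'tcontext=(?P<tcontext>\S+)\s+'
    r'tclass=(?P<tclass>\S+)'
)


def parse_avc(line):
    """Return a dict with the denied permissions and contexts, or None."""
    match = AVC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["perms"] = fields["perms"].split()
    return fields


if __name__ == "__main__":
    sample = ('type=AVC msg=audit(1367145131.860:18785): avc:  denied  '
              '{ dac_read_search } for pid=8153 comm="mplayer" capability=2  '
              'scontext=staff_u:sysadm_r:mplayer_t '
              'tcontext=staff_u:sysadm_r:mplayer_t tclass=capability')
    print(parse_avc(sample))
```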

The CAP_LEASE capability is one that I had not heard about before (actually, I had not heard of getting “file leases” on Linux either). A file lease allows the lease holder (which needs this capability for files it does not own) to be notified when another process tries to open or truncate the file. When that happens, the call itself is blocked and the lease holder is notified (usually using SIGIO) about the access. It is not really meant to lock a file (since, if the lease holder doesn’t properly release it, the lease is forcefully “broken” and the other process can continue its work) but rather to give the holder time to properly close the file descriptor, flush caches, etc.
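
For the curious, taking a lease looks roughly like this from Python. This is a minimal sketch, assuming Linux and a file system that supports leases; the function name try_read_lease is made up, and the example uses a file owned by the calling user, the case that fcntl(2) describes as available to unprivileged processes. It only takes and releases the lease, without handling the SIGIO notification.

```python
import fcntl
import os
import tempfile


def try_read_lease(path):
    """Try to take and release a read lease on path; return True on success."""
    setlease = getattr(fcntl, "F_SETLEASE", None)
    if setlease is None:
        return False  # not Linux, or the constant is unavailable
    # A read lease can only be placed on a descriptor opened read-only
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.fcntl(fd, setlease, fcntl.F_RDLCK)
        fcntl.fcntl(fd, setlease, fcntl.F_UNLCK)
        return True
    except OSError:
        # For instance: file open for writing elsewhere, or leases unsupported
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    handle, name = tempfile.mkstemp()
    os.close(handle)
    print("lease obtained" if try_read_lease(name) else "no lease")
    os.unlink(name)
```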

BTW, on my system, only 5 SELinux domains hold the lease capability.

There are 37 capabilities known by the Linux kernel at this time. The above list has 9 file related ones. So perhaps next I can talk about process capabilities.

Posts for Friday, May 3, 2013

Restricting and granting capabilities

Capabilities are a way to run processes with some privileges without having to grant them full root privileges. It is therefore important to understand that they exist, whether you are a system administrator, an auditor or hold another security-related function. Having processes run as a non-root user is no longer sufficient to assume that they do not hold any rights to mess up the system or read files they shouldn’t be able to read.

The grsecurity kernel patch set, which is applied to the Gentoo hardened kernel sources, contains for instance CONFIG_GRKERNSEC_CHROOT_CAPS which, as per its documentation, “restricts the capabilities on all root processes within a chroot jail to stop module insertion, raw i/o, system and net admin tasks, rebooting the system, modifying immutable files, modifying IPC owned by another, and changing the system time.” But other implementations might even use capabilities to restrict the users. Consider LXC (Linux Containers). When a container is started, CAP_SYS_BOOT (the ability to shut down or reboot the system/container) is removed so that users cannot abuse this privilege.

You can also grant capabilities to users selectively, using pam_cap.so (the Capabilities Pluggable Authentication Module). For instance, to allow some users to ping, instead of granting cap_net_raw immediately (+ep), we can assign the capability to some users through PAM, and have the ping binary inherit and use this capability instead (+p). That doesn’t mean that the capability is in effect, but rather that it is in a sort-of permitted set. Applications that are granted a capability this way can use it if the user is allowed to have it, and cannot otherwise.

# setcap cap_net_raw+p anotherping

# vim /etc/pam.d/system-login
... add in something like
auth     required     pam_cap.so

# vim /etc/security/capability.conf
... add in something like
cap_net_raw           user1

The logic used with capabilities can be described as follows (it is not as difficult as it looks):

        pI' = pI
  (***) pP' = fP | (fI & pI)
        pE' = pP' & fE          [NB. fE is 0 or ~0]

  I=Inheritable, P=Permitted, E=Effective // p=process, f=file
  ' indicates post-exec().

So, for instance, the second line reads “The permitted set of capabilities of the process after the exec() is set to the permitted set of capabilities of its executable file, combined (OR) with the result of the AND operation between the inheritable capabilities of the file and the inheritable capabilities of the process before the exec().”
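To play with these three rules, they can be written down as a small Python function. This is only a sketch of the formula as quoted above (it ignores anything the formula does not mention, such as the bounding set); the function name and the use of plain integers as bitmasks are my own choices.

```python
# Capability number of CAP_NET_RAW as listed in <linux/capability.h>
CAP_NET_RAW = 13

def caps_after_exec(p_inheritable, f_permitted, f_inheritable, f_effective):
    """Apply the three rules; sets are int bitmasks, f_effective is a bool."""
    new_inheritable = p_inheritable                                # pI' = pI
    new_permitted = f_permitted | (f_inheritable & p_inheritable)  # pP' = fP | (fI & pI)
    new_effective = new_permitted if f_effective else 0            # pE' = pP' & fE
    return new_inheritable, new_permitted, new_effective

net_raw = 1 << CAP_NET_RAW

# File marked +ep, process without any inheritable capabilities
print(caps_after_exec(0, net_raw, 0, True))        # (0, 8192, 8192)

# File with only the inheritable and effective bits, process holding net_raw as inheritable
print(caps_after_exec(net_raw, 0, net_raw, True))  # (8192, 8192, 8192)

# Same file, but the process has nothing inheritable: nothing is gained
print(caps_after_exec(0, 0, net_raw, True))        # (0, 0, 0)
```

The last two calls show the effect of the (fI & pI) term: the file's inheritable set only contributes when the process carries the same capability in its own inheritable set.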

As an admin, you might want to keep an eye out for binaries that have particular capabilities set. With filecap you can list which capabilities are in the effective set of files found on the file system (for instance, +ep).

# filecap 
file                 capabilities
/bin/anotherping     net_raw

Similarly, with pscap you can see the capabilities set on running processes.

# pscap -a
ppid  pid   name        command           capabilities
6148  6152  root        bash              full

It might be wise to take this up in the daily audit reports.
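If pscap is not at hand, the kernel also exposes the capability sets of a process as hexadecimal masks in /proc/<pid>/status (the CapEff line holds the effective set). The following Python sketch decodes such a mask; the function names are my own, and the table follows the numbering of <linux/capability.h> for the 37 capabilities known at this time.

```python
# Capability names indexed by their number in <linux/capability.h>
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw", "ipc_lock",
    "ipc_owner", "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace",
    "sys_pacct", "sys_admin", "sys_boot", "sys_nice", "sys_resource",
    "sys_time", "sys_tty_config", "mknod", "lease", "audit_write",
    "audit_control", "setfcap", "mac_override", "mac_admin", "syslog",
    "wake_alarm", "block_suspend",
]

def decode_caps(mask):
    """Translate a capability bitmask into a list of capability names."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]

def effective_caps(status_text):
    """Find the CapEff line in the text of a /proc/<pid>/status file and decode it."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    return []

# Hand-written sample resembling the relevant part of a status file
sample = "Name:\tanotherping\nCapEff:\t0000000000002000\n"
print(effective_caps(sample))  # ['net_raw']
```

On a Linux system, passing the content of /proc/self/status to effective_caps() shows what the Python interpreter itself is allowed to do.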

Posts for Thursday, May 2, 2013

Capabilities, a short intro

Capabilities. You have probably heard of them already, but when you start developing SELinux policies, you’ll notice that you come in closer contact with them than before. This is because SELinux, when applications want to do something “root-like”, checks the capability of that application. Without SELinux, this either requires the binary to have the proper capability set, or the application to run as root. With SELinux, the capability also needs to be granted to the SELinux context (the domain in which the application runs).

But forget about SELinux for now, and let’s focus on capabilities. Capabilities in Linux are flags that tell the kernel what the application is allowed to do, but unlike file access, capabilities for an application are system-wide: there is no “target” to which it applies. Think about an “ability” of an application. See for yourself through man capabilities. If you have no additional security mechanism in place, the Linux root user has all capabilities assigned to it. And you can remove capabilities from the root user if you want to, but generally, capabilities are used to grant applications that tiny bit more privileges, without needing to grant them root rights.

Consider the ping utility. It is marked setuid root on some distributions, because the utility requires the (cap)ability to send raw packets. This capability is known as CAP_NET_RAW. However, thanks to capabilities, you can now mark the ping application with this capability and drop the setuid bit from the file. As a result, the application does not run with full root privileges anymore, but with the restricted privileges of the user plus one capability, namely CAP_NET_RAW.

Let’s take this ping example to the next level: copy the binary (possibly relabel it as ping_exec_t if you run with SELinux), make sure it does not hold the setuid and try it out:

# cp ping anotherping
# chcon -t ping_exec_t anotherping

Now as a regular user:

$ ping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.057 ms

$ anotherping -c 1 127.0.0.1
ping: icmp open socket: Operation not permitted

Let’s assign the CAP_NET_RAW capability flag to the binary:

# setcap cap_net_raw+ep anotherping

And tadaa:

$ anotherping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.054 ms

What setcap did was place an extended attribute on the file, which is a binary representation of the capabilities assigned to the application. The additional information (+ep) means that the capability is permitted and effective.
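For the curious: the extended attribute is called security.capability, and its value follows the vfs_cap_data layout from <linux/capability.h> (a little-endian header word followed by pairs of permitted and inheritable masks). The Python sketch below decodes such a value; the function name is my own, it only handles revisions 2 and 3 of the format, and the sample value is built by hand rather than read from a real file.

```python
import struct

# Constants from <linux/capability.h>
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000
CAP_NET_RAW = 13

def decode_file_caps(raw):
    """Decode a security.capability value (revision 2 or 3) into its sets."""
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK
    if revision not in (VFS_CAP_REVISION_2, VFS_CAP_REVISION_3):
        raise ValueError("unsupported capability revision")
    # Two (permitted, inheritable) pairs: low 32 bits first, then high 32 bits
    p_lo, i_lo, p_hi, i_hi = struct.unpack_from("<4I", raw, 4)
    return {
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": p_lo | (p_hi << 32),
        "inheritable": i_lo | (i_hi << 32),
    }

# Hand-built value equivalent to "cap_net_raw+ep" in the revision 2 layout
sample = struct.pack(
    "<5I", VFS_CAP_REVISION_2 | VFS_CAP_FLAGS_EFFECTIVE, 1 << CAP_NET_RAW, 0, 0, 0
)
caps = decode_file_caps(sample)
print(caps["effective"], hex(caps["permitted"]), hex(caps["inheritable"]))
```

On Linux, the real value of a file can be fetched with os.getxattr(path, "security.capability") and passed to the same function.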

So much for the primer, I’ll talk about the various capabilities in a later post.

Posts for Wednesday, May 1, 2013

SELinux mount options

When you read through the Gentoo Hardened SELinux handbook, you’ll notice that we sometimes update /etc/fstab with some SELinux-specific settings. So, what are these settings about and are there more of them?

First of all, let’s look at a particular example from the installation instructions so you see what I am talking about:

tmpfs  /tmp  tmpfs  defaults,noexec,nosuid,rootcontext=system_u:object_r:tmp_t  0 0

What the rootcontext= option does here is set the context of the “root” of that file system (meaning, the context of /tmp in the example) to the specified context before the file system is made visible to userspace. Because this happens so early, the file system is known as tmp_t throughout its life cycle (not just after the mount or so).

Another option that you’ll frequently see on the Internet is the context= option. This option is most frequently used for file systems that do not support extended attributes, and as such cannot store the context of files on the file system. With the context= mount option set, all files on that file system get the specified context. For instance, context=system_u:object_r:removable_t.

If the file system does support extended attributes, you might find some benefit in using the defcontext= option. When set, the context of files and directories (and other resources on that file system) that do not have a SELinux context set yet will use this default context. However, once a context is set, it will use that context instead.

The last context-related mount option is fscontext=. With this option, you set the context of the “filesystem” class object of the file system rather than the mount itself (or the files). Within SELinux, “filesystem” is one of the resource classes that can get a context. Remember the /tmp mount example from before? Well, even though the files are labeled tmp_t, the file system context itself is still tmpfs_t.

It is important to know that, if you use one of these mount options, context= is mutually exclusive to the other options as it “forces” the context on all resources (including the filesystem class).

Posts for Tuesday, April 30, 2013

Qemu-KVM monitor tips and tricks

When running KVM guests, the Qemu/KVM monitor is a nice interface to interact with the VM and perform specific maintenance tasks. If you run the KVM guests with VNC, then you can get to this monitor through Ctrl-Alt-2 (and Ctrl-Alt-1 to get back to the VM display). I personally run with the monitor on the standard input/output of the terminal where the VM is launched, as its output is often large and scrolling in the VNC session doesn’t seem to work well.

I decided to give you a few tricks that I use often on the monitor to handle the VMs.

When I do not start the VNC server associated with the VM by default, I can enable it on the monitor using change vnc while getting details is done using info vnc. To disable VNC again, use change vnc none.

(qemu) info vnc
Server: disabled
(qemu) change vnc 127.0.0.1:20
(qemu) change vnc password
Password: ******
(qemu) info vnc
Server:
     address: 127.0.0.1:5920
        auth: vnc
Client: none

Similarly, if you need to enable remote debugging, you can use the gdbserver command.

Getting information using info is dead-easy, and supports a wide range of categories: balloon info, block devices, character devices, cpus, memory mappings, network information, etcetera etcetera… Just enter info to get an overview of all supported commands.

To easily manage block devices, you can see the current state of devices using info block and then use change <blockdevice> <path> to update it.

(qemu) info block
virtio0: removable=0 io-status=ok file=/srv/virt/gentoo/hardened2selinux/selinux-base.img ro=0 drv=qcow2 encrypted=0 bps=0 bps_rd=0 bps_wr=0 iops=0 iops_rd=0 iops_wr=0
ide1-cd0: removable=1 locked=0 tray-open=0 io-status=ok [not inserted]
floppy0: removable=1 locked=0 tray-open=0 [not inserted]
sd0: removable=1 locked=0 tray-open=0 [not inserted]
(qemu) change ide1-cd0 /srv/virt/media/systemrescuecd-x86-2.2.0.iso

To power down the system, use system_powerdown. If that fails, you can use quit to immediately shut down (terminate) the VM. To reset it, use system_reset. You can also hot-add PCI devices and manipulate CPU states, or even perform live migrations between systems.

When you use the qcow2 image format, you can take a full VM snapshot using savevm and, when you later want to return to this point again, use loadvm. This is interesting when you want to make potentially harmful changes on the system and want to easily revert if things break.

(qemu) savevm 20130419
(qemu) info snapshots
     ID        TAG                 VM SIZE                DATE       VM CLOCK
     1         20130419               224M 2013-04-19 12:05:16   00:00:17.294
(qemu) loadvm 20130419

Posts for Monday, April 29, 2013

No cameras

What is the technology that has had the most impact in the shortest amount of time? Many people would say “the Internet” and not be completely off but I think that one thing trumps even that: Digital photography.

I have never really been a person taking or collecting personal photos: they don’t capture a moment for me and always look like a caricature of the event I remember. But many people love pictures. Pictures of your kids, of the spots you traveled to or of family celebrations. Pictures have been part of our culture for a long time now. Museums are full of pictures from 100 years ago showing us how people lived, how they dressed, what they cherished.

But obviously those pictures all look staged: Taking a photo was time consuming and expensive so you only recorded the most important events. Weddings. The birth of a child (or maybe its christening in Christian households). A 70th birthday.

Technology and better processes made pictures cheaper in the last third of the 20th century but taking them was still kind of a chore: You had a film with 24 or maybe 36 potential pictures and had to fill it up before you could develop it. On the other hand, for travels you needed to make every picture count because you wouldn’t take 10 films with you but just a handful. Pictures were special, precious and more often than not staged (as in fake).

When digital cameras became a valid alternative (quality wise) it created a disruption the likes of which I have not seen since: Even people who might still to this day consider the Internet sort of an annoying fad jumped onto the bandwagon and started taking digital pictures. Obviously convenience was a factor, the idea that you could develop just one picture if you needed a copy. Simple distribution via the Internet was another. But last but not least, the abolishing of scarcity changed photography: No longer would a photographer set people up in a pose for 5 minutes to take 2 (better 3, for safety!) shots. You just shot as a process, an ongoing stream of snapshots of your reality, of what you saw. You could still stage pictures but why would you if you could just capture the actual moment?

But while the technology has been widely embraced, some people are quite unhappy, and I am not referring to those who used to make a living making or selling film rolls.

[Photo: “No cameras”, cc licensed ( BY NC ) flickr photo shared by satanslaundromat]

In the lingo of our civil liberty activists photos and cameras have been closely associated with surveillance. People even started agitating against services like Google’s Street View without any real legal reason [1]. To make it simpler: CAMERA BAD! PRIVACY-HULK SMASH!

But having our elite of privacy and civil liberties experts taint the idea of cameras by universally linking them to surveillance allowed other entities to latch onto the idea. Nowadays it seems normal for shops to ban cameras, for random people to claim that you may not take a picture of whatever item you want to take a picture of or for people wanting to forbid you from using a camera in an obviously public space.

We are creating a world where everybody using a camera is considered suspicious. A breaker of the social contract. Probably a Google agent! The only people left with cameras in the public are cops (because of “safety” and “fight against terror” and “security”).

A few months ago a link was being shared on the Internet. At http://www.paris1914.com/ you could see color pictures from scenes of Paris in 1914 (if you have not yet, check them out, they are beautiful). What do we want the page www.hamburg2013.com to look like? Full of life and examples of how people live there? Full of zeitgeist? Or an empty page with a short remark “No Cameras”?

When talking regulation we always have to balance different stakeholders’ wishes and needs, different rights and interests. But it is essential not to throw out the baby with the bathwater: Even people who believe they are doing the best for everyone can go wrong, and a witch hunt on cameras and people using them is really the last thing we want.

  1. because the front of your house is not at all covered by your privacy rights

The post No cameras appeared first on tante.blog.

photorec to the rescue

Once again PhotoRec has been able to save files from a corrupt FAT USB drive. The application scans the partition, looking for known files (based on the file magic) and then restores those files. The files are not named as they were though, so there is still some manual work left, but the recovery works pretty well:

PhotoRec 6.12, Data Recovery Utility, May 2011
Christophe GRENIER <grenier>

http://www.cgsecurity.org

Disk /dev/sdc1 - 1000 GB / 931 GiB (RO) - WD My Book
     Partition                  Start        End    Size in sectors
     No partition             0   0  1 121600 253 63 1953520002 [Whole disk]


Pass 1 - Reading sector  464342462/1953520002, 10738 files found
Elapsed time 2h46m01s - Estimated time to completion 8h52m25
jpg: 7429 recovered
txt: 961 recovered
mp3: 558 recovered
tx?: 373 recovered
riff: 297 recovered
gif: 218 recovered
exe: 151 recovered
ifo: 126 recovered
mpg: 91 recovered
pdf: 83 recovered
others: 451 recovered

In Gentoo, you can find the package as part of app-admin/testdisk. To recover the files, I ran the following command:

$ photorec /log /d /path/to/recovery/dest /dev/sdc1

While skimming through the recovered files, I found a few that I deleted a long time ago but whose data apparently never got overwritten. Scary to see how easy such recovery is… It reminds me that, if you really want to delete files in a less recoverable manner, you can use shred for that.

And for those out there yelling to back up this data: you’re absolutely correct, but no. I back up my systems and important files daily, but this disk contained (mainly) raw picture images and video recordings. The manipulated, finished images and recordings are backed up (or at least on a disk and somewhere online), but the raw images and recordings are often too large to set up a backup for, and if I really lost them, I wouldn’t shed a tear (nor panic).

Posts for Sunday, April 28, 2013

Securely handling libffi

I recently came across libffi again. No, not because it was mentioned during the Gentoo Hardened online meeting, but because my /var/tmp wasn’t mounted correctly, and emerge (actually python) uses libffi. Most users won’t notice this, because libffi works behind the scenes. But when it fails, it fails badly. And SELinux actually helped me quickly identify what the problem was.

$ emerge --info
segmentation fault

The “ffi” in libffi stands for Foreign Function Interface; it is a library that allows developers to dynamically call code from another application or library. But the way it approaches this concerns me a bit. Let’s look at some strace output:

8560  open("/var/tmp/ffiZ8gKPd", O_RDWR|O_CREAT|O_EXCL, 0600) = 11
8560  unlink("/var/tmp/ffiZ8gKPd")      = 0
8560  ftruncate(11, 4096)               = 0
8560  mmap(NULL, 4096, PROT_READ|PROT_EXEC, MAP_SHARED, 11, 0) = -1 EACCES (Permission denied)

Generally, what libffi does is create a file somewhere it can write files (it checks the various mounts on a system to get a list of possible target file systems), add the necessary data (that it wants to execute) to it, unlink the file from the file system (but keep the file descriptor open, so that the file cannot (easily) be modified on the system anymore) and then map it to memory for executable access. If executing is allowed by the system (for instance because the mount point does not have noexec), then SELinux will trap it because the domain (in our case now, portage_t) is trying to execute an (unlinked) file for which it holds no execute rights:

type=AVC msg=audit(1366656205.201:2221): avc:  denied  { execute } for  
pid=8560 comm="emerge" path=2F7661722F66666962713154465A202864656C6574656429 
dev="dm-3" ino=6912 scontext=staff_u:sysadm_r:portage_t tcontext=staff_u:object_r:var_t
tclass=file

When you notice something like this (an execute on an unnamed file), then this is because the file descriptor points to a file already unlinked from the system. Finding out what it was about might be hard (but with strace it is easy as … well, whatever is easy for you).
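The path= value in the denial above is hex-encoded by the audit subsystem (it does that when a value contains characters such as spaces). A short Python sketch to turn it back into something readable; the function name is my own.

```python
def decode_audit_path(value):
    """Decode a hex-encoded audit field; return it unchanged if it is not hex."""
    try:
        return bytes.fromhex(value).decode("utf-8", errors="replace")
    except ValueError:
        # Not a hex string (for instance a regular, quoted path)
        return value

encoded = "2F7661722F66666962713154465A202864656C6574656429"
print(decode_audit_path(encoded))  # /var/ffibq1TFZ (deleted)
```

The trailing “(deleted)” is what tells you that the file descriptor points to a file that has already been unlinked.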

Now what happened was that, because /var/tmp wasn’t mounted, files created inside it got the standard type (var_t) which the Portage domain isn’t allowed to execute. It is allowed to execute a lot of types, but not that one ;-) When /var/tmp is properly mounted, the file gets the portage_tmp_t type, for which the domain does hold execute rights.

Now generally, I don’t like having world-writeable locations without noexec. For /tmp, noexec is enabled, but for /var/tmp I have (well, had ;-) to allow execution from the file system, mainly because some (many?) Gentoo package builds require it. So how about this dual requirement, of allowing Portage to write (and execute) its own files, and allow libffi to do its magic? Certainly, from a security point of view, I might want to restrict this further…

Well, we need to make sure that the location Portage works in (the location pointed to by $PORTAGE_TMPDIR) is specifically made available for Portage: have the directory only writable by the Portage user. I keep it labeled as tmp_t so that the existing policies apply, but it might work with portage_tmp_t immediately set as well. Perhaps I’ll try that one later. With that set, we can have this mount point set with exec rights (so that libffi can place its file there) in a somewhat more secure manner than allowing exec on world-writeable locations.

So now my /tmp and /var/tmp (and /run and /dev/shm and /lib64/rc/init.d) are tmpfs mounts with the noexec (as well as nodev and nosuid) bits set, with the location pointed to by $PORTAGE_TMPDIR being only really usable by the Portage user:

$ ls -ldZ /var/portage
drwxr-x---. 4 portage root system_u:object_r:tmp_t 4096 Apr 22 21:45 /var/portage/

And libffi? Well, allowing applications to create their own executables and execute them is something that should be carefully governed. I’m not aware of any existing or past vulnerabilities, but I can imagine that opening the ffi* file(s) the moment they come up (to make sure you have a file descriptor) allows you to overwrite the content after libffi has created it but before the application actually executes it. By limiting the locations where applications can write files to (important step one) and the types they can execute (important step two) we can already manage this a bit more. Using regular DAC, this is quite difficult to achieve, but with SELinux, we can actually control this a bit more.

Let’s first see how many domains are allowed to create, write and execute files:

$ sesearch -c file -p write,create,execute -A | grep write | grep create \
  | grep execute | awk '{print $1}' | sort | uniq | wc -l
32

Okay, 32 target domains. Not that bad, and certainly doable to verify manually (hell, even in a scripted manner). You can now check which of those domains have rights to execute generic binaries (bin_t), possibly needed for command execution vulnerabilities or privilege escalation. Or that have specific capabilities. And if you want to know which of those domains use libffi, you can use revdep-rebuild to find out which files are linked to the libffi libraries.
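If you want to do that verification in a scripted manner, a first step could be to collect the domain names rather than just count them. The Python sketch below does roughly what the grep pipeline does; it assumes rule lines of the form “allow source target : class { permissions } ;”, and the three sample rules are made up for illustration, not taken from a real policy.

```python
import re

# Assumed layout of a rule line: allow <source> <target> : <class> { <perms> } ;
RULE_RE = re.compile(r"^\s*allow\s+(\S+)\s+(\S+)\s*:\s*(\S+)\s*\{([^}]*)\}")

def domains_with_perms(lines, tclass, wanted):
    """Return the sorted source domains holding every permission in `wanted` in one rule."""
    found = set()
    for line in lines:
        match = RULE_RE.match(line)
        if match and match.group(3) == tclass and set(wanted) <= set(match.group(4).split()):
            found.add(match.group(1))
    return sorted(found)

# Hypothetical sample rules, only here to exercise the function
sample_rules = [
    "   allow portage_t portage_tmp_t : file { ioctl read write create getattr execute } ;",
    "   allow portage_t bin_t : file { read getattr execute } ;",
    "   allow foo_t foo_tmp_t : file { read write create } ;",
]

print(domains_with_perms(sample_rules, "file", ["write", "create", "execute"]))
```

With the sample rules, only portage_t is listed, as it is the only source domain that holds write, create and execute on the same target type.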

It goes to show that trying to keep your box secure is a never-ending story (please, companies, allow your system administrators to do their job by giving them the ability to continuously increase security rather than have them ask for budget to investigate potential security mitigation directives based on the paradigm of business case and return on investment using pareto-analytics blaaaahhhh….), and that SELinux can certainly be an important method to help achieve it.

Posts for Saturday, April 27, 2013

How logins get their SELinux user context

Sometimes, especially when users are converting their systems to be SELinux-enabled, their user context is wrong. An example would be when, after logon (in permissive mode), the user is in the system_u:system_r:local_login_t domain instead of a user domain like staff_u:staff_r:staff_t.
So, how does a login get its SELinux user context?

Let’s look at the entire chain of SELinux context changes across a boot. At first, when the system boots, the kernel (and all processes invoked from it) runs in the kernel_t domain (I’m going to ignore the other context fields for now until they become relevant). When the kernel initialization has been completed, the kernel executes the init binary. When you use an initramfs, then a script might be called. This actually doesn’t matter that much yet, since SELinux stays within the kernel_t domain until a SELinux-aware init is launched.

When the init binary is executed, init of course starts. But as mentioned, init is SELinux-aware, meaning it will invoke SELinux-related commands. One of these is that it will load the SELinux policy (as stored in /etc/selinux) and then reexecute itself. Because of that, its process context changes from kernel_t towards init_t. This is because the init binary itself is labeled as init_exec_t and a type transition is defined from kernel_t towards init_t when init_exec_t is executed.

Ok, so init now runs in init_t and it goes on with whatever it needs to do. This includes invoking init scripts (which, btw, run in initrc_t because the scripts are labeled initrc_exec_t or with a type that has the init_script_file_type attribute set, and a transition from init_t to initrc_t is defined when such files are executed). When the bootup is finally completed, init launches the getty processes. The commands are mentioned in /etc/inittab:

$ grep getty /etc/inittab
c1:12345:respawn:/sbin/agetty --noclear 38400 tty1 linux
c2:2345:respawn:/sbin/agetty 38400 tty2 linux
...

These binaries are also explicitly labeled getty_exec_t. As a result, the getty (or agetty) processes run in the getty_t domain (because a transition is defined from init_t to getty_t when getty_exec_t is executed). Ok, so gettys run in getty_t. But what happens when a user now logs on to the system?

Well, the gettys invoke the login binary which, you guessed it right, is labeled as something: login_exec_t. As a result (because, again, a transition is defined in the policy), the login process runs as local_login_t. Now the login process invokes the various PAM subroutines which follow the definitions in /etc/pam.d/login. On Gentoo systems, this by default points to the system-local-login definitions, which in turn point to the system-login definitions. And in this definition, especially under the session section, we find a reference to pam_selinux.so:

session         required        pam_selinux.so close
...
session         required        pam_selinux.so multiple open

Now here is where some of the magic starts (see my post on Using pam_selinux to switch contexts for the gritty details). The methods inside the pam_selinux.so binary will look up what the context should be for a user login. For instance, when the root user logs on, it has SELinux checking what SELinux user root is mapped to, equivalent to running semanage login -l:

$ semanage login -l | grep ^root
root                      root                     

In this case, the SELinux user for root is root, but this is not always the case (that login and user are the same). For instance, my regular administrative account maps to the staff_u SELinux user.

Next, it checks what the default context should be for this user. This is done by checking the default_contexts file (such as the one in /etc/selinux/strict/contexts although user-specific overrides can be (and are) placed in the users subdirectory) based on the context of the process that is asking SELinux what the default context should be. In our case, it is the login process running as local_login_t:

$ grep -HR local_login_t /etc/selinux/strict/contexts/*
default_contexts:system_r:local_login_t user_r:user_t staff_r:staff_t sysadm_r:sysadm_t unconfined_r:unconfined_t
users/unconfined_u:system_r:local_login_t               unconfined_r:unconfined_t
users/guest_u:system_r:local_login_t            guest_r:guest_t
users/user_u:system_r:local_login_t             user_r:user_t
users/staff_u:system_r:local_login_t            staff_r:staff_t sysadm_r:sysadm_t
users/root:system_r:local_login_t  unconfined_r:unconfined_t sysadm_r:sysadm_t staff_r:staff_t user_r:user_t
users/xguest_u:system_r:local_login_t   xguest_r:xguest_t

Since we are verifying this for the root SELinux user, the following line of the users/root file is what matters:

system_r:local_login_t  unconfined_r:unconfined_t sysadm_r:sysadm_t staff_r:staff_t user_r:user_t

Here, SELinux looks for the first match in that line that the user has access to. This is defined by the roles that the user is allowed to access:

$ semanage user -l | grep root
root            staff_r sysadm_r

As root is allowed both the staff_r and sysadm_r roles, the first one found in the default context file that matches will be used. So it is not the order in which the roles are displayed in the semanage user -l output that matters, but the order of the contexts in the default context file. In the example, this is sysadm_r:sysadm_t:

system_r:local_login_t  unconfined_r:unconfined_t sysadm_r:sysadm_t staff_r:staff_t user_r:user_t
                        <-----------+-----------> <-------+-------> <------+------> <-----+----->
                                    `- no matching role   `- first (!)     `- second      `- no match
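The selection shown above boils down to “take the first entry whose role the SELinux user is allowed to use”. As a sketch in Python (the function name is my own, and this only mimics the lookup described here, not the actual libselinux code):

```python
def pick_default_context(contexts_line, allowed_roles):
    """Return the first role:type entry whose role the SELinux user may use."""
    # The first field is the context of the asking process (here the login domain)
    _source, *candidates = contexts_line.split()
    for candidate in candidates:
        role, _type = candidate.split(":", 1)
        if role in allowed_roles:
            return candidate
    return None

line = ("system_r:local_login_t  unconfined_r:unconfined_t sysadm_r:sysadm_t "
        "staff_r:staff_t user_r:user_t")

# Roles of the root SELinux user, as shown by semanage user -l
print(pick_default_context(line, {"staff_r", "sysadm_r"}))  # sysadm_r:sysadm_t
```

Using a set for the allowed roles underlines that their order is irrelevant: only the order of the entries in the default context line decides.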

Now that we know what the context should be, this is used for the first execution that the process (still login) will do. So login changes the Linux user (if applicable) and invokes the shell of that user. Because this is the first execution that is done by login, the new context is set (being root:sysadm_r:sysadm_t) for the shell.

And that is why, if you run id -Z, it returns the user context (root:sysadm_r:sysadm_t) if everything works out fine ;-)
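Looking back at the boot chain, the type transitions that got us from kernel_t to local_login_t can be modeled as a simple lookup table keyed on the current domain and the label of the executed file. This Python sketch is a simplification of my own (a real policy also needs the matching allow rules), but it makes the chain easy to replay:

```python
# (current domain, label of the executed file) -> new domain
TRANSITIONS = {
    ("kernel_t", "init_exec_t"): "init_t",
    ("init_t", "initrc_exec_t"): "initrc_t",
    ("init_t", "getty_exec_t"): "getty_t",
    ("getty_t", "login_exec_t"): "local_login_t",
}

def domain_after_exec(domain, file_type):
    """Without a matching transition rule the process stays in its current domain."""
    return TRANSITIONS.get((domain, file_type), domain)

def walk(start, executed_types):
    """Replay a series of executions and return every domain along the way."""
    domain = start
    path = [domain]
    for file_type in executed_types:
        domain = domain_after_exec(domain, file_type)
        path.append(domain)
    return path

print(" -> ".join(walk("kernel_t", ["init_exec_t", "getty_exec_t", "login_exec_t"])))
```

The final step, from local_login_t to the user's own context, is deliberately not in the table: it is not a type transition but the result of pam_selinux.so setting the context for the next execution.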

Posts for Friday, April 26, 2013

New SELinux userspace release

A new release of the SELinux userspace utilities was recently announced. I have made the packages for Gentoo available and they should now be in the main tree (~arch of course). During the testing of the packages however, I made a stupid mistake of running the tests on the wrong VM, one that didn’t contain the new packages. Result: no regressions (of course). My fault for not using in-ebuild tests properly, as I should. So you’ll probably see me blogging about the in-ebuild testing soon ;-)

In any case, the regressions I did find (quite fast after I updated my main laptop with them as well) were a missing function in libselinux, a reference to a non-existing makefile when using “semanage permissive” and the new sepolicy application requiring yum python bindings. At least, with the missing function (hopefully correctly) resolved, all tests I usually do (except for the permissive domains) are now running well again.

This only goes to show how important testing is. Of course, I reported the bugs on the mailing list of the userspace utilities as well. Hopefully they can look at them while I’m asleep so I can integrate fixes tomorrow more easily ;-)

Posts for Thursday, April 25, 2013

Gentoo protip: using buildpkgonly

If you don’t want the majority of builds to run while you are busy on the system, but you also don’t want software to be installed automatically while you are away from your desk, then perhaps you can settle for using binary packages. I’m not saying you need to set up a build server and such or do your updates first in a chroot.

No, what this tip is for is to use the --buildpkgonly parameter of emerge at night, building some of your software (often the larger packages) as binary packages only (storing those in the ${PKGDIR} which defaults to /usr/portage/packages). When you are then on your system, you can run the update with binary packages included:

# emerge -puDk world

To use --buildpkgonly, all package(s) that Portage wants to build must have all their dependencies met. If not, then the build will not go through and you’re left with no binary packages at all. So what we do is create a script that looks at the set of packages that would be built, and then builds the binary packages one by one.

When run, the script will attempt to build binary packages for those packages whose dependencies are all met already. Those builds will then create a binary package but will not be merged on the system. When you later update your system, including binary packages, those packages that have been built during the night will be merged quickly, reducing the build load on your system while you are working on it.

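#!/bin/sh

LIST=$(mktemp)

# Collect the names of the packages that a world update would build
emerge -puDN --color=n --columns --quiet=y world | awk '{print $2}' > "${LIST}"

for PACKAGE in $(cat "${LIST}")
do
  printf "Building binary package for %s... " "${PACKAGE}"
  # Test the exit status directly; [[ ]] is not available in a plain /bin/sh
  if emerge -uN --quiet-build --quiet=y --buildpkgonly "${PACKAGE}"
  then
    echo "ok"
  else
    echo "failed"
  fi
done

# Clean up the temporary package list
rm -f "${LIST}"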

I ran this a couple of days ago when -uDN world showed 46 package updates (including a few hefty ones like chromium). After running this script, 35 of them had a binary package ready so the -uDN world now only needed to build 11 packages, merging the remainder from binary packages.

Posts for Wednesday, April 24, 2013

Using strace to troubleshoot SELinux problems

When SELinux is playing tricks on you, you can just “allow” whatever it wants to do, but that is not always an option: sometimes, there is no denial in sight because the problem lies within SELinux-aware applications (applications that might change their behavior based on what the policy says or even on whether SELinux is enabled or not). At other times, you get strange behavior whose cause isn’t directly visible. But mainly, if you want to make sure that allowing something is correct (and not just a corrective action), you need to be absolutely certain that what you want to allow is security-wise acceptable.

To debug such issues, I often use the strace command on the application at hand. To use strace, I toggle the allow_ptrace boolean (strace uses ptrace() which, by default, isn’t allowed policy-wise) and then run the offending application through strace (or attach to the running process if it is a daemon). For instance, to debug a tmux issue we had with the policy not that long ago:

# setsebool allow_ptrace on
# strace -o strace.log -f -s 256 tmux

The resulting log file (strace.log) might seem daunting at first. What you see are the system calls that the process is performing, together with their arguments but also the return value of each call. This is especially important as SELinux, if it denies something, often makes the call return something like EACCES (Permission denied).

7313  futex(0x349e016f080, FUTEX_WAKE_PRIVATE, 2147483647) = 0
7313  futex(0x5aad58fd84, FUTEX_WAKE_PRIVATE, 2147483647) = 0
7313  stat("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
7313  stat("/home", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
7313  stat("/home/swift", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
7313  stat("/home/swift/.pki", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
7313  stat("/home/swift/.pki/nssdb", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
7313  statfs("/home/swift/.pki/nssdb", 0x3c3cab6fa50) = -1 EACCES (Permission denied)

Most (if not all) of the system calls shown in a strace log are documented through man pages, so you can quickly find out that futex() is about fast user-space locking, stat() (use man 2 stat to see the information about the system call instead of the application) is about getting file status and statfs() is for getting file system statistics.
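To narrow a large log down to the calls that were refused, a small filter helps. The following is only a sketch: the function names are my own, and it assumes the log was written with strace -f -o so that each line looks like the sample shown earlier.

```shell
# Print every system call in a strace log that failed with EACCES.
strace_denied() {
  # $1: path to the strace log file
  grep -E '= -1 EACCES' "$1"
}

# Same, but reduced to the call name and its first quoted argument
# (usually the path that was refused).
strace_denied_paths() {
  strace_denied "$1" |
    sed -n 's/^[0-9]* *\([a-z_0-9]*\)("\([^"]*\)".*/\1 \2/p'
}
```

Running strace_denied_paths strace.log then gives one "call path" pair per refused call, which is a convenient starting point for matching against the AVC denials in the audit log.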

The most common permission issues you’ll find are file related:

7313  open("/proc/filesystems", O_RDONLY) = -1 EACCES (Permission denied)

In the above case, you notice that the application is trying to open the /proc/filesystems file read-only. In the SELinux logs, this might be displayed as follows:

audit.log:type=AVC msg=audit(1365794728.180:3192): avc:  denied  { read } for  
pid=860 comm="nacl_helper_boo" name="filesystems" dev="proc" ino=4026532034 
scontext=staff_u:staff_r:chromium_naclhelper_t tcontext=system_u:object_r:proc_t tclass=file

Now the case of tmux before was not an obvious one. In the end, I compared the strace outputs of two runs (one in enforcing and one in permissive mode) to find what the difference would be. This is the result:

Enforcing:

10905 fcntl(9, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) 
10905 fcntl(9, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0

Permissive:

10905 fcntl(9, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE) 
10905 fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE) = 0

You notice the difference? In enforcing mode, the file descriptor has the O_RDONLY flag, whereas in permissive mode it has O_RDWR. This means that the file descriptor is read-only in enforcing mode and read-write in permissive mode. What we then do in the strace logs is to see where this file descriptor (with id=9) comes from:

10905 dup(0)     = 9
10905 dup(1)     = 10
10905 dup(2)     = 11

As the man page says, dup() duplicates a file descriptor. And because, by convention, the first three file descriptors of an application correspond with standard input (0), standard output (1) and error output (2), we now know that the file descriptor with id=9 comes from the standard input file descriptor. Although this one should be read-only (it is the input that the application gets = reads), it seems that tmux might want to use this for writes as well. And that is what happens – tmux sends the file descriptor to the tmux server to check if it is a tty and then uses it to write to the screen.
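Comparing two strace logs by eye is tedious because PIDs and memory addresses differ between runs. A possible way to automate the comparison is sketched below; the function names are hypothetical, and replacing every hexadecimal value by a placeholder is a simplification of mine (it also hides numeric return values such as 0x8000, although the decoded flags remain visible).

```shell
# Normalise a strace log so that two runs can be compared line by line:
# drop the leading PID and replace hexadecimal values by a placeholder.
strace_normalise() {
  sed -e 's/^[0-9][0-9]* *//' -e 's/0x[0-9a-f][0-9a-f]*/ADDR/g' "$1"
}

# Show the lines that differ between an enforcing and a permissive run.
strace_compare() {
  # $1: log from the enforcing run, $2: log from the permissive run
  a=$(mktemp) && b=$(mktemp) || return 1
  strace_normalise "$1" > "$a"
  strace_normalise "$2" > "$b"
  rc=0
  diff "$a" "$b" || rc=$?
  rm -f "$a" "$b"
  return $rc
}
```

Called as strace_compare enforcing.log permissive.log, this would surface the O_RDONLY versus O_RDWR difference on the fcntl() lines while leaving out the identical dup() calls.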

Now what does that have to do with SELinux? It has to mean something, otherwise running in permissive mode would give the same result. After some investigation, we found out that using newrole to switch roles changes the flags of the standard input (as then provided by newrole) from O_RDWR to O_RDONLY (code snippet from newrole.c – look at the first call to open()):

/* Close the tty and reopen descriptors 0 through 2 */
if (ttyn) {
        if (close(fd) || close(0) || close(1) || close(2)) {
                fprintf(stderr, _("Could not close descriptors.\n"));
                goto err_close_pam;
        }
        fd = open(ttyn, O_RDONLY | O_NONBLOCK);
        if (fd != 0)
                goto err_close_pam;
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_NONBLOCK);
        fd = open(ttyn, O_RDWR | O_NONBLOCK);
        if (fd != 1)
                goto err_close_pam;
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_NONBLOCK);
        fd = open(ttyn, O_RDWR | O_NONBLOCK);
        if (fd != 2)
                goto err_close_pam;
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_NONBLOCK);
}

Such obscure problems are much easier to detect and troubleshoot thanks to tools like strace.

Posts for Tuesday, April 23, 2013

SLOT’ing the old swig-1

The SWIG tool helps developers in building interfaces/libraries that can be accessed from many other languages than the ones the library is initially written in or for. The SELinux userland utility setools uses it to provide Python and Ruby interfaces even though the application itself is written in C. Sadly, the tool currently requires swig-1 for its building of the interfaces and uses constructs that do not seem to be compatible with swig-2 (same with the apse package, btw).

I first tried to patch setools to support swig-2, but eventually found regressions in the libapol library it provides so the patch didn’t work out (that is why some users mentioned that a previous setools version did build with swig – yes it did, but the result wasn’t correct). Recently, a post on Google Plus’ SELinux community showed me that I wasn’t wrong in this matter (it really does require swig-1 and doesn’t seem to be trivial to fix).

Hence, I have to fix the Gentoo build problem where one set of tools requires swig-1 and another swig-2. Otherwise world updates and even building stages for SELinux systems would fail as Portage finds incompatible dependencies. One way to approach this is to use Gentoo’s support for SLOTs. When a package (ebuild) in Gentoo defines a SLOT, it tells the package manager that a different version of the same package may be installed alongside it, as long as that version has a different SLOT. In case of swig, the idea is to give swig-1 a different slot than swig-2 (which uses SLOT="0") and make sure that both do not install the same files (otherwise you get file collisions).

Luckily, swig places all of its files except for the swig binary itself in /usr/share/swig/<version>, so all I had left to do was to make sure the binary itself is renamed. I chose to use swig1.3 (similar to how ruby, python and, for some packages, even java are handled on Gentoo). The result (through bug 466650) is now in the tree, as well as an adapted setools package that uses the new swig SLOT.

Thanks to Samuli Suominen for getting me on the (hopefully ;-) right track. I don’t know why I was afraid of doing this, it was much less complex than I thought (now let’s hope I didn’t break other things ;-)

Posts for Monday, April 22, 2013

Mitigating DDoS attacks

Lately, DDoS attacks have been in the news more than I was hoping for. It seems that the botnets or other methods that are used to generate high-volume traffic to a legitimate service are becoming easier and easier to get and direct. At the time that I’m writing this post (a few days before it’s published though), the popular Reddit site is undergoing a DDoS attack which I hope will be finished (or mitigated) soon.

But what can a service do against DDoS attacks? After all, DDoS is like gasping for air if you can’t swim and are (almost) drowning: the air is the legitimate traffic, but the water is overwhelming and your mouth, pharynx and trachea just aren’t made to deal with this properly. And unlike specific Denial-of-Service attacks that use a vulnerability or malcrafted URL, you cannot just install some filter or upgrade a component to be safe again.

Methods for mitigating DDoS attacks come in all sorts of “classes”. I leave out simply increasing your bandwidth: that is very expensive, and the botnets involved can go up to 130 Gbps, which is probably not a bandwidth you are willing to pay for if the legitimate services on your site have enough with 10 Mbps. The methods that come to mind are…

Configure your servers and services so that they stay alive under pressure. Look for the sweet spot where the performance of the services is still stable and beyond which a higher load means performance degradation. If you have some experience with load testing, you know that throughput on a service initially goes up linearly with the load (first phase). Then, it slows down (but still rises – phase 2) up to a point that, when you increase the load even further just a bit, the service degrades (and sometimes doesn’t even get back on its feet when you remove the additional load again – phase 3). You need to look for the spot where load and performance are stable (somewhere in the middle of the second phase) and configure your systems so that additional load is dropped. Yes, this means that the DDoS will be more effective, but it also means that your systems can easily get back on their feet when the attack has finished (and you get a more predictable load and consequences).

Investigate if you can have a backup service that has a higher throughput ability (with reduced functionality). If the DDoS attack focuses on the system resources rather than network resources involved, such a backup “lighter” service can be used to still provide basic functionality (for instance a more static website), but even in case of network resource consumption it can have the advantage that the network consumption of your servers (while replying to the requests) is lower.

Depending on the service you offer (and financial means you have at your disposal) you can look at redirecting traffic to more specialized services. Companies like Prolexic have systems that “scrub” the DDoS traffic from all traffic and only send legitimate requests to your systems. There are several methods for redirecting load, but a common one is to change the DNS records for your service(s) to point to the addresses of those specialized services instead. The lower the TTL (Time To Live) is of the records, the faster the redirect might take place. If you want to be able to handle an increase in load without specialized services, you might want to be able to redirect traffic to cloud services (where you host your service as well) which are generally capable of handling higher throughput than your own equipment (but this too comes at an additional cost).

Some people mention that you can switch IP address. This is true only if the DDoS attack is targeting IP addresses and not (DNS-resolved) URIs. You could set up additional IP addresses that are not registered in DNS (yet) and during the attack, extend the service resolving towards the additional addresses as well. If you do not notice a load spread of the DDoS attack towards the new addresses, you can remove the old addresses from DNS. But again, this won’t work generally – not only are most DDoS attacks using DNS-resolved URIs, most of the time attackers are actively involved in the attack and will quickly notice if such a “failover” has occurred (and react against it).

Depending on your relationship with your provider or location service, you can ask if the edge routers (preferably those of the ISP) can have fallback source filtering rules available to quickly enable. Those fallback rules would then only allow traffic from networks that you know most (all?) of your customers and clients are at. This isn’t always possible, but if you have a service that targets mainly people within your country, have the filter only allow traffic from networks of that country. If the DDoS attack uses geographically spread resources, it might be that the number of bots inside those allowed networks is low enough that your service can continue.

Configure your firewalls (and ask that your ISP does the same) to drop traffic that you do not expect. If the services on your architecture do not use external DNS, then you can drop incoming DNS response packets (a popular DDoS method, called a DNS reflection attack, sends queries with a spoofed source address to open DNS resolvers, which then send their responses to the victim).

And finally, if you are not bound to a single data center, you might want to spread services across multiple locations. Although more difficult from a management point of view, a dispersed/distributed architecture allows other services to continue running while one is being attacked.
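As an illustration of the firewall item above, dropping unsolicited DNS responses could look like the rules generated below. This is a sketch under assumptions: the interface name eth0 is a placeholder, the function only prints the iptables commands instead of applying them, and it presumes that the host really never queries external resolvers (otherwise legitimate answers are dropped too).

```shell
# Print (not apply) iptables rules that drop incoming DNS responses.
# Assumption: this host never sends DNS queries to external resolvers,
# so any packet arriving *from* port 53 is unexpected.
dns_reflection_rules() {
  # $1: external interface name (hypothetical default: eth0)
  iface=${1:-eth0}
  for proto in udp tcp; do
    printf 'iptables -A INPUT -i %s -p %s --sport 53 -j DROP\n' "$iface" "$proto"
  done
}

dns_reflection_rules eth0
```

Keep in mind that filtering on your own host only protects the services behind it; the traffic still fills your uplink, which is why asking the ISP to apply the same kind of filter matters.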

Posts for Sunday, April 21, 2013

Introducing selocal for small SELinux policy enhancements

When working with a SELinux-enabled system, administrators will eventually need to make small updates to the existing policy. Instead of building their own full policy (always an option, but most likely not maintainable in the long term) one or more SELinux policy modules are created (most distributions use a modular approach to the SELinux policy development) which are then loaded on their target systems.

In the past, users had to create their own policy module by creating (and maintaining) the necessary .te file(s), building the proper .pp files and loading them in the active policy store. In Gentoo, from policycoreutils-2.1.13-r11 onwards, a script is provided to the users that hopefully makes this a bit more intuitive for regular users: selocal.

As the name implies, selocal aims to provide an interface for handling local policy updates that do not need to be packaged or distributed otherwise. It is a command-line application that you feed single policy rules, one at a time. Each rule can be accompanied by a single-line comment, making it obvious to the user why the rule was added in the first place.

# selocal --help
Usage: selocal [<command>] [<options>] <rule>

Command can be one of:
  -l, --list            List the content of a SELinux module
  -a, --add             Add an entry to a SELinux module
  -d, --delete          Remove an entry from a SELinux module
  -M, --list-modules    List the modules currently known by selocal
  -u, --update-dep      Update the dependencies for the rules
  -b, --build           Build the SELinux module (.pp) file (requires privs)
  -L, --load            Load the SELinux module (.pp) file (requires privs)

Options can be one of:
  -m, --module <module>         Module name to use (default: selocal)
  -c, --comment <comment>       Comment (with --add)

The option -a requires that a rule is given, like so:
  selocal -a "dbadm_role_change(staff_r)"
The option -d requires that a line number, as shown by the --list, is given, like so:
  selocal -d 12

Let’s say that you need to launch a small script you have written as a daemon, but you want this to run while you are still in the staff_t domain (it is a user-sided daemon you use personally). As regular staff_t isn’t allowed to have processes bind on generic ports/nodes, you need to enhance the SELinux policy a bit. With selocal, you can do so as follows:

# selocal --add "corenet_tcp_bind_generic_node(staff_t)" --comment "Launch local webserv.py daemon"
# selocal --add "corenet_tcp_bind_generic_port(staff_t)" --comment "Launch local webserv.py daemon"
# selocal --build --load
(some output on building the policy module)

When finished, the local policy is enhanced with the two mentioned rules. You can query which rules are currently stored in the policy:

# selocal --list
12: corenet_tcp_bind_generic_node(staff_t) # Launch local webserv.py daemon
13: corenet_tcp_bind_generic_port(staff_t) # Launch local webserv.py daemon

If you need to delete a rule, just pass the line number:

# selocal --delete 13

Having this tool around also makes it easier to test out changes suggested through bug reports. When I test such changes, I add the bug report ID as the comment so I can track which settings are still local and which ones have been pushed to our policy repository. Under the hood, selocal creates and maintains the necessary policy file in ~/.selocal and by default uses the selocal policy module name.
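For comparison, the manual route that selocal replaces starts with writing a .te source file by hand. The sketch below generates such a file in reference-policy style; the module name mylocal, the function name and the layout are my own illustration and not necessarily the format that selocal itself stores in ~/.selocal.

```shell
# Emit a minimal reference-policy style .te source for a local module.
# Illustration only: not the file format used by selocal internally.
local_module_te() {
  # $1: module name, $2: space-separated types to require, rest: rules
  name=$1; types=$2; shift 2
  printf 'policy_module(%s, 1.0)\n\n' "$name"
  printf 'gen_require(`\n'
  for t in $types; do printf '\ttype %s;\n' "$t"; done
  printf "')\n\n"
  for rule in "$@"; do printf '%s\n' "$rule"; done
}

local_module_te mylocal "staff_t" \
  "corenet_tcp_bind_generic_node(staff_t)" \
  "corenet_tcp_bind_generic_port(staff_t)"
```

The resulting file would still have to be compiled into a .pp file with your distribution's policy build tooling and loaded with semodule -i, which is exactly the bookkeeping that selocal takes off your hands.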

I hope this tool helps users with their quest on using SELinux. Feedback and comments are always appreciated! It is a small bash script and might still have a few bugs in it, but I have been using it for a few months so most quirks should be handled.

Posts for Saturday, April 20, 2013

VTemplate: a web project boilerplate which combines various industry standards

You’re about to start setting up the delivery mechanism for a web-based project. What do you do?

First, let’s fetch ourselves a framework. Not just any framework, but one which supports PSR-0 and encourages freedom in our domain code architecture. Kohana fits the bill nicely.

Let’s set up our infrastructure now: add Composer and Phing. After setting them up, let’s configure Composer to pull in PHPSpec2 and Behat along with Mink so we can do BDD. Oh yes, and Swiftmailer too, because what web-app nowadays doesn’t need a mailing library?

Still not yet done, let’s pull in Mustache so that we can do sane frontend development, and merge it in with KOstache. Now we can pull the latest HTML5BoilerPlate and shift its files to the appropriate template directories.

Finally, let’s set up some basic view auto loading and rendering for rapid frontend development convenience, and various drivers to hook up to our domain logic. As a finishing touch let’s convert those pesky CSS files into Stylus.

Phew! Wouldn’t it be great if all this was done already for us? Here’s where I introduce vtemplate – a web project boilerplate which combines various industry standards. You can check it out on GitHub.

It’s a little setup I use myself and is project agnostic enough that I can safely use it as a starting point for any of my current projects. Fully open-source, guaranteed by 100s of frontend designers, and by good PHP developers – so go ahead and check it out!

Transforming GuideXML to DocBook

I recently committed an XSL stylesheet that allows us to transform the GuideXML documents (both guides and handbooks) to DocBook. This isn’t part of a more elaborate move to try and push DocBook instead of GuideXML for the Gentoo Documentation though (I’d rather direct documentation development more to the Gentoo wiki instead once translations are allowed): instead, I use it to be able to generate our documentation in other formats (such as PDF but also ePub) when asked.

If you’re not experienced with XSL: XSL stands for Extensible Stylesheet Language and can be seen as a way of “programming” in XML. A stylesheet allows developers to transform one XML document towards another format (either another XML, or as text-like output like wiki) while manipulating its contents. In case of documentation, we try to keep as much structure in the document as possible, but other uses could be to transform a large XML with only a few interesting fields towards a very small XML (only containing those fields you need) for further processing.

For now (and probably for the foreseeable future), the stylesheet is to be used in an offline mode (we are not going to provide auto-generated PDFs of all documents) as the process to convert a document from GuideXML to DocBook to XSL-FO to PDF is quite resource-intensive. But users that are interested can use the stylesheet as linked above to create their own PDFs of the documentation.

Assuming you have a checkout of the Gentoo documentation, this process can be done as follows (example for the AMD64 handbook):

$ xsltproc docbook.xsl /path/to/handbook-amd64.xml > /somewhere/handbook-amd64.docbook
$ cd /somewhere
$ xsltproc --output handbook-amd64.fo --stringparam paper.type A4 \
  /usr/share/sgml/docbook/xsl-stylesheets/fo/docbook.xsl handbook-amd64.docbook
$ fop handbook-amd64.fo handbook-amd64.pdf

The docbook stylesheets are offered by the app-text/docbook-xsl-stylesheets package whereas the fop command is provided by dev-java/fop.

I have an example output available (temporarily) at my dev space (amd64 handbook) but I’m not going to maintain this for long (so the link might not work in the near future).

Posts for Friday, April 19, 2013

Comparing performance with sysbench: performance analysis

So in the past few posts I discussed how sysbench can be used to simulate some workloads, specific to a particular set of tasks. I used the benchmark application to look at the differences between the guest and host on my main laptop, and saw a major performance regression with the memory workload test. Let’s view this again, using parameters more optimized to view the regressions:

$ sysbench --test=memory --memory-total-size=32M --memory-block-size=64 run
Host:
  Operations performed: 524288 (2988653.44 ops/sec)
  32.00 MB transferred (182.41 MB/sec)

Guest:
  Operations performed: 524288 (24920.74 ops/sec)
  32.00 MB transferred (1.52 MB/sec)

$ sysbench --test=memory --memory-total-size=32M --memory-block-size=32M run
Host:
  Operations performed: 1 (  116.36 ops/sec)
  32.00 MB transferred (3723.36 MB/sec)

Guest:
  Operations performed: 1 (   89.27 ops/sec)
  32.00 MB transferred (2856.77 MB/sec)

From looking at the code (gotta love Gentoo for making this obvious ;-) we know that the memory workload, with a single thread, does something like the following:

total_bytes = 0;
repeat until total_bytes >= memory-total-size:
  thread_mutex_lock()
  total_bytes += memory-block-size
  thread_mutex_unlock()
  
  (start event timer)
  pointer -> buffer;
  while pointer <-> end-of(buffer)
    write somevalue at pointer
    pointer++
  (stop event timer)

Given that the regression is most noticeable when the memory-block-size is very small, the part of the code whose execution count is much different between the two runs is the mutex locking, global memory increment and the start/stop of event timer.
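The difference in execution count follows directly from the loop condition in the pseudocode: the loop repeats until the total size is reached, so the number of iterations is the total size divided by the block size, rounded up. A small sketch (the function name is mine) makes the two cases from the runs above concrete:

```shell
# Number of loop iterations of the memory workload:
# repeat until total_bytes >= total size, i.e. ceil(total / block).
memory_iterations() {
  # $1: total size in bytes, $2: block size in bytes
  echo $(( ($1 + $2 - 1) / $2 ))
}

M=$((1024 * 1024))
memory_iterations $((32 * M)) 64            # prints 524288
memory_iterations $((32 * M)) $((32 * M))   # prints 1
```

So with 64-byte blocks the mutex, the counter increment and the event timer are each exercised 524288 times, against a single time with one 32M block.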

In a second phase, we also saw that mutex locking itself is not impacted. In the above case, we have 524288 executions. However, if we run the mutex workload this number of times, we see that this hardly has any effect:

$ sysbench --test=mutex --mutex-num=1 --mutex-locks=524288 --mutex-loops=0 run
Host:      total time:        0.0275s
Guest:     total time:        0.0286s

The code for the mutex workload, knowing that we run with one thread, looks like this:

mutex_locks = 524288
(start event timer)
do
  lock = get_mutex()
  thread_mutex_lock()
  global_var++
  thread_mutex_unlock()
  mutex_locks--
until mutex_locks = 0;
(stop event timer)

To check if the timer might be the culprit, let’s look for a benchmark that mainly does timer checks. The cpu workload can be used, when we tell sysbench that the prime to check is 3 (as its internal loop runs from 3 till the given number, and when the given number is 3 it skips the loop completely) and we ask for 524288 executions.

$ sysbench --test=cpu --cpu-max-prime=3 --max-requests=524288 run
Host:  total time:  0.1640s
Guest: total time: 21.0306s

Gotcha! Now, the event timer (again from looking at the code) contains two parts: getting the current time (using clock_gettime()) and logging the start/stop (which is done in memory structures). Let’s make a small test application that gets the current time (using the real-time clock as the sysbench application does) and see if we get similar results:

$ cat test.c
#include <stdio.h>
#include <time.h>

int main(void) {
  struct timespec tps;
  long int i = 524288;
  /* Call clock_gettime() as often as the sysbench run had events. */
  while (i-- > 0)
    clock_gettime(CLOCK_REALTIME, &tps);
  return 0;
}

$ gcc -o test test.c -lrt
$ time ./test
Host:  0m0.019s
Guest: 0m5.030s
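Dividing these timings by the number of calls gives a rough per-call cost. The helper below is a sketch (its name is my own); since time measures the whole process, including startup, the host figure in particular is only an approximation.

```shell
# Average cost of one call in nanoseconds, given total time and call count.
per_call_ns() {
  # $1: total time in seconds (decimal point), $2: number of calls
  LC_ALL=C awk -v t="$1" -v n="$2" 'BEGIN { printf "%.0f\n", t / n * 1e9 }'
}

per_call_ns 0.019 524288   # host:  prints 36
per_call_ns 5.030 524288   # guest: prints 9594
```

In other words, a single clock_gettime() call costs in the order of tens of nanoseconds on the host and almost ten microseconds on the guest.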

So given that clock_gettime() is run twice per event in sysbench (once at the start and once at the stop of the event timer), we already have about 10 seconds of overhead on the guest (and only 0,04s on the host). When such time-related functions are slow, it is wise to take a look at the clock source configured on the system. On Linux, you can check this by looking at /sys/devices/system/clocksource/*.

# cd /sys/devices/system/clocksource/clocksource0
# cat current_clocksource
kvm-clock
# cat available_clocksource
kvm-clock tsc hpet acpi_pm

Although kvm-clock is supposed to be the best clock source, let’s switch to the tsc clock:

# echo tsc > current_clocksource

If we rerun our test application, we get a much better result:

$ time ./test
Host:  0m0.019s
Guest: 0m0.024s
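Before switching, it is worth checking that the clock source you want is actually offered by the kernel. The following sketch wraps the two sysfs files shown earlier; the function name is hypothetical, and the optional directory argument exists only so that the function can be tried against a copy of the files.

```shell
# Report the active clock source and whether a wanted one is available.
clocksource_check() {
  # $1: wanted clock source, $2: sysfs directory (defaults to clocksource0)
  dir=${2:-/sys/devices/system/clocksource/clocksource0}
  current=$(cat "$dir/current_clocksource") || return 2
  echo "current: $current"
  case " $(cat "$dir/available_clocksource") " in
    *" $1 "*) echo "$1: available" ;;
    *)        echo "$1: not available"; return 1 ;;
  esac
}
```

For example, clocksource_check tsc on the guest used here would report kvm-clock as the current source and tsc as available. Note that a change made through current_clocksource does not survive a reboot.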

So, what does that mean for our previous benchmark results?

$ sysbench --test=cpu --cpu-max-prime=20000 run
Host:            35,3049 sec
Guest (before):  36,5582 sec
Guest (now):     35,6416 sec

$ sysbench --test=fileio --file-total-size=6G --file-test-mode=rndrw --max-time=300 --max-requests=0 --file-extra-flags=direct run
Host:            1,8424 MB/sec
Guest (before):  1,5591 MB/sec
Guest (now):     1,5912 MB/sec

$ sysbench --test=memory --memory-block-size=1M --memory-total-size=10G run
Host:            3959,78 MB/sec
Guest (before):  3079,29 MB/sec
Guest (now):     3821,89 MB/sec

$ sysbench --test=threads --num-threads=128 --max-time=10s run
Host:            9765 executions
Guest (before):   512 executions
Guest (now):      529 executions

So we notice that this small change has nice effects on some of the tests. The CPU benchmark improves from 3,55% overhead to 0,95%; fileio stays roughly the same (from 15,38% to 13,63%), memory improves from 22,24% overhead to 3,48% and threads remains about status quo (from 94,76% slower to 94,58%).

That doesn’t mean that the VM is now suddenly faster or better than before – what we changed was how long a certain time measurement takes, a measurement which the benchmark software itself uses heavily. This goes to show how important it is to
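These percentages can be reproduced from the raw figures. The sketch below uses two hypothetical helpers: for durations (lower is better) the overhead is taken as (guest - host) / host, for throughput or event counts (higher is better) as (host - guest) / host. Decimal points are used instead of the decimal commas in the listings.

```shell
# Relative overhead of the guest in percent, rounded to two decimals.
# For durations (lower is better): (guest - host) / host.
overhead_time() {
  LC_ALL=C awk -v h="$1" -v g="$2" 'BEGIN { printf "%.2f\n", (g - h) / h * 100 }'
}

# For throughput or event counts (higher is better): (host - guest) / host.
overhead_rate() {
  LC_ALL=C awk -v h="$1" -v g="$2" 'BEGIN { printf "%.2f\n", (h - g) / h * 100 }'
}

overhead_time 35.3049 36.5582   # cpu, before:  prints 3.55
overhead_rate 3959.78 3821.89   # memory, now:  prints 3.48
```

The same two functions give the fileio and threads figures when fed the host and guest values from the listings.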

  1. understand fully how the benchmark software works and measures
  2. realize that the importance of access to the source code is not to be underestimated
  3. know that performance benchmarks give figures, but do not tell you how your users will experience the system

That’s it for the sysbench benchmark for now (the MySQL part will need to wait until a later stage).

Comparing performance with sysbench: memory, threads and mutexes

In the previous post, I gave some feedback on the cpu and fileio workload tests that sysbench can handle. Next on the agenda are the memory, threads and mutex workloads.

When using the memory workload, sysbench will allocate a buffer (provided through the --memory-block-size parameter, defaults to 1kbyte) and each execution will read or write to this memory (--memory-oper, defaults to write) in a random or sequential manner (--memory-access-mode, defaults to sequential).

$ sysbench --test=memory --memory-block-size=1M --memory-total-size=10G run
Host throughput, 1M:  3959,78 MB/sec
Guest throughput, 1M: 3079,29 MB/sec

The guest has a lower throughput (about 78% of the host), which is lower than what most online posts report on KVM performance. We’ll get back to that later. Let’s look at the default block size of 1k (meaning that the benchmark will do a lot more executions before it reaches the total memory size):

$ sysbench --test=memory --memory-total-size=1G run
Host throughput, 1k:  1702,59 MB/sec
Guest throughput, 1k:   23,67 MB/sec

This is a lot worse: the guest’s throughput is only 1,4% of the host throughput! The qemu-kvm process on the host is also taking up a lot of CPU.

Now let’s take a look at the other workload, threads. In this particular workload, you identify the number of threads (--num-threads), the number of locks (--thread-locks) and the number of times a thread should run its ‘lock-yield..unlock’ workload (--thread-yields). The more locks you identify, the fewer threads will have the same lock (each thread is allocated a single lock during an execution, but every new execution will give it a new lock so the threads do not always take the same lock).

Note that parts of this are also handled by the other tests: mutexes are used when a new operation (execution) for the thread is prepared. In case of the memory-related workload above, the smaller the buffer size, the more frequently thread operations are needed. In the last run we did (with the bad performance), millions of operations were executed (although no yields were performed). Something similar can be simulated using a single lock, single thread and a very high number of operations and no yields:

$ sysbench --test=threads --num-threads=1 --thread-yields=0 --max-requests=1000000 --thread-locks=1 run
Host runtime:    0,3267 s  (event:    0,2278)
Guest runtime:  40,7672 s  (event:   30,6084)

This means that the guest “throughput” problems from the memory identified above seem to be related to this rather than memory-specific regressions. To verify if the scheduler itself also shows regressions, we can run more threads concurrently. For instance, running 128 threads simultaneously, using the otherwise default settings, during 10 seconds:

$ sysbench --test=threads --num-threads=128 --max-time=10s run
Host:   9765 executions (events)
Guest:   512 executions (events)

Here we get only 5% throughput.

Let’s focus on the mutex again, as sysbench has an additional mutex workload test. The workload has each thread running a local fast loop (simple increments, --mutex-loops) after which it takes a random mutex (one of --mutex-num), locks it, increments a global variable and then releases the mutex again. This is repeated for the number of locks identified (--mutex-locks). If mutex operations were the cause of the performance issues above, then we would notice that the mutex operations are a major performance regression on my system.

Let’s run that workload with a single thread (default), a single loop iteration and a single mutex.

$ sysbench --test=mutex --mutex-num=1 --mutex-locks=50000000 --mutex-loops=1 run
Host (duration):   2600,57 ms
Guest (duration):  2571,44 ms

In this example, we see that the mutex operations on the guest run at almost the same speed as on the host (within about 1%), so pure mutex operations are not likely to be the cause of the performance regressions earlier on. So what does give the performance problems? Well, that investigation will be for the third and last post in this series ;-)

Posts for Thursday, April 18, 2013

Another Gentoo Hardened month has passed

Another month has passed, so time to mention again what we have all been doing lately ;-)

Toolchain

Version 4.8 of GCC is available in the tree, but currently masked. The package contains a fix needed to build hardened-sources, and a fix for the asan (address sanitizer). asan support in GCC 4.8 might be seen as an improvement security-wise, but it is yet unclear if it is an integral part of GCC or could be disabled with a configure flag. Apparently, asan “makes building gcc 4.8 crazy”. Seeing that it comes from Google, and building Google Chromium is also crazy, I start seeing a pattern here.

Anyway, it turns out that PaX/grSec and asan do not get along yet (asan assumes/uses hardcoded userland address space size values, which breaks when UDEREF is set as it takes a bit away from that size):

ERROR: AddressSanitizer failed to allocate 0x20000001000 (2199023259648) bytes at address 0x0ffffffff000

Given that this is hardcoded in the resulting binaries, it isn’t sufficient to change the size value from 47 bits to 46 bits as hardened systems can very well boot a kernel with and another kernel without UDEREF, causing the binaries to fail on the other kernel. Instead, a proper method would be to dynamically check the size of a userland address.

However, GCC 4.8 also brings along some nice enhancements and features. uclibc profiles work just fine with GCC 4.8, including armv7a and mips/mipsel. The latter is especially nice to hear, since mips used to require significant effort with previous GCCs.

Kernel and grSecurity/PaX

More recent kernels have now been stabilized to stay close to the grSecurity/PaX upstream developments. The most recent stable kernel now is hardened-sources-3.8.3. Others still available are hardened-sources versions 3.2.40-r1 and 2.6.32-r156.

The support for XATTR_PAX is still progressing, but a few issues have come up. One is that non-hardened systems are seeing warnings about pax-mark not being able to set the XATTR_PAX on tmpfs since vanilla kernels do not have the patch to support user.* extended attribute namespaces for tmpfs. A second issue is that the install application, as provided by coreutils, does not copy extended attributes. This has impact on ebuilds where pax markings are done before the install phase of a package. But only doing pax markings after the install phase isn’t sufficient either, since sometimes we need the binaries to be marked already for test phases or even in the compile phase. So this is still something on the near horizon.

Most likely the necessary tools will be patched to include extended attributes on copy operations. However, we need to take care only to copy over those attributes that make sense: user.pax does, but security ones like security.evm and security.selinux shouldn’t as those are either recomputed when needed, or governed through policy. The idea is that USE=”pax_kernel” will enable the above on coreutils.

SELinux

The SELinux support in Gentoo has seen a fair share of updates on the userland utilities (like policycoreutils, setools, libselinux and such). Most of these have already made the stable tree or are close to being bumped to stable. The SELinux policy also has been updated a lot: most changes can be tracked through bugzilla, looking for the sec-policy r13 whiteboard. The changes can be applied to the system immediately if you use the live ebuilds (like selinux-base-9999), but I’m planning on releasing revision 13 of our policy set soon.

System Integrity

Some of the “early adopter” problems we’ve noticed on Gentoo Hardened have been integrated in the repositories upstream and are slowly progressing towards the main Linux kernel tree.

Profiles

All hardened profiles have been moved to the 13.0 base. Some people frowned when they noticed that the uclibc profiles do not inherit from any architecture-related profile. This is however with reason: the architecture profiles are (amongst other reasons) focusing on the glibc specifics of the architecture. Since the profile intended here is for uclibc, those changes are not needed (nor wanted). Hence, these are collapsed in a single profile.

Documentation

For SELinux, the SELinux handbook now includes information about USE=”unconfined” as well as the selinux_gentoo init script as provided by policycoreutils. Users who are already running with SELinux enabled can just look at the Change History to see which changes affect them.

A set of tutorials (which I’ve blogged about earlier as well) have been put online at the Gentoo Wiki. Next to the SELinux tutorials, an article pertaining to AIDE has been added as well as it fits nicely within the principles/concepts of the System Integrity subproject.

Media

If you don’t do it already, start following @GentooHardened ;-)

Comparing performance with sysbench: cpu and fileio

Being busy with virtualization and additional security measures, I frequently come in contact with people asking me what the performance impact is. Now, you won’t find the performance impact of SELinux here as I have no guests nor hosts that run without SELinux. But I did want to find out what one can do to compare system (and later application) performance, so I decided to take a look at the various benchmark utilities available. In this first post, I’ll take a look at sysbench (using 0.4.12, released in March 2009 – unlike what you would think from the looks of the site alone) to compare the performance of my KVM guest versus host.

The obligatory system information: the host is a HP Pavilion dv7 3160eb with an Intel Core i5-430M processor (dual-core with 2 threads per core). Frequency scaling is disabled – the CPU is fixed at 2.13 GHz. The system has 4 GB of memory (DDR3), the internal hard disks are configured as a software RAID1 and with LVM on top (except for the file system that hosts the virtual guest images, which is a plain software RAID1). The guests run with the boot options given below, meaning 1.5 GB of memory, 2 virtual CPUs of the KVM64 type. The CFLAGS for both are given below as well, together with the expanded set given by gcc ${CFLAGS} -E -v - </dev/null 2>&1 | grep cc1.

/usr/bin/qemu-kvm -monitor stdio -nographic -gdb tcp::1301 \
  -vnc 127.0.0.1:14 \
  -net nic,model=virtio,macaddr=00:11:22:33:44:b3,vlan=0 \
  -net vde,vlan=0 \
  -drive file=/srv/virt/gentoo/test/pg1.img,if=virtio,cache=none \
  -k nl-be -m 1536 -cpu kvm64 -smp 2

# For host
CFLAGS="-march=core2 -O2 -pipe"
#CFLAGS="-D_FORTIFY_SOURCE=2 -fno-strict-overflow -march=core2 \
         -fPIE -O2 -fstack-protector-all"
# For guest
CFLAGS="-march=x86-64 -O2 -pipe"
#CFLAGS="-fno-strict-overflow -march=x86-64 -fPIE -O2 \
         -fstack-protector-all"

I am aware that the CFLAGS between the two are not the same (duh), and I know as well that the expansion given above isn’t entirely accurate. But still, it gives some idea on the differences.

Now before I go on to the results, please keep in mind that I am not a performance expert, not even a performance experienced or even performance wanna-be experienced person: the more I learn about the inner workings of an operating system such as Linux, the more complex it becomes. And when you throw in additional layers such as virtualization, I’m almost completely lost. In my day-job, some people think they can “prove” the inefficiency of a hypervisor by counting from 1 to 100’000 and adding the numbers, and then take a look at how long this takes. I think this is short-sighted, as this puts load on a system that does not simulate reality. If you really want to do performance measures for particular workloads, you need to run those workloads and not some small script you hacked up. That is why I tend to focus on applications that use workload simulations for infrastructural performance measurements (like HammerDB for performance testing databases). But for this blog post series, I’m first going to start with basic operations and later posts will go into more detail for particular workloads (such as database performance measurements).

Oh, and BTW, when I display figures with a comma (“,”), that comma means decimal (so “1,00” = “1”).

The figures below are numbers that can be interpreted in many ways, and can prove everything. I’ll sometimes give my interpretation to it, but don’t expect to learn much from it – there are probably much better guides out there for this. The posts are more of a way to describe how sysbench works and what you should take into account when doing performance benchmarks.

So the testing is done using sysbench, which is capable of running CPU, I/O, memory, threading, mutex and MySQL tests. The first run of it that I did was a single-thread run for CPU performance testing.

$ sysbench --test=cpu --cpu-max-prime=20000 run

This test verifies prime numbers by dividing the number with sequentially increasing numbers and verifying that the remainder (modulo calculation) is zero. If it is, then the number is not prime and the calculation goes on to the next number; otherwise, if none have a remainder of 0, then the number is prime. The maximum number that it divides by is calculated by taking the integer part of the square root of the number (so for 17, this is 4). This algorithm is very simple, so you should also take into account that during the compilation of the benchmark, the compiler might already have optimized some of it.
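The check described above can be sketched in a few lines of Python. This is only an illustration of the trial division idea, not the sysbench source; the function names and the exact loop bounds are choices made for this sketch.

```python
import math

def is_prime(number):
    """Trial division: try every divisor from 2 up to the integer square root."""
    limit = math.isqrt(number)        # for 17 this is 4
    for divisor in range(2, limit + 1):
        if number % divisor == 0:     # remainder zero, so not a prime
            return False
    return True                       # no divisor left a remainder of 0

def count_primes(max_prime):
    """Count the primes between 2 and max_prime (inclusive)."""
    return sum(1 for number in range(2, max_prime + 1) if is_prime(number))
```

Calling count_primes(20000) would then do roughly the kind of work that a single event in the cpu test stands for, although an interpreted loop says nothing about the timings of the compiled benchmark.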

Let’s look at the numbers.

Run     Stat     Host      Guest
1.1    total   35,4331   37,0528
     e.total   35,4312   36,8917
1.2    total   35,1482   36,1951
     e.total   35,1462   36,0405
1.3    total   35,3334   36,4266
     e.total   35,3314   36,2640
================================
avg    total   35,3049   36,5582
     e.total   35,3029   36,3987
med    total   35,3334   36,4266
     e.total   35,3314   36,2640

On average (I did three runs on each system), the guest took 3,55% more time to finish the test than the host (total). If we look at the pure calculation (so not the remaining overhead of the inner workings – e.total) then the guest took 3,10% more time. The median however (the run that wasn’t the fastest nor the slowest of the three) has the guest taking 3,09% more time (total) and 2,64% more time (e.total).
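For those who want to redo the arithmetic, the percentages can be recomputed from the "total" rows of the table with a small Python snippet. The variable and function names are just picked for this sketch.

```python
from statistics import mean, median

# "total" values of the three single-thread runs, taken from the table above
host_total = [35.4331, 35.1482, 35.3334]
guest_total = [37.0528, 36.1951, 36.4266]

def overhead_pct(host, guest):
    """How much longer (in percent) the guest needed compared to the host."""
    return (guest / host - 1.0) * 100.0

avg_overhead = overhead_pct(mean(host_total), mean(guest_total))
med_overhead = overhead_pct(median(host_total), median(guest_total))
print("avg: %.2f%%  med: %.2f%%" % (avg_overhead, med_overhead))
```

Rounded to two decimals, this yields the 3,55% and 3,09% mentioned above.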

Let’s look at the two-thread results.

Run     Stat     Host      Guest
1.1    total   17,5185   18,0905
     e.total   35,0296   36,0217
1.2    total   17,8084   18,1070
     e.total   35,6131   36,0518
1.3    total   18,0683   18,0921
     e.total   36,1322   36,0194
================================
avg    total   17,7984   18,0965
     e.total   35,5916   36,0310
med    total   17,8084   18,0921
     e.total   35,6131   36,0194

With these figures, we notice that the guest average total run time takes 1,67% more time to complete, and the event time only 1,23%. I was personally expecting that the guest would have a higher percentage than previously (gut feeling – never trust it when dealing with complex matters) but was happy to see that the difference wasn’t higher. I’m not going to analyze this in more detail and will just go to the next test: fileio.

In case of fileio testing, I assume that the hypervisor will take up more overhead, but keep in mind that you also need to consider the environmental factors: LVM or not, RAID1 or not, mount options, etc. Since I am comparing guests versus hosts here, I should look for a somewhat comparable setup. Hence, I will look for the performance of the host (software raid, LVM, ext4 file system with data=ordered) and the guest (images on software raid, ext4 file system with data=ordered and barrier=0, and LVM in guest).

Furthermore, sysbench suggests running the test on a file set that is much larger than the available RAM. I’m going to run the tests on a 6 GB file size, and also enable O_DIRECT for writes so that some caches (page cache) are not used. This can be done using --file-extra-flags=direct.

As with all I/O-related benchmarks, you need to define which kind of load you want to test with. Are the I/Os sequential (like reading or writing a large file completely) or random? For databases, you are most likely interested in random reads (data, for selects) and sequential writes (into transaction logs). A file server usually has random read/write. In the below test, I’ll use a combined random read/write.

$ sysbench --test=fileio --file-total-size=6G prepare
$ sysbench --test=fileio --file-total-size=6G --file-test-mode=rndrw --max-time=300 --max-requests=0 --file-extra-flags=direct run
$ sysbench --test=fileio --file-total-size=6G cleanup

In the output, the throughput seems to be most important:

Operations performed:  4348 Read, 2898 Write, 9216 Other = 16462 Total
Read 67.938Mb  Written 45.281Mb  Total transferred 113.22Mb  (1.8869Mb/sec)

In the above case, the throughput is 1,8869 Mbps. So let’s look at the (averaged) results:

Host:  1,8424 Mbps
Guest: 1,5591 Mbps

The above figures (which are an average of 3 runs) tell us that the guest has a throughput of about 85% of the host (so we take about a 15% performance hit on random read/write I/O). Now I used sysbench here to compare the I/O of the guest against the host, but other usages apply as well. For instance, let’s look at the impact of data=ordered versus data=journal (taken on the host):

6G, data=ordered, barrier=1: 1,8435 Mbps
6G, data=ordered, barrier=0: 2,1328 Mbps
6G, data=journal, barrier=1: 599,85 Kbps
6G, data=journal, barrier=0: 767,93 Kbps

From the figures, we can see that the data=journal option slows down the throughput to a final figure about 30% of the original throughput (70% decrease!). Also, disabling barriers has a positive impact on performance, giving about 15% throughput gain. This is also why some people report performance improvements when switching to LVM, as – as far as I can tell (but finding a good source on this is difficult) – LVM by default disables barriers (but does honor the barrier=1 mount option if you provide it).
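The comparison of the four mount option combinations can be redone with a bit of Python. Note the assumption in this sketch that sysbench's Kb and Mb differ by a factor of 1024; with a factor of 1000 the ratios shift slightly but the conclusion stays the same.

```python
KB_PER_MB = 1024.0  # assumption: 1 Mb = 1024 Kb in the sysbench output

# Throughput on the host, normalized to Mb/sec
throughput = {
    ("ordered", 1): 1.8435,
    ("ordered", 0): 2.1328,
    ("journal", 1): 599.85 / KB_PER_MB,
    ("journal", 0): 767.93 / KB_PER_MB,
}

# Gain (in percent) from disabling barriers with data=ordered
barrier_gain = (throughput[("ordered", 0)] / throughput[("ordered", 1)] - 1.0) * 100.0

# data=journal throughput as a percentage of data=ordered (both with barrier=1)
journal_share = throughput[("journal", 1)] / throughput[("ordered", 1)] * 100.0

print("barrier=0 gain: %.1f%%, data=journal share: %.1f%%" % (barrier_gain, journal_share))
```

Under that assumption, data=journal ends up at a bit over 30% of the data=ordered throughput, and disabling barriers gains a bit over 15%.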

That’s about it for now – the next post will be about the memory and threads tests within sysbench.

Posts for Wednesday, April 17, 2013

Simple drawing for I/O positioning

Instead of repeatedly trying to create an overview of the various layers involved with I/O operations within Linux on whatever white-board is in the vicinity, I decided to draw one up in Draw.io that I can then update as I learn more from this fascinating world. The drawing’s smaller blocks within the layers are meant to give some guidance to what is handled where, so they are definitely not complete.

So for those interested (or those that know more of it than I ever will and prepared to help me out):

io-layers

I hope it isn’t too far from the truth.

Posts for Tuesday, April 16, 2013

What could SELinux have done to mitigate the postgresql vulnerability?

Gentoo is one of the various distributions that support SELinux as a Mandatory Access Control system to, amongst other things, mitigate the results of a successful exploit against software. So what about the recent PostgreSQL vulnerability?

When correctly configured, the PostgreSQL daemon will run in the postgresql_t domain. In SELinux-speak, a domain can be seen as a name granted to a set of permissions (what is allowed) and assigned to one or more processes. A process that “runs in domain postgresql_t” will be governed by the policy rules (what is and isn’t allowed) for that domain.

The vulnerability we speak of is about creating new files or overwriting existing files, potentially corrupting the database itself (when the database files are overwritten). Creating new files is handled through the create privilege on files (and add_name on directories), writing into files is handled through the write privilege. Given certain circumstances, one could even write commands inside files that are executed by particular users on the system (btw, the link gives a great explanation on the vulnerability).

So let’s look at what SELinux does and could have done.

In the current situation, as we explained, postgresql_t is the only domain we need to take into account (the PostgreSQL policy does not use separate domains for the runtime processes). Let’s look at what directory labels it is allowed to write into:

$ sesearch -s postgresql_t -c dir -p add_name -SCATd
Found 11 semantic av rules:
   allow postgresql_t postgresql_log_t : dir { add_name } ; 
   allow postgresql_t var_log_t : dir { add_name } ; 
   allow postgresql_t var_lock_t : dir { add_name } ; 
   allow postgresql_t tmp_t : dir { add_name } ; 
   allow postgresql_t postgresql_tmp_t : dir { add_name } ; 
   allow postgresql_t postgresql_var_run_t : dir { add_name } ; 
   allow postgresql_t postgresql_db_t : dir { add_name } ; 
   allow postgresql_t etc_t : dir { add_name } ; 
   allow postgresql_t tmpfs_t : dir { add_name } ; 
   allow postgresql_t var_lib_t : dir { add_name } ; 
   allow postgresql_t var_run_t : dir { add_name } ; 

So the PostgreSQL service is allowed to create files inside directories labeled with one of the following labels:

  • postgresql_log_t, used for PostgreSQL log files (/var/log/postgresql)
  • var_log_t, used for the generic log files (/var/log)
  • var_lock_t, used for lock files (/run/lock or /var/lock)
  • tmp_t, used for the temporary file directory (/tmp or /var/tmp)
  • postgresql_tmp_t, used for the PostgreSQL temporary files/directories
  • postgresql_var_run_t, used for the runtime information (like PID files) of PostgreSQL (/var/run/postgresql)
  • postgresql_db_t, used for the PostgreSQL database files (/var/lib/postgresql)
  • etc_t, used for the generic system configuration files (/etc/)
  • var_lib_t, used for the /var/lib data
  • var_run_t, used for the /var/run or /run data

Next to this, depending on the label of the directory, the PostgreSQL service is allowed to write into files with the following label assigned (of importance to both creating new files as well as overwriting existing ones):

$ sesearch -s postgresql_t -c file -p write -SCATd
Found 11 semantic av rules:
   allow postgresql_t postgresql_log_t : file { write } ; 
   allow postgresql_t postgresql_lock_t : file { write } ; 
   allow postgresql_t faillog_t : file { write } ; 
   allow postgresql_t lastlog_t : file { write } ; 
   allow postgresql_t postgresql_tmp_t : file { write } ; 
   allow postgresql_t hugetlbfs_t : file { write } ; 
   allow postgresql_t postgresql_var_run_t : file { write } ; 
   allow postgresql_t postgresql_db_t : file { write } ; 
   allow postgresql_t postgresql_t : file { write } ; 
   allow postgresql_t security_t : file { write } ; 
   allow postgresql_t etc_t : file { write } ;

Found 6 semantic te rules:
   type_transition postgresql_t var_log_t : file postgresql_log_t; 
   type_transition postgresql_t var_lock_t : file postgresql_lock_t; 
   type_transition postgresql_t tmp_t : file postgresql_tmp_t; 
   type_transition postgresql_t tmpfs_t : file postgresql_tmp_t; 
   type_transition postgresql_t var_lib_t : file postgresql_db_t; 
   type_transition postgresql_t var_run_t : file postgresql_var_run_t; 

If an exploit creates a new file, the add_name permission on the directory is needed. If, on the other hand, the exploit is overwriting existing files, I think the only permission needed here is the write on the files (also open, but all the writes have open as well in the above case).
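To see which label a newly created file would end up with, the two sesearch outputs can be combined. Below is a small Python sketch that does so on a few of the lines shown above; the regular expressions and the function name are only meant for this illustration and do not cover the full policy syntax. It relies on the default SELinux behaviour that a new file inherits the type of its parent directory when no type_transition rule applies.

```python
import re

# A few lines copied from the sesearch output above
DIR_RULES = """
   allow postgresql_t var_log_t : dir { add_name } ;
   allow postgresql_t tmp_t : dir { add_name } ;
   allow postgresql_t postgresql_db_t : dir { add_name } ;
   allow postgresql_t etc_t : dir { add_name } ;
"""
TRANS_RULES = """
   type_transition postgresql_t var_log_t : file postgresql_log_t;
   type_transition postgresql_t tmp_t : file postgresql_tmp_t;
"""

ALLOW_RE = re.compile(r"allow\s+(\S+)\s+(\S+)\s*:\s*(\S+)\s*\{\s*([^}]*)\}")
TRANS_RE = re.compile(r"type_transition\s+(\S+)\s+(\S+)\s*:\s*(\S+)\s+(\S+?);")

def new_file_labels(dir_rules, trans_rules):
    """Map each directory type with add_name to the label of files created in it."""
    transitions = {}
    for line in trans_rules.splitlines():
        match = TRANS_RE.search(line)
        if match and match.group(3) == "file":
            transitions[match.group(2)] = match.group(4)
    labels = {}
    for line in dir_rules.splitlines():
        match = ALLOW_RE.search(line)
        if match and match.group(3) == "dir" and "add_name" in match.group(4).split():
            parent = match.group(2)
            # No transition rule: the new file inherits the parent directory type
            labels[parent] = transitions.get(parent, parent)
    return labels

for parent, label in sorted(new_file_labels(DIR_RULES, TRANS_RULES).items()):
    print("%s -> %s" % (parent, label))
```

In this subset, files created in var_log_t and tmp_t directories get a PostgreSQL-specific label, whereas files created in etc_t or postgresql_db_t directories keep the directory label. Whether the creation itself is allowed also depends on the create permission for the resulting file type, which the queries above do not show.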

Now accessing and being able to write files into the database file directory is expected – it is the functionality of the server, so unless we could separate domains more, this is a “hit” we need to take. Sadly though, this is also the label used for the PostgreSQL service account home directory here (not sure if this is for all distributions), making it more realistic that an attacker writes something in the home directory .profile file and hopes for the administrator to do something like su postgres -.

Next, the etc_t write privileges also worry me, not only because it can write there, but also because I can hardly understand why – PostgreSQL is supposed to run under its own, non-root user (luckily), so unless there are etc_t labeled directories owned by the PostgreSQL service account (or world writeable – please no, kthx), it cannot create anything there anyway. And this isn’t an “inherited” permission from something – the policy currently has files_manage_etc_files(postgresql_t) set, and has been since 2005 or earlier. I’m really wondering if this is still needed.

But I digress. Given that there are no PostgreSQL-owned directories nor world-writeable ones in /etc, let’s look at a few other ones.

  • security_t is used for the SELinux pseudo file system, and is used for the SEPostgreSQL support. From the looks of it, only the root Linux user has the rights to do really harmful things on this file system (and only if he too has write permissions on security_t), non-root should be limited to verifying if contexts exist or have particular rights. Still, I might investigate this further as I’m intrigued about many of the pseudo files in /sys/fs/selinux that I’m not fully sure yet what they deal with.
  • tmp_t should not be a major concern. Most (if not all) daemons and services that use temporary files have file transitions to their own type so that access to these files, even if it would be allowed by regular permissions, is still prohibited by SELinux
  • lastlog_t is also a weird one, again because it shouldn’t be writeable for anyone else but root accounts; if successful, an attacker can overwrite the lastlog information which might be used by some as a means for debugging who was logged on when (part of forensics).

Given the information above, it is a bit sad to see that SELinux can’t protect PostgreSQL users from this particular vulnerability – most of the “mitigation” (if any) is because the process runs as non-root to begin with (which is another hint at users not to think SELinux is sufficient to restrict the permissions of processes). But could it have been different?

In my opinion, yes, and I’ll see if we can learn from it for the future.

First of all, we should do more policy code auditing. It might not be easy to remove policy rules generally, but we should at least try. I use a small script that enables auditing (SELinux auditing, so auditallow statements) for the entire domain, and then selectively disables auditing until I get no hits anymore. The remainder of auditallow statements warrant a closer look to see if they are still needed or not. I’ll get onto that in the next few days.

Second, we might want to have service accounts use a different home directory, one where they have the necessary search privileges but no write privileges. Exploits that write stuff into a home directory (hoping for a su postgres -) are then mitigated a bit.

Third, we might want to look into separating the domains according to the architecture of the service. This requires intimate knowledge of the ins and outs of PostgreSQL and might even require PostgreSQL patching, so is not something light. But if no patching is needed (such as when all process launches are done using known file executions) we could have a separate domain for the master process, server processes and perhaps even the various subfunction processes (like the WAL writer, BG writer, etc.). The Postfix service has such a more diverse (but also complex) policy. Such a subdomain structure in the policy might reduce the risk if the vulnerable process (I think this is the master process) does not need to write to database files (as this is handled by other processes), so no postgresql_db_t write privileges.

If others have ideas on how we can improve service security (for instance through SELinux policy development) or know of other exploits related to this vulnerability that I didn’t come across yet, please leave a comment below.

Posts for Sunday, April 14, 2013

python timings

April 14th, 2013

On the one hand, we measure database query latency in milliseconds. On the other hand, a read from L1 cache costs less than a nanosecond. That got me thinking that there is a pretty big spectrum in between the two. I wonder how much time typical language constructs cost. Just as a reminder, here is the typical list of important timings:

            0.5 ns        read from L1 cache
            1   ns        execute cpu instruction
            7   ns        read from L2 cache
          100   ns        read from memory
       20,000   ns        transmit over local network
    8,000,000   ns        read from disk
  150,000,000   ns        transmit over the internet Europe -> US
1,000,000,000   ns        one second

There happens to be a really easy way to do a quick and dirty measurement using ipython, with its built-in timing feature. It takes an expression that it will execute a number of times, depending on how long it takes, with an upper bound in seconds. So for really trivial expressions you get a large number of repetitions:

In [66]: %timeit 1+2
10000000 loops, best of 3: 20.7 ns per loop

The catch is that timeit expects an expression, so the simplest way to get around that is to make every test a function call, and in there we can run arbitrary expressions and statements alike. The baseline will then be a function with an empty body.

Here are the results from my cpython 2.7.3:

            5 ns        assignment
            4 ns        integer_addition
           10 ns        string_concat
            5 ns        string_interpolate
           35 ns        dict_lookup
           77 ns        list_comprehension

           22 ns        branch
        1,095 ns        try_catch

       86,895 ns        create_class        
           97 ns        instantiate_class
          135 ns        call_method
          105 ns        call_function

          217 ns        get_current_time
        1,745 ns        get_current_date

Clearly this leaves a lot to be desired from a methodological standpoint. The reference list of latencies is not scaled to my laptop in particular, plus we are adding the overhead of a function call to every measurement (and then trying to subtract it out), but at least it's constant across all measurements. At best these numbers are a rough indication of how much things cost, but that's good enough for our purpose.

Finally, here is the code:

def call_function():  # 105ns
    pass

def create_class():  # 87us
    class C(object):
        pass

class D(object):
    def meth(self):
        pass

def instantiate_class():  # 202ns
    D()

d = D()
def call_method():  # 240ns
    d.meth()

def assignment():  # 110ns
    a = 1

def branch():  # 127ns
    if True:
        pass

def try_catch():  # 1.2us
    try:
        raise Exception
    except:
        pass


def integer_addition():  # 109ns
    1 + 2

def string_concat():  # 115ns
    "a" + "b"

def string_interpolate():  # 110ns
    "a%s" % "b"

lookup = {'a': 1}  # separate name, so the D instance `d` used by call_method is not overwritten
def dict_lookup():  # 140ns
    lookup['a']

l = []
def list_comprehension():  # 182ns
    [x for x in l]

import time
def get_current_time():  # 322ns
    time.time()

from datetime import datetime
def get_current_date():  # 1.85us
    datetime.now()
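For those without ipython at hand, the same approach (time a function, then subtract the cost of an empty function) can be sketched with the timeit module from the standard library. The helper names below are made up for this sketch, and it is written for a current Python 3 rather than the cpython 2.7.3 used for the numbers above, so expect different figures.

```python
import timeit

def baseline():
    """Empty function: measures only the cost of the call itself."""
    pass

def dict_lookup(table={'a': 1}):
    table['a']

def per_call_ns(func, number=200000, repeat=3):
    """Best-of-`repeat` time per call, in nanoseconds."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e9

def net_cost_ns(func):
    """Cost of func with the function call overhead subtracted out."""
    return per_call_ns(func) - per_call_ns(baseline)

print("call_function: %.0f ns" % per_call_ns(baseline))
print("dict_lookup:   %.0f ns (net)" % net_cost_ns(dict_lookup))
```

Since both measurements carry noise, the net figure for very cheap operations can fluctuate quite a bit between runs (and even dip below zero), which is another reason to treat these numbers as rough indications only.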

Posts for Friday, April 12, 2013

Missing HP 3515 Network Driver

So this week I hit a fun problem when building a new PC for a HP 3515 tower, this is the first one of these that I have built and I had thought I had loaded all the drivers from the HP website into our deployment tool (MDT 2012), however it turns out that the NIC driver was missing and wasn’t [...]

Posts for Thursday, April 11, 2013

Integrity checking with AIDE

To at least make some progress in the integrity part of Gentoo Hardened (a subproject I’d like to extend towards greater heights), I decided to write up a small guide on how to work with AIDE. The tool is simple enough (and it allowed me to test its SELinux policy module a bit) so you’ll get by fairly quickly.

However, what I’d like to know a bit more about is how to use AIDE on a hypervisor level, scanning through the file systems of the guests, without needing in-guest daemons. I wrote a small part in the guide, but I need to test it more thoroughly. In the end, I’d like to have a configuration in which AIDE runs on the host, mounting the guest file systems, scanning the necessary files and sending out reports, all one at a time (snapshot, mount, scan+report, unmount, destroy snapshot, next).

If anyone has pointers towards such a setup, it’d be greatly appreciated. It provides, in my opinion, a secure way of scanning systems even if they are completely compromised (in other words you couldn’t trust anything running inside the guest or running with the libraries or software within the guest).

Posts for Wednesday, April 10, 2013

Architecture’s existential crisis

Four posts ago, I took a break from the usual technical and on-going project posts, and instead went on a four part spree talking about Architecture. In particular, I tackled the question of Architecture’s existential crisis. It talks about issues about discipline and professionalism (actually inspired by Bob Martin’s similar talks in the software industry), the philosophies that architecture idolises, and overarching goals of the profession and the world in general.

The reason I spent so much time on this is because I believe that it is wrong to treat architecture as superficially as an art form. It is not a commodified object of entertainment like a book or movie. It isn’t something where people are given the choice to consume it. Instead, it is inherently part of our day to day lives and affects everyone. This means architects have a responsibility to others.

I’ve converted the rather long post into a LaTeX-compiled PDF, so those who haven’t read it due to the sheer size can enjoy it. Download it here.

Will resume the usual topics after this.

Posts for Tuesday, April 9, 2013

Not needing run_init for password-less service management

One of the things that has been bugging me was why, even with having pam_rootok.so set in /etc/pam.d/run_init, I cannot enjoy passwordless service management without using run_init directly:

# rc-service postgresql-9.2 status
Authenticating root.
Password: 

# run_init rc-service postgresql-9.2 status
Authenticating root.
 * status: started

So I decided to strace the two commands and look for the differences. I found out that there is even a SELinux permission for being able to use the rootok setting for passwords! Apparently, pam_rootok.so is SELinux-aware and does some additional checks.

Although I don’t know the exact details of it, it looks for the context before the call (exec) of run_init occurred. Then it checks if this domain has the right for passwd { rootok } (unless SELinux is in permissive, in which case it just continues) and only then it allows the “rootok” to succeed.

Now why doesn’t this work without using run_init? I think it has to do with how we integrate run_init in the scripts, because out of the trace I found that the previous context was also run_init_t (instead of sysadm_t):

20451 open("/proc/self/task/20451/attr/current", O_RDONLY) = 3
20451 read(3, "root:sysadm_r:run_init_t\0", 4095) = 25
20451 close(3)                          = 0
20451 gettid()                          = 20451
20451 open("/proc/self/task/20451/attr/prev", O_RDONLY) = 3
20451 read(3, "root:sysadm_r:run_init_t\0", 4095) = 25
20451 close(3) 

Because there already is a transition to run_init_t upon calling the scripts, the underlying call to runscripts causes the “previous” attribute to be set to run_init_t as well, and only then is run_init called (which then causes the PAM functions to be called). But by prepending the commands with run_init (which quickly causes the PAM functions to be called) the previous context is sysadm_t.
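The two reads in the trace can be reproduced from Python to check what a process sees as its current and previous context. This is only a sketch of the lookup, not of what pam_rootok.so does internally: the helper names are made up, and it reads the per-process files under /proc/self/attr rather than the per-task ones from the trace.

```python
def read_selinux_attr(name, pid="self"):
    """Return the content of /proc/<pid>/attr/<name>, or None if unavailable."""
    path = "/proc/%s/attr/%s" % (pid, name)
    try:
        with open(path, "rb") as handle:
            raw = handle.read()
    except OSError:
        # No procfs, no security module loaded, or no permission to read
        return None
    # The kernel terminates the context with a NUL byte, as seen in the trace
    return raw.rstrip(b"\0\n").decode("ascii", "replace") or None

def domain_of(context):
    """Extract the type (domain) from a user:role:type[:level] context."""
    parts = context.split(":")
    return parts[2] if len(parts) >= 3 else None

for attr in ("current", "prev"):
    context = read_selinux_attr(attr)
    print(attr, context, domain_of(context) if context else None)
```

Run through rc-service on the system above, I would expect both lines to show run_init_t, matching the trace; run directly from a sysadm_t shell, the previous context should be sysadm_t instead.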

I tested on a system with the following policy update, and this succeeds nicely.

policy_module(localruninit, 1.0)

gen_require(`
  class passwd { passwd chfn chsh rootok };
  type run_init_t;
')

allow run_init_t self:passwd rootok;

I’ll probably add this in Gentoo’s policy.

How far reaching vulnerabilities can go

If you follow the news a bit, you know that PostgreSQL has had a significant security vulnerability. The PostgreSQL team announced it up front and communicated how they would deal with the vulnerability (which basically comes down to saying that it is severe, that the public repositories will be temporarily frozen as developers add in the necessary fixes and start building the necessary software for a new release, and at the release moment give more details about the vulnerability).

The exploitability of the vulnerability was quickly identified, and we know that compromises wouldn’t take long. A blog post from the schemaverse tells us that exploits won’t take long (less than 24 hours) and due to the significance of the vulnerability, it cannot be stressed enough that patching should really be part of the minimal security requirements of any security-conscious organization. But patching alone isn’t the only thing to consider.

The notice from PostgreSQL also mentions that restricting access to the database through pg_hba.conf isn’t sufficient, as the vulnerable code is executed before the pg_hba.conf file is read. So one of the mitigations for the vulnerability would be a firewall (host-based or network) that restricts access to the database so that only trusted addresses are allowed. I’m personally an advocate of host-based firewalls.
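As a rough illustration of such a host-based restriction, the sketch below only prints iptables rules rather than applying them. The function name pg_firewall_rules and the example addresses (taken from the 192.0.2.0/24 documentation range) are assumptions for this example; 5432 is PostgreSQL's default port, and an embedded installation may well listen elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print iptables rules that accept connections to the
# given TCP port from the listed trusted addresses and drop all others.
# Nothing is applied to the running firewall; the rules are only printed.
pg_firewall_rules() {
    port="$1"
    shift
    for addr in "$@"; do
        echo "iptables -A INPUT -p tcp --dport ${port} -s ${addr} -j ACCEPT"
    done
    # Rules are evaluated in order, so the DROP has to come last.
    echo "iptables -A INPUT -p tcp --dport ${port} -j DROP"
}
```

Running pg_firewall_rules 5432 192.0.2.10 192.0.2.11 prints two ACCEPT rules followed by the final DROP, which can be reviewed before being fed to a root shell.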

But the thing that hits me the most is the number of applications that use “embedded” PostgreSQL database services in their product. If you are part of a larger organization with a large portfolio of software titles running in the data center, you’ll undoubtedly have seen lists (through network scans or otherwise) of systems that are running PostgreSQL as part of the product installation (and not as a “managed” database service). The HP GUIDManager, the NNMI components and the Systems Insight Manager all use embedded PostgreSQL services. Cloudera Manager can easily be set up with an “embedded” PostgreSQL (which doesn’t mean it isn’t a full-fledged PostgreSQL, but rather that the setup and management of the service is handled by the product instead of by “your own” DBA team). The same goes for Servoy.

I don’t object to products providing embedded database platforms, and especially not to their choosing PostgreSQL, which I consider a very mature, stable and feature-rich database platform (with, not to be forgotten, a very active community). But I do hope that these products take up their responsibility and release updated versions or patches for their installations to their customers very soon.

Perhaps I should ask our operational security team to run a scan to actively follow up on these…

The Margaret Thatcher hate

Margaret Thatcher (provided by Chris Collins of the Margaret Thatcher Foundation CC-BY-SA)

Yesterday Margaret Thatcher died. While I don’t live in the UK the news still impacted all of Europe; not just because it’s relatively small but also because the political agenda she kind of defined (deregulation, privatization) has been highly influential and devastating in many parts of Europe causing a lot of suffering and poverty. I agree with hardly anything she did (apart maybe from her work on ice cream) and consider myself an opponent of everything she stood for politically.

And even from that perspective I find it weird to see her being called a witch (as in “The witch is dead”), being slandered in basically every possible way.

She has not been in politics since 1992 (that was more than 20 years ago), and people still talk about her as if she were running the country, making all the wrong decisions, oppressing people for the hell of it and basically being like a Bond villain, just worse.

It is important not to forget the bad she did in the eulogies, important not to let the historical perspective be painted in a good light just because everybody just wants to make sure not to speak badly about the recently deceased.

But celebrating her death as if she ate a baby every morning? Talking without respect for her as a human being and without any sort of empathy for the people around her just because some believe that she was cold and unempathetic? No way.

I’ve never believed in the whole “an eye for an eye” dogma. She started a really bad political program, but that does not mean that she loses the basic respect that every human being deserves. That just doesn’t go away for anyone. And if you believe that you can just decide who “deserves” it, I think your opinion is as dangerous as Mrs. Thatcher’s was.

Human rights are not distributed arbitrarily based on who you believe deserves them. It’s really that simple.

P.S.: I wonder if the tone of the articles would have been different if it wasn’t a woman that many can paint to be the villain destroying Europe if not the world.

The post The Margaret Thatcher hate appeared first on tante.blog.

Posts for Sunday, April 7, 2013

Separate puppet provider for Gentoo/SELinux?

While slowly transitioning my playground infrastructure towards Puppet, I am already in the process of creating a custom provider for things such as services. Puppet uses providers as “implementations” for the functions Puppet needs. For instance, for the service type (which handles init script services), there are providers for RedHat, Debian, FreeBSD, … and it also has providers called gentoo and openrc. The openrc one uses the service scripts that Gentoo’s OpenRC provides, such as rc-service and rc-status.

On a SELinux-enabled system, and especially when using a decentralized Puppet environment (I dropped the puppet master setup in favor of a decentralized usage of Puppet), if you call rc-service to, say, start a service, it will ask for the user’s password. Of course, Puppet doesn’t want this, so I have to prefix the commands with run_init and have a pam_rootok.so rule in run_init’s PAM definition.

So far that’s a simple change: I just patched the openrc.rb file to do so. But the second problem I’m facing is that Puppet wants to use return-code-based commands for checking the run-time state of services. Even though some of my services weren’t running, Puppet either thought they were, or called the start routine and considered the service started. Sadly that wasn’t the case, as the rc-* scripts always return 0 (you’ll need to parse the output).

So what I did now is to create a simple script called runstatus which returns the state of services. It’s crude, but seems to work:

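#!/bin/sh

SERVICENAME="$1"

# We need to exit:
# 0 - if running
# 1 - if dead but PID exists
# 2 - if dead but lock file exists
# 3 - if not running
# 4 - if unknown

# Without a service name there is nothing to look up.
[ -n "${SERVICENAME}" ] || exit 4

# Query rc-status once and reuse the matching lines for both checks.
STATUS=$(rc-status -a -C | grep -- "${SERVICENAME}")

printf '%s\n' "${STATUS}" | grep -q started && exit 0
printf '%s\n' "${STATUS}" | grep -q stopped && exit 3
exit 4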

I then have the service provider (I now provide my own instead of patching the openrc one) call runstatus to get the state of a service, and call it again after trying to start a service. But as this is quite basic functionality, I’m wondering if I’m doing things the right way or not. Who else has experience with Puppet and Gentoo, and did you have to tweak things to get services and such working?
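One weakness of matching with grep is that a name such as sshd also matches any other service whose name contains it. A possible tightening, sketched here under the assumption that each service line of the rc-status -a -C output starts with the service name followed by a bracketed state, compares the first field exactly. The helper names service_state and state_to_exit_code are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of OpenRC or Puppet.
# Reads "rc-status -a -C" style output on stdin and prints the state of the
# service whose name equals the first field of a line exactly.
service_state() {
    awk -v svc="$1" '
        $1 == svc && /started/ { print "started"; found = 1; exit }
        $1 == svc && /stopped/ { print "stopped"; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Map a state to the exit codes used by runstatus (0, 3 or 4).
state_to_exit_code() {
    case "$1" in
        started) echo 0 ;;
        stopped) echo 3 ;;
        *)       echo 4 ;;
    esac
}
```

In a script this could be used as exit "$(state_to_exit_code "$(rc-status -a -C | service_state "$1")")", keeping the same exit codes as the crude version.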

Posts for Friday, April 5, 2013

NAS at Home version 2

Today I was hanging out on G+ and I noticed that Linus Torvalds is copying me and also wants a home file server. I noticed he wants to go the other route though and wants to just buy a pre-built appliance instead of building his own. Takes a little fun out of it but I most certainly understand his desire to go that route.

After looking at what people recommended I came across the Qnap TS-469-PRO app from Qnap. It looks really nice and even runs embedded Linux. Seeing as how I also run Linux and I’ve finally convinced my wife to run OS X I really like the sound of it so far.

Does anyone out there have any experience with these? I know there are other companies out there making competing appliances. Is there any advantage to this route over the homemade route, besides it being plug-and-play?


Planet Larry is not officially affiliated with Gentoo Linux. Original artwork and logos copyright Gentoo Foundation. Yadda, yadda, yadda.