From 0a6cc45426fe3baaf231efd7efe4300fd879efc8 Mon Sep 17 00:00:00 2001
From: Jason Baron <jbaron@redhat.com>
Date: Mon, 24 Oct 2011 14:59:02 +1100
Subject: [PATCH] epoll: limit paths

epoll: limit paths

The current epoll code can be tickled to run basically indefinitely in
both the loop detection path check (on ep_insert()) and the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely.  A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.

To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited.  Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm.  In
one diabolical case I tried it reduced the run-time from 15 minutes (all
in kernel time) to 0.3 seconds.
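
As a rough user-space illustration of the visited-node idea (not the
kernel code: struct node, reaches() and would_loop() are hypothetical
names, and the EP_MAX_NESTS depth limit is left out), a walk that marks
each node once might look like:

```c
#include <stddef.h>

#define MAX_LINKS 8

/* Hypothetical user-space stand-in for an epoll instance in the graph. */
struct node {
	struct node *links[MAX_LINKS];	/* epoll instances this node monitors */
	size_t nr_links;
	int visited;			/* set once the node has been walked */
	struct node *next_visited;	/* chains visited nodes for cleanup */
};

static struct node *visited_head;

/* Return 1 if 'target' is reachable from 'n', walking each node once. */
static int reaches(struct node *n, const struct node *target)
{
	size_t i;

	if (n == target)
		return 1;
	n->visited = 1;
	n->next_visited = visited_head;
	visited_head = n;
	for (i = 0; i < n->nr_links; i++) {
		if (n->links[i]->visited)
			continue;	/* already explored, skip it */
		if (reaches(n->links[i], target))
			return 1;
	}
	return 0;
}

/* Would adding the link ep -> tfile close a loop?  Clears the marks after. */
static int would_loop(struct node *ep, struct node *tfile)
{
	int ret = reaches(tfile, ep);

	while (visited_head) {
		struct node *n = visited_head;

		visited_head = n->next_visited;
		n->visited = 0;
		n->next_visited = NULL;
	}
	return ret;
}
```

Because a node is marked on its first visit, the walk follows each link
at most once, and the marks are cleared afterwards so that the next
check starts from a clean graph.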

Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but that is
more complex, since there can be multiple wakeups on different
cpus.  Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.

This is accomplished by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link) are
actually the sources for wakeup events.  I keep a list of these file
descriptors and limit the number and length of the paths that emanate
from these 'source file descriptors'.  In the current implementation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5.  Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links.  This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
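
A minimal user-space sketch of the per-length accounting described
above, assuming a simplified graph in which every file only knows the
epoll instances that watch it (struct fnode, count_paths() and
check_source() are hypothetical names, not kernel symbols):

```c
#include <stddef.h>

#define PATH_ARR_SIZE 5

/* Allowed number of paths of length 1..5 from one source file. */
static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
static int path_count[PATH_ARR_SIZE];

/* Hypothetical stand-in for a file; 'parents' are the epolls watching it. */
struct fnode {
	struct fnode **parents;
	size_t nr_parents;
};

/* Walk upward from 'f'; 'depth' is the number of links already followed. */
static int count_paths(const struct fnode *f, int depth)
{
	size_t i;

	if (depth >= PATH_ARR_SIZE)
		return -1;	/* the path would be longer than 5 links */
	for (i = 0; i < f->nr_parents; i++) {
		const struct fnode *p = f->parents[i];

		if (p->nr_parents == 0) {
			/* a path of length depth + 1 ends at a top-level epoll */
			if (++path_count[depth] > path_limits[depth])
				return -1;
		} else if (count_paths(p, depth + 1)) {
			return -1;
		}
	}
	return 0;
}

/* Return 0 if the paths emanating from 'src' stay within the limits. */
static int check_source(const struct fnode *src)
{
	int i;

	for (i = 0; i < PATH_ARR_SIZE; i++)
		path_count[i] = 0;
	return count_paths(src, 0);
}
```

The counters are reset for each source file, so the limits apply per
'source file descriptor' rather than to the whole system.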

In terms of the path limit selection, I think it's first worth noting that
the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n 'source file
descriptors'.  In this case, each 'source file descriptor' has 1 path of
length 1.  Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous.  Thus, I'm hoping that the
proposed limits will not cause any workloads that currently work to
fail.

In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations.  Currently it's only used in a subset
of the add paths.  I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths.  I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead.  Also, worth noting is that the epmutex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.

Another thing to note here is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1).  However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order.  Thus, this limit is currently easily bypassed.  The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.

Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL.  I've also
tested using the piptest.c epoll tester, which showed no difference in
performance.  I've also created a number of different epoll networks and
tested that they behave as expected.

I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/eventpoll.c            |  226 ++++++++++++++++++++++++++++++++++++++++-----
 include/linux/eventpoll.h |    1 +
 include/linux/fs.h        |    1 +
 3 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index b84fad9..414ac74 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -197,6 +197,12 @@ struct eventpoll {
 
 	/* The user that created the eventpoll descriptor */
 	struct user_struct *user;
+
+	struct file *file;
+
+	/* used to optimize loop detection check */
+	int visited;
+	struct list_head visitedllink;
 };
 
 /* Wait structure used by the poll hooks */
@@ -255,6 +261,12 @@ static struct kmem_cache *epi_cache __read_mostly;
 /* Slab cache used to allocate "struct eppoll_entry" */
 static struct kmem_cache *pwq_cache __read_mostly;
 
+/* Visited nodes during ep_loop_check(), so we can unset them when we finish */
+LIST_HEAD(visited_list);
+
+/* Files with newly added links, which need a limit on emanating paths */
+LIST_HEAD(tfile_check_list);
+
 #ifdef CONFIG_SYSCTL
 
 #include <linux/sysctl.h>
@@ -276,6 +288,12 @@ ctl_table epoll_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
+static const struct file_operations eventpoll_fops;
+
+static inline int is_file_epoll(struct file *f)
+{
+	return f->f_op == &eventpoll_fops;
+}
 
 /* Setup the structure that is used as key for the RB tree */
 static inline void ep_set_ffd(struct epoll_filefd *ffd,
@@ -711,12 +729,6 @@ static const struct file_operations eventpoll_fops = {
 	.llseek		= noop_llseek,
 };
 
-/* Fast test to see if the file is an eventpoll file */
-static inline int is_file_epoll(struct file *f)
-{
-	return f->f_op == &eventpoll_fops;
-}
-
 /*
  * This is called from eventpoll_release() to unlink files from the eventpoll
  * interface. We need to have this facility to cleanup correctly files that are
@@ -926,6 +938,96 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
 	rb_insert_color(&epi->rbn, &ep->rbr);
 }
 
+
+
+#define PATH_ARR_SIZE 5
+/* These are the number of paths of length 1 to 5 that we allow to emanate
+ * from a single file of interest. For example, we allow 1000 paths of length
+ * 1, to emanate from each file of interest. This essentially represents the
+ * potential wakeup paths, which need to be limited in order to avoid massive
+ * uncontrolled wakeup storms. The common use case should be a single ep which
+ * is connected to n file sources. In this case each file source has 1 path
+ * of length 1. Thus, the numbers below should be more than sufficient.
+ */
+int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
+int path_count[PATH_ARR_SIZE];
+
+static int path_count_inc(int nests)
+{
+	if (++path_count[nests] > path_limits[nests])
+		return -1;
+	return 0;
+}
+
+static void path_count_init(void)
+{
+	int i;
+
+	for (i = 0; i < PATH_ARR_SIZE; i++)
+		path_count[i] = 0;
+}
+
+static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
+{
+	int error = 0;
+	struct file *file = priv;
+	struct file *child_file;
+	struct epitem *epi;
+
+	list_for_each_entry(epi, &file->f_ep_links, fllink) {
+		child_file = epi->ep->file;
+		if (is_file_epoll(child_file)) {
+			if (list_empty(&child_file->f_ep_links)) {
+				if (path_count_inc(call_nests)) {
+					error = -1;
+					break;
+				}
+			} else {
+				error = ep_call_nested(&poll_loop_ncalls,
+							EP_MAX_NESTS,
+							reverse_path_check_proc,
+							child_file, child_file,
+							current);
+			}
+			if (error != 0)
+				break;
+		} else {
+			printk(KERN_ERR "reverse_path_check_proc: "
+				"file is not an ep!\n");
+		}
+	}
+	return error;
+}
+
+/**
+ * reverse_path_check - The tfile_check_list is a list of file *, which have
+ *                      links that are proposed to be newly added. We need to
+ *                      make sure that those added links don't add too many
+ *                      paths such that we will spend all our time waking up
+ *                      eventpoll objects.
+ *
+ * Returns: Returns zero if the proposed links don't create too many paths,
+ *	    -1 otherwise.
+ */
+static int reverse_path_check(void)
+{
+	int length = 0;
+	int error = 0;
+	struct file *current_file;
+
+	/* let's call this for all tfiles */
+	list_for_each_entry(current_file, &tfile_check_list, f_tfile_llink) {
+		length++;
+		path_count_init();
+		error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+					reverse_path_check_proc, current_file,
+					current_file, current);
+		if (error)
+			break;
+	}
+	return error;
+}
+
 /*
  * Must be called with "mtx" held.
  */
@@ -987,6 +1089,11 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 	 */
 	ep_rbtree_insert(ep, epi);
 
+	/* now check if we've created too many backpaths */
+	error = -EINVAL;
+	if (reverse_path_check())
+		goto error_remove_epi;
+
 	/* We have to drop the new item inside our item list to keep track of it */
 	spin_lock_irqsave(&ep->lock, flags);
 
@@ -1011,6 +1118,14 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 
 	return 0;
 
+error_remove_epi:
+	spin_lock(&tfile->f_lock);
+	if (ep_is_linked(&epi->fllink))
+		list_del_init(&epi->fllink);
+	spin_unlock(&tfile->f_lock);
+
+	rb_erase(&epi->rbn, &ep->rbr);
+
 error_unregister:
 	ep_unregister_pollwait(ep, epi);
 
@@ -1275,18 +1390,35 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
 	int error = 0;
 	struct file *file = priv;
 	struct eventpoll *ep = file->private_data;
+	struct eventpoll *ep_tovisit;
 	struct rb_node *rbp;
 	struct epitem *epi;
 
 	mutex_lock_nested(&ep->mtx, call_nests + 1);
+	ep->visited = 1;
+	list_add(&ep->visitedllink, &visited_list);
 	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
 		epi = rb_entry(rbp, struct epitem, rbn);
 		if (unlikely(is_file_epoll(epi->ffd.file))) {
+			ep_tovisit = epi->ffd.file->private_data;
+			if (ep_tovisit->visited)
+				continue;
 			error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
-					       ep_loop_check_proc, epi->ffd.file,
-					       epi->ffd.file->private_data, current);
+					ep_loop_check_proc, epi->ffd.file,
+					ep_tovisit, current);
 			if (error != 0)
 				break;
+		} else {
+			/* if we've reached a file that is not associated with
+			 * an ep, then we need to check if the newly added
+			 * links are going to add too many wakeup paths. We do
+			 * this by adding it to the tfile_check_list, if it's
+			 * not already there, and calling reverse_path_check()
+			 * during ep_insert()
+			 */
+			if (list_empty(&epi->ffd.file->f_tfile_llink))
+				list_add(&epi->ffd.file->f_tfile_llink,
+					 &tfile_check_list);
 		}
 	}
 	mutex_unlock(&ep->mtx);
@@ -1307,8 +1439,30 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
  */
 static int ep_loop_check(struct eventpoll *ep, struct file *file)
 {
-	return ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+	int ret;
+	struct eventpoll *ep_cur, *ep_next;
+
+	ret = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
 			      ep_loop_check_proc, file, ep, current);
+	/* clear visited list */
+	list_for_each_entry_safe(ep_cur, ep_next, &visited_list, visitedllink) {
+		ep_cur->visited = 0;
+		list_del(&ep_cur->visitedllink);
+	}
+	return ret;
+}
+
+static void clear_tfile_check_list(void)
+{
+	struct file *file;
+
+	/* first clear the tfile_check_list */
+	while (!list_empty(&tfile_check_list)) {
+		file = list_first_entry(&tfile_check_list, struct file,
+					f_tfile_llink);
+		list_del_init(&file->f_tfile_llink);
+	}
+	INIT_LIST_HEAD(&tfile_check_list);
 }
 
 /*
@@ -1316,8 +1470,9 @@ static int ep_loop_check(struct eventpoll *ep, struct file *file)
  */
 SYSCALL_DEFINE1(epoll_create1, int, flags)
 {
-	int error;
+	int error, fd;
 	struct eventpoll *ep = NULL;
+	struct file *file;
 
 	/* Check the EPOLL_* constant for consistency.  */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
@@ -1334,11 +1489,25 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
 	 * Creates all the items needed to setup an eventpoll file. That is,
 	 * a file structure and a free file descriptor.
 	 */
-	error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
+	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
+	if (fd < 0) {
+		error = fd;
+		goto out_free_ep;
+	}
+	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
 				 O_RDWR | (flags & O_CLOEXEC));
-	if (error < 0)
-		ep_free(ep);
-
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto out_free_fd;
+	}
+	fd_install(fd, file);
+	ep->file = file;
+	return fd;
+
+out_free_fd:
+	put_unused_fd(fd);
+out_free_ep:
+	ep_free(ep);
 	return error;
 }
 
@@ -1404,21 +1573,27 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
 	 * descriptor, there is the change of creating closed loops, which are
-	 * better be handled here, than in more critical paths.
+	 * better be handled here, than in more critical paths. While we are
+	 * checking for loops we also determine the list of files reachable
+	 * and hang them on the tfile_check_list, so we can check that we
+	 * haven't created too many possible wakeup paths.
 	 *
-	 * We hold epmutex across the loop check and the insert in this case, in
-	 * order to prevent two separate inserts from racing and each doing the
-	 * insert "at the same time" such that ep_loop_check passes on both
-	 * before either one does the insert, thereby creating a cycle.
+	 * We need to hold the epmutex across both ep_insert and ep_remove
+	 * b/c we want to make sure we are looking at a coherent view of
+	 * the epoll network.
 	 */
-	if (unlikely(is_file_epoll(tfile) && op == EPOLL_CTL_ADD)) {
+	if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
 		mutex_lock(&epmutex);
 		did_lock_epmutex = 1;
-		error = -ELOOP;
-		if (ep_loop_check(ep, tfile) != 0)
-			goto error_tgt_fput;
 	}
-
+	if (op == EPOLL_CTL_ADD) {
+		if (is_file_epoll(tfile)) {
+			error = -ELOOP;
+			if (ep_loop_check(ep, tfile) != 0)
+				goto error_tgt_fput;
+		} else
+			list_add(&tfile->f_tfile_llink, &tfile_check_list);
+	}
 
 	mutex_lock_nested(&ep->mtx, 0);
 
@@ -1437,6 +1612,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 			error = ep_insert(ep, &epds, tfile, fd);
 		} else
 			error = -EEXIST;
+		clear_tfile_check_list();
 		break;
 	case EPOLL_CTL_DEL:
 		if (epi)
@@ -1455,7 +1631,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	mutex_unlock(&ep->mtx);
 
 error_tgt_fput:
-	if (unlikely(did_lock_epmutex))
+	if (did_lock_epmutex)
 		mutex_unlock(&epmutex);
 
 	fput(tfile);
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index f362733..657ab55 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -61,6 +61,7 @@ struct file;
 static inline void eventpoll_init_file(struct file *file)
 {
 	INIT_LIST_HEAD(&file->f_ep_links);
+	INIT_LIST_HEAD(&file->f_tfile_llink);
 }
 
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ba98668..d393a68 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -985,6 +985,7 @@ struct file {
 #ifdef CONFIG_EPOLL
 	/* Used by fs/eventpoll.c to link all the hooks to this file */
 	struct list_head	f_ep_links;
+	struct list_head	f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
 #ifdef CONFIG_DEBUG_WRITECOUNT
-- 
1.7.6.4