#1
-----Original Message-----
From: pgsql-general-owner (AT) postgresql (DOT) org [mailto:pgsql-general-owner (AT) postgresql (DOT) org] On Behalf Of Thomas Hallgren
Sent: Wednesday, October 27, 2004 11:16 AM
To: pgsql-general (AT) postgresql (DOT) org
Subject: Re: [GENERAL] Reasoning behind process instead of thread based

nd02tsk (AT) student (DOT) hig.se wrote:

Two: If a single process in a multi-process application crashes, that process alone dies. The buffer is flushed, and all the other child processes continue happily along. In a multi-threaded environment, when one thread dies, they all die.

So this means that if a single connection thread dies in MySQL, all connections die? Seems rather serious. I am doubtful that is how they have implemented it.

That all depends on how you define a crash. If a thread causes an unhandled signal to be raised, such as an illegal memory access or a floating point exception, the process will die, hence killing all threads. But a more advanced multi-threaded environment will install handlers for such signals that will handle the error gracefully. It might not even be necessary to kill the offending thread. Some conditions are harder to handle than others, such as stack overflow and out of memory, but it can be done.

So to state that multi-threaded environments in general kill all threads when one thread crashes is not true. Having said that, I have no clue as to how advanced MySQL is in this respect.
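The claim that a crashed process dies alone can be illustrated with a small sketch. It is not taken from PostgreSQL or MySQL; it assumes a POSIX-like system, uses Python for brevity, and the names `run_children`, `CRASHING_CHILD` and `HEALTHY_CHILD` are hypothetical. One child aborts itself while a sibling is still running, and the parent survives to collect both exit statuses.

```python
import subprocess
import sys

# Hypothetical child programs: one aborts itself (SIGABRT on POSIX),
# the other simulates a short query and finishes normally.
CRASHING_CHILD = "import os; os.abort()"
HEALTHY_CHILD = "import time; time.sleep(0.2); print('query done')"


def run_children():
    """Run both children as separate OS processes and report how each ended."""
    healthy = subprocess.Popen(
        [sys.executable, "-c", HEALTHY_CHILD],
        stdout=subprocess.PIPE, text=True,
    )
    crashing = subprocess.Popen(
        [sys.executable, "-c", CRASHING_CHILD],
        stderr=subprocess.DEVNULL,
    )
    # The crashing child dies while the healthy one is still sleeping.
    crash_status = crashing.wait()
    healthy_output, _ = healthy.communicate()
    return crash_status, healthy.returncode, healthy_output.strip()


crash_status, healthy_status, healthy_output = run_children()
# The parent and the sibling were unaffected by the abnormal exit.
print("crashed child status:", crash_status)
print("healthy child status:", healthy_status, "output:", healthy_output)
```

A non-zero status for the first child together with a clean exit for the second is what the multi-process argument predicts; it says nothing about how a threaded server with signal handlers would behave.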
#2
There are clear advantages to separate process space for servers.

1. Separate threads can stomp on each other's memory space (e.g. imagine a wild, home-brew C function gone bad).

2. Separate processes can have separate user ids, and [hence] different rights for file access. A threaded server will have to either be started at the level of the highest user who will attach or will have to impersonate the users in threads. Impersonation is very difficult to make portable.

3. Separate processes die when they finish, releasing all resources to the operating system. Imagine a threaded server with a teeny-tiny memory leak that stays up 24x7. Eventually, you will start using disk for RAM, or even use all available disk and simply crash.

Sure, but a memory leak is a serious bug and most leaks will have a
|
Threaded servers have one main advantage: Threads are lightweight processes and starting a new thread is faster than starting a new executable.

A few more off the top of my head:
#3
Threaded servers have one main advantage: Threads are lightweight processes and starting a new thread is faster than starting a new executable.

A few more off the top of my head:
|
1. Threads communicate much faster than processes (applies to locking and parallel query processing).

2. All threads in a process can share a common set of optimized query plans.

3. All threads can share lots of data cached in memory (static but frequently accessed tables etc.).

4. In environments built using garbage collection, all threads can share the same heap of garbage-collected data.

5. A multi-threaded system can apply in-memory heuristics for self-adjusting heaps and other optimizations.

6. And lastly, my favorite: a multi-threaded system can be easily integrated with, and make full use of, a multi-threaded virtual execution environment such as a Java VM.
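As a rough illustration of point 2 in the list above, here is a minimal sketch of a plan cache shared by session threads. Everything in it is an assumption made for illustration: `PlanCache`, `get_plan` and `run_sessions` are hypothetical names, and the "plan" is just a tuple standing in for real planner output.

```python
import threading


class PlanCache:
    """Hypothetical in-process cache visible to every session thread."""

    def __init__(self):
        self._plans = {}
        self._lock = threading.Lock()
        self.plans_built = 0

    def get_plan(self, sql):
        # A plain thread mutex is enough because all sessions share one heap.
        with self._lock:
            plan = self._plans.get(sql)
            if plan is None:
                plan = ("seq_scan", sql)  # stand-in for real query planning
                self._plans[sql] = plan
                self.plans_built += 1
            return plan


def run_sessions(cache, sql, n_threads=8):
    """Start n_threads session threads that all ask for the same plan."""
    results = []
    results_lock = threading.Lock()

    def session():
        plan = cache.get_plan(sql)
        with results_lock:
            results.append(plan)

    threads = [threading.Thread(target=session) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


cache = PlanCache()
plans = run_sessions(cache, "SELECT * FROM t WHERE id = $1")
print("plans built:", cache.plans_built, "sessions served:", len(plans))
```

Every session receives the very same plan object, which is the sharing the list item describes; doing the same across processes would require placing the plans in shared memory.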
|
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. |
#4
A lot of these advantages are due to sharing an address space, right? Well, the processes in PostgreSQL share address space, just not *all* of it. They communicate via this shared memory.

Which is a different beast altogether. The inter-process mutex handling

2. All threads in a process can share a common set of optimized query plans.

PostgreSQL could do this too, but I don't think anyone's looked into sharing query plans, probably quite difficult.

Perhaps. It depends on the design. If the plans are immutable once

Table data is already shared. If two backends are manipulating the same table, they can lock directly via shared memory rather than some OS primitive.

Sure, some functionality can be achieved using shared memory. But it
|
I think PostgreSQL has nicely combined the benefits of shared memory with the robustness of multiple processes... |
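To make "separate processes that share only part of their address space" concrete, here is a minimal sketch using a file-backed memory mapping. It is not PostgreSQL's mechanism; the helper name `shared_counter_demo` and the 16-byte region are assumptions for illustration. The children run one after another, so no inter-process lock is needed here; concurrent writers would need exactly the kind of synchronization debated in this thread.

```python
import mmap
import os
import subprocess
import sys
import tempfile

# Hypothetical child program: map the same file and increment byte 0.
CHILD = """
import mmap, sys
with open(sys.argv[1], "r+b") as f:
    with mmap.mmap(f.fileno(), 16) as shared:
        shared[0] = shared[0] + 1
"""


def shared_counter_demo(n_children=3):
    """Let n_children separate processes update one shared counter byte."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\x00" * 16)  # size the region before mapping it
        with mmap.mmap(fd, 16) as shared:
            for _ in range(n_children):
                # Each child is its own process with its own private heap;
                # only the mapped region is common to all of them.
                subprocess.run([sys.executable, "-c", CHILD, path], check=True)
            return shared[0]
    finally:
        os.close(fd)
        os.unlink(path)


print("counter after children ran:", shared_counter_demo())
```

The parent sees every child's update through the mapping, while a crash in any child would leave the parent's private memory untouched.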
#5
Martijn van Oosterhout wrote:

A lot of these advantages are due to sharing an address space, right? Well, the processes in PostgreSQL share address space, just not *all* of it. They communicate via this shared memory.

Which is a different beast altogether. The inter-process mutex handling that you need to synchronize shared memory access is much more expensive than the mechanisms used to synchronize threads.
#6
Marco, Wouldn't locking a process to a CPU cause the CPU to be idle during IO waits and semaphore locks? Even if you didn't lock each DB process to a CPU, IO waits and locks for one session would stop processing on other sessions managed by the same process. Moving the scheduler to user space seems to be reimplementing something the kernel knows best about. Ever worked with Ada tasking architectures? Not pretty. |
#7
On Thu, 28 Oct 2004 Richard_D_Levine (AT) raytheon (DOT) com wrote:

Marco, Wouldn't locking a process to a CPU cause the CPU to be idle during IO waits and semaphore locks? Even if you didn't lock each DB process to a CPU, IO waits and locks for one session would stop processing on other sessions managed by the same process. Moving the scheduler to user space seems to be reimplementing something the kernel knows best about. Ever worked with Ada tasking architectures? Not pretty.

Quick answers:
- there won't be any I/O wait;
- there won't be any semaphore-related wait;
- in my previous message, I've never mentioned the kernel scheduler;
- no, the kernel knows nothing about PostgreSQL sessions.

It seems quite obvious to me that in the "one flow of execution per CPU" model, all I/O is non-blocking. Everything is event-driven.
#8
I honestly don't think you could really do a much better job of scheduling than the kernel. The kernel has a much better idea of what processes are waiting on, and more importantly, what other work is happening on the same machine that also needs CPU time.

I agree 100% with Martijn. Below is a reply that I sent to Marco some
|
You ask what an event is? An event can be:
- input from a connection (usually a new query);
- notification that I/O needed by a pending query has completed;
- if we don't want a single query to starve the server, an alarm of some kind (I think this is a corner case, but still possible);
- something else I haven't thought about.
|
At any given moment, there are many pending queries. Most of them will be waiting for I/O to complete. That's how the server handles concurrent users. |
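The event-driven model described here can be sketched with a readiness-based loop. This is only an illustration under stated assumptions: `serve_pending_queries` is a hypothetical name, socket pairs stand in for client connections, and "executing" a query is reduced to upper-casing its text. A single flow of execution waits for whichever session has input ready and never blocks on any one of them.

```python
import selectors
import socket


def serve_pending_queries(queries):
    """Handle every session from one loop, reacting only to ready input."""
    sel = selectors.DefaultSelector()
    clients = []
    for session_id, text in queries.items():
        client, server = socket.socketpair()
        server.setblocking(False)
        # Event type 1 from the list above: input from a connection.
        sel.register(server, selectors.EVENT_READ, data=session_id)
        client.sendall(text.encode())
        clients.append(client)

    results = {}
    while len(results) < len(queries):
        events = sel.select(timeout=1.0)
        if not events:
            break  # nothing became ready; give up rather than spin forever
        for key, _mask in events:
            payload = key.fileobj.recv(4096)
            results[key.data] = payload.decode().upper()  # stand-in for query work
            sel.unregister(key.fileobj)
            key.fileobj.close()

    for client in clients:
        client.close()
    sel.close()
    return results


print(serve_pending_queries({"session-1": "select 1", "session-2": "select 2"}))
```

I/O completion notifications and alarms would be further event sources registered with the same loop; they are left out to keep the sketch short.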
|
(*) They're oriented to general purpose processes. Think of how CPU usage affects relative priorities. In a DB context, there may be other criteria of greater significance. Roughly speaking, the larger the part of the data a single session holds locked, the sooner it should be completed. The kernel has no knowledge of this. To the kernel, "big" processes are those that are using a lot of CPU. And the policy is to slow them down. To a DB, "big" queries are those that force the most serialization ("lock a lot"), and they should be completed as soon as possible.
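The scheduling criterion suggested above, finishing first whatever holds the most locks, can be sketched as a priority queue. The function name `completion_order` and the lock counts are hypothetical; a real server would have to weigh many other factors.

```python
import heapq


def completion_order(sessions):
    """Given (name, locks_held) pairs, return names with most locks first."""
    # heapq is a min-heap, so negate the lock count; the index breaks ties
    # in arrival order.
    heap = [(-locks, index, name) for index, (name, locks) in enumerate(sessions)]
    heapq.heapify(heap)
    order = []
    while heap:
        _neg_locks, _index, name = heapq.heappop(heap)
        order.append(name)
    return order


print(completion_order([("report", 2), ("bulk_update", 40), ("lookup", 0)]))
# -> ['bulk_update', 'report', 'lookup']
```

A CPU-usage-based kernel policy would have no way to produce this ordering, since lock counts are visible only inside the database.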
#9
1. non-blocking is nice, but lots of OSes (eg POSIX) don't support it on disk I/O unless you use a completely different interface. |
|
2. If one of your 'processes' decides to do work for half an hour (say, a really big merge sort), you're stuck. |
|
I honestly don't think you could really do a much better job of scheduling than the kernel. |
#10
I don't see the big difference between what Marco is suggesting and user threads -- or to be more precise, I think user threads and event-based programming are just two sides of the same coin. A user thread just represents the state of a computation -- say, a register context and some stack. It is exactly that *state* that is passed to a callback function in the event-based model. The only difference is that with user threads the system manages context for you, whereas the event-based model lets the programmer manage it. Which model is better is difficult to say. |
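The "two sides of the same coin" point can be shown with a toy computation written both ways. This is a sketch with hypothetical names: in the generator version the runtime preserves the local state across each suspension, much like a user thread, while in the callback version the programmer threads the state through by hand.

```python
def sum_as_user_thread(values):
    """User-thread style: locals survive each suspension automatically."""
    total = 0
    for value in values:
        total += value
        yield  # suspension point; 'total' lives on in the generator frame
    return total


def run_user_thread(gen):
    """Resume the generator until it finishes and return its result."""
    try:
        while True:
            next(gen)
    except StopIteration as stop:
        return stop.value


def on_value(state, value):
    """Event style: the state is passed in and handed back explicitly."""
    return {"total": state["total"] + value}


def run_callbacks(values):
    """Deliver each value as an event to the callback, carrying the state."""
    state = {"total": 0}
    for value in values:
        state = on_value(state, value)
    return state["total"]


print(run_user_thread(sum_as_user_thread([1, 2, 3])), run_callbacks([1, 2, 3]))
```

Both versions compute the same result; they differ only in who is responsible for keeping the context between steps.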
|
Martijn van Oosterhout wrote:

1. non-blocking is nice, but lots of OSes (eg POSIX) don't support it on disk I/O unless you use a completely different interface.

We could implement I/O via something like POSIX AIO or a pool of worker threads that do the actual I/O in a synchronous fashion. But yeah, either way it's a major change.

2. If one of your 'processes' decides to do work for half an hour (say, a really big merge sort), you're stuck.

It would be relatively easy to insert yield points into the code to prevent this from occurring. However, preemptive scheduling would come in handy when running "foreign" code (e.g. user-defined functions in C).

I honestly don't think you could really do a much better job of scheduling than the kernel.

I think we could do better than the kernel by taking advantage of domain-specific knowledge; I'm just not sure we could beat the kernel by enough to make this worth doing.

BTW, I think this thread is really interesting -- certainly more informative than a rehash of the usual "processes vs. threads" debate.
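The worker-pool option mentioned in the reply to point 1 can be sketched as follows. This is an illustration, not a proposal for PostgreSQL's code: `blocking_read`, `read_all` and `demo` are hypothetical names, and Python's asyncio stands in for the event loop. The loop itself never blocks; a small pool of threads performs ordinary synchronous reads and hands the results back as events.

```python
import asyncio
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def blocking_read(path):
    """Ordinary synchronous disk read, executed on a worker thread."""
    with open(path, "rb") as f:
        return f.read()


async def read_all(paths, pool):
    """Dispatch each read to the pool; the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    pending = [loop.run_in_executor(pool, blocking_read, p) for p in paths]
    return await asyncio.gather(*pending)  # results come back in input order


def demo():
    """Write three small files, then read them back through the pool."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, f"page{i}.dat")
            with open(path, "wb") as f:
                f.write(f"page {i}".encode())
            paths.append(path)
        with ThreadPoolExecutor(max_workers=2) as pool:
            return asyncio.run(read_all(paths, pool))


print(demo())
```

The trade-off is the one noted in the post: the threads reintroduce some of the machinery the event-driven design set out to avoid, but only for the I/O path.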