Dynamic loading is usually done (as on Linux) by user-mode code that runs before main() - it's typically not a kernel function any more.
The linker builds an import table, saved in the executable file, which lists the names and versions of the relocatable modules (shared libraries) the program needs.
When the program starts, the dynamic loader (which is itself often a shared library, brought in by specialised load code) searches for the shared library files containing the required code. For each such file, it allocates memory segments and reads or maps the code and initialised data into memory, relocating it if necessary (to whatever virtual address was allocated). Depending on the architecture, it may also build a jump table for the exported symbols. On some systems the operating system reserves a single virtual address range for shared libraries, so all programs that share the same code see it at the same virtual address. On others, particularly where libraries are built as position-independent code and load addresses are randomised, each process may see the same library at a different address.
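To make the steps above concrete, here is a minimal toy model in Python. It is only a sketch, not how any real loader is written: the `Library` class, the `load_imports` function, the base address 0x10000 and the 0x1000 alignment are all invented for illustration, and a dictionary stands in for the library search path.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Toy shared library (hypothetical): a size plus exported symbol offsets."""
    name: str
    size: int
    exports: dict  # symbol name -> offset from the start of the library


def load_imports(imports, available, first_base=0x10000, align=0x1000):
    """Give each needed library a base address and build a jump table.

    imports:   list of (library name, symbol name) pairs, like an import table
    available: dict of library name -> Library, standing in for the search path
    Returns (bases, jump_table); jump_table maps symbol -> absolute address.
    """
    bases = {}
    jump_table = {}
    next_base = first_base
    for lib_name, symbol in imports:
        if lib_name not in available:
            raise FileNotFoundError(f"cannot find library {lib_name!r}")
        lib = available[lib_name]
        if lib_name not in bases:
            # First use of this library: "map" it at the next free address,
            # then round its size up to the alignment to find the next slot.
            bases[lib_name] = next_base
            next_base += (lib.size + align - 1) // align * align
        if symbol not in lib.exports:
            raise LookupError(f"{lib_name!r} does not export {symbol!r}")
        # Relocate: exported offset plus wherever the library actually landed.
        jump_table[symbol] = bases[lib_name] + lib.exports[symbol]
    return bases, jump_table


if __name__ == "__main__":
    libs = {
        "libm": Library("libm", 0x1800, {"sin": 0x100, "cos": 0x200}),
        "libc": Library("libc", 0x3000, {"puts": 0x40}),
    }
    bases, table = load_imports(
        [("libm", "sin"), ("libc", "puts"), ("libm", "cos")], libs
    )
    for symbol, address in table.items():
        print(f"{symbol}: {address:#x}")
```

A real loader maps file contents rather than tracking numbers in a dictionary, but the bookkeeping (find the library, pick a base, add the base to each exported offset) has the same shape.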
In short, you can do this too. If you don't need to load shared libraries, then you just need to find the sizes and alignment requirements of the code and data sections in the loadable module, allocate memory for each section, read in the code or data, and then read through the relocations section of the executable file, applying fixups to adjust for the actual addresses of the memory you allocated.
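As a rough illustration of that procedure, here is a small Python sketch. The image format is made up (it is not ELF or any real format): each section is a byte string with an alignment, and each relocation names a 32-bit little-endian word that holds an offset into a target section. `BumpAllocator` and `load_image` are hypothetical names, and the "memory" is simulated with bytearrays.

```python
import struct


class BumpAllocator:
    """Hypothetical allocator: hands out aligned addresses in increasing order."""

    def __init__(self, start):
        self.next = start

    def __call__(self, size, align):
        address = (self.next + align - 1) // align * align  # round up to align
        self.next = address + size
        return address


def load_image(sections, relocations, allocate):
    """Load a toy relocatable image and apply its fixups.

    sections:    dict of name -> (bytes, alignment)
    relocations: list of (section, offset, target_section); the 32-bit
                 little-endian word at `offset` in `section` holds an offset
                 into `target_section` and must become an absolute address
    allocate:    callable(size, alignment) -> address
    Returns (base, memory): the address chosen for each section and its
    contents after relocation.
    """
    base = {}
    memory = {}
    # Step 1: allocate memory for each section and read in its contents.
    for name, (data, align) in sections.items():
        base[name] = allocate(len(data), align)
        memory[name] = bytearray(data)
    # Step 2: walk the relocations, adding the target section's real address.
    for section, offset, target in relocations:
        (addend,) = struct.unpack_from("<I", memory[section], offset)
        fixed = (base[target] + addend) & 0xFFFFFFFF
        struct.pack_into("<I", memory[section], offset, fixed)
    return base, memory


if __name__ == "__main__":
    sections = {
        "text": (b"\x00" * 4 + struct.pack("<I", 8), 16),  # word refers to data+8
        "data": (b"\x00" * 12, 8),
    }
    base, memory = load_image(sections, [("text", 4, "data")], BumpAllocator(0x1001))
    (patched,) = struct.unpack_from("<I", memory["text"], 4)
    print(f"text at {base['text']:#x}, data at {base['data']:#x}, word = {patched:#x}")
```

Real formats have several relocation types (absolute, PC-relative, and so on) and different word sizes, so treat this as the simplest possible case.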
If you have no idea of the content of a relocatable image file, I'd suggest you start poking around one with the binutils tools (objdump, readelf, nm). You have a steep learning curve ahead of you, but it can be surmounted.
Clifford Heath.