py: Store bytecode arg names in bytecode (were in own array).

This saves a lot of RAM for 2 reasons:

1. For functions that don't have default values, var args or var kw
args (which is a large number of functions in the general case), the
mp_obj_fun_bc_t type now fits in 1 GC block (previously it needed 2
because of the extra pointer to the arg_names array).  So this saves 16
bytes per function (32 bytes on 64-bit machines).

2. Combining separate memory regions generally saves RAM because every
GC allocation is rounded up to a whole number of blocks.  Merging two
regions removes the unused bytes at the end of one of them (since that
region no longer exists on its own).  So generally this saves 8 bytes
per function.
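
The arithmetic behind both points can be sketched as below.  This is an
illustration only, not code from this commit: it assumes a GC block of
4 machine words (which matches the 16/32 byte figures above), and the
helper names and the word counts used in the usage note are
hypothetical.

```c
#include <stddef.h>

/* Assumption: one GC block holds 4 machine words, i.e. 16 bytes on a
   32-bit machine and 32 bytes on a 64-bit machine. */
#define WORDS_PER_BLOCK 4

/* Number of GC blocks needed for an object of n_words words.
   Allocations are rounded up to a whole number of blocks. */
static size_t blocks_needed(size_t n_words) {
    return (n_words + WORDS_PER_BLOCK - 1) / WORDS_PER_BLOCK;
}

/* Bytes actually taken from the heap for an object of n_words words,
   given the machine word size in bytes. */
static size_t bytes_allocated(size_t n_words, size_t word_size) {
    return blocks_needed(n_words) * WORDS_PER_BLOCK * word_size;
}

/* Blocks saved by storing two regions of a_words and b_words as one
   allocation instead of two separate ones. */
static size_t blocks_saved_by_merging(size_t a_words, size_t b_words) {
    return blocks_needed(a_words) + blocks_needed(b_words)
        - blocks_needed(a_words + b_words);
}
```

Under these assumptions, an object that shrinks from 5 words to 4 words
goes from 2 blocks to 1: bytes_allocated(5, 4) - bytes_allocated(4, 4)
is 16, and the same call with a word size of 8 gives 32.  Merging does
not always help: blocks_saved_by_merging(5, 2) is 1, but
blocks_saved_by_merging(6, 3) is 0.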

Tested by importing lots of modules:

- 64-bit Linux gave about an 8% RAM saving for 86k of used RAM.
- pyboard gave about a 6% RAM saving for 31k of used RAM.
diff --git a/py/showbc.c b/py/showbc.c
index 28fed14..13d257d 100644
--- a/py/showbc.c
+++ b/py/showbc.c
@@ -57,7 +57,7 @@
     ip += sizeof(mp_uint_t); \
 } while (0)
 
-void mp_bytecode_print(const void *descr, const byte *ip, mp_uint_t len) {
+void mp_bytecode_print(const void *descr, mp_uint_t n_total_args, const byte *ip, mp_uint_t len) {
     const byte *ip_start = ip;
 
     // get code info size
@@ -80,6 +80,14 @@
     }
     printf("\n");
 
+    // bytecode prelude: arg names (as qstr objects)
+    printf("arg names:");
+    for (int i = 0; i < n_total_args; i++) {
+        printf(" %s", qstr_str(MP_OBJ_QSTR_VALUE(*(mp_obj_t*)ip)));
+        ip += sizeof(mp_obj_t);
+    }
+    printf("\n");
+
     // bytecode prelude: state size and exception stack size; 16 bit uints
     {
         uint n_state = mp_decode_uint(&ip);