Building a flexible WCMS ---------- Continued from TableQuantityVersusAppSize: [I don't see the duplication the above is intended to fix. It appears to be akin to noting that fields called "Description" show up frequently, (mis)regarding that as duplication, and trying to merge all "Description" data into a single table. That, of course, would be ridiculous. The above seems similar.] It's not just one column, but sets of columns that are very similar. But again, in accounting we may know that the duplication is not a problem because it hasn't been for decades doing it that way. But every domain is different. I'm currently working on '''a WCMS-like gizmo that I wish I used a GodTable instead of partition things into "kinds" of content or containers'''. I didn't want it to look odd to other maintainers, so I did it the old-fashioned way. It turned out to be too inflexible. How the info is displayed varies depending on where it's used. My hard-wiring of categories (tables) is already kicking my ass in terms of change difficulty. I have to write duplicate code to display A as an X and B as an X, whereas if I used a GodTable, everything would already be an X (or viewable as an X) without reinventing the damned wheel for each damned table. [The apparent need for a GodTable often indicates a real need for type-subtype relationships, further normalisation instead of denormalisation, VIEWs to encapsulate UNION queries, or all three.] What sub-type? And UNION queries are ugly. A given content object may be a "page" in one context, a "link item" with a synopsis (showing it) and optional thumbnail in another, a simple link title in a list in another, both a link item and a page in another, a medium-sized panel in another, a slide-show item in another, etc. I originally tried to separate the synopsis or title link from the "page", but one manager balked: they wanted "one-stop-shopping" for some users (newbie web authors) and thought separating them was too confusing for their group's target users. (Other authors may opt to separate them, but it became clear that both approaches should be supported.) [You asked "what sub-type?"... It would appear that everything is a sub-type of something which you yourself called a "content object". As for UNION queries being "ugly", I'm not yet convinced that such an aesthetically-based personal opinion about a certain operator in a certain query language should have any impact on relational schema design.] The GodTable generally would resemble: * ID - Auto-key * Type - "Summary [link], combo [sum & page], page, image-attachment, etc. May not be needed. * Title * Synopsis * Image - URL, usually used as a thumbnail, but not always * URL - Generally would not be used if Text given * Text - Such as page text * authorDate - Date written or published * Author (or author ID) * parentID - Can reference the parent page or container box or column * (Other doo-dads) We could use the same table for image "attachments" to web pages and the Synopsis can serve as the image caption. Of course, we often want to shut some of these off for display. Since a given object may be used in multiple contexts, we don't want to shut them off for data entry also. A warning may be in order, such as "Field X is not recognized by the current context or container". In other words, we need "soft validation". ''Why do you think the above is a GodTable? Assuming all attributes are functionally dependent on ID and assuming there are no transitive dependencies, it is in (at least) 3NF and therefore acceptable. My interpretation of GodTable is that it is always < 3NF, having multiple entity types in one table. The above, as a GodTable, might be:'' * ID - Auto-key * Type - "Summary [link], combo [sum & page], page, image-attachment, etc. May not be needed. * T''''''ypeDescription * T''''''ypeStatus - "Available, Obsolete" * Title * Synopsis * Image - URL, usually used as a thumbnail, but not always * I''''''mageType * URL - Generally would not be used if Text given * Text1 - page text * Text2 - if we need more text * Text3 - and even more text * Text4 - etc * Text5 - etc * authorDate * Author * A''''''uthorName * A''''''uthorPhoneNumber * A''''''uthorEmail * A''''''uthorAddress * A''''''uthorTrainingRequirements * parentID - Can reference the parent page or container box or column * (Other doo-dads) "Entity" is too vague. Originally it appeared to be 3 entities: "page", "container" (section or "box" within a page), and "link item" within sections. But, putting it into practice and receiving new requests (and complaints) showed this classification was too rigid. It's not even hierarchical anymore. User/author appears to still deserve its own entity so far and that's not a problem. (There are other "organizational" tables that are not being discussed here because they are probably too org-specific.) Experience tests our designs, and parts pass and parts fail. Looking at other WCMS, it seems they got too "entity happy" also, and are too rigid and/or too complex as a result. A WCMS should be a "dynamic graph manager" and not pre-classify things. Note that there can be multiple authors of a given "article". This would imply a many-to-many table to associate authors to articles. However, in the name of KISS, I'd lean toward a textual free-form "authors" field, perhaps pre-populated with the user's name. Only in an org/group where authorship tracking becomes very specific and critical would such be justified. We may get a few complaints when people get married or unmarried and their name changes, but we have to weigh that against the extra UI steps to have an "associator" screen set. As far as the definition of GodTable, it's essentially a "content" GodTable, not an "everything" GodTable. It's god of a star system, not the entire universe. Perhaps "stem content object" may be a better name. ''You haven't answered my question, which was "why do you think the above is a GodTable? What do you consider to be a GodTable?"'' * I thought I did address that. Anyhow, I don't want to get into a definition war at this point, and let's just focus on the specific example for now. ''Why do you think the term "entity" is too vague? It's a conventional term used in familiar database design techniques like "entity/relationship diagram".'' It is vague in that it does not provide an objective "slot" or dividing algorithm in many cases. ''Fortunately, the process of normalisation makes it a non-issue. You don't need to worry about "entities", as long as functional dependencies can be identified. That is normally straightforward.'' No, it's not. If you believe it is, then write down the clear and objective algorithm without any vague/ambiguous words, and we'll test it out. And please, no IknowItWhenIseeIt. I'll puke of I see that again. It's rarely helpful. The existing definitions out there have too many wiggle-words like "dependent on" and "related to", which are generally continuous concepts in normal English, not Boolean. * ''Finding functional dependencies in a given (or proposed) system is a simple and mechanical identification of relationships between data-element values in that system. It is not reliant on "wiggle-words". For example, in a typical scenario "client age" is functionally dependent on "client ID", because given the client's ID we can know the client's age. "Client age" is not functionally dependent on "client ID", because given a client's age we can't know his/her client's ID. From understanding the functional dependencies between data elements -- the determination of which is a mechanical process, sometimes assisted with common sense in order to save time and effort -- we can algorithmically derive a normalised schema. The process is well-known. Surely you don't need me to reiterate it here?'' * You are picking the uncontroversial low-hanging fruit examples. Yes, those are the easy decisions. But because a majority of data elements can be found a strait-forward home does not mean 100% of them can. This should go without saying. If you truly believe the entire process is strait-forward, then write down the clear and objective algorithm without any vague/ambiguous words, and we'll test it out. * ''There are indeed occasional challenging examples. They are rare, and generally due to a lack of information. E.g., "Q: What's this SQB2 column on the SCRAM report? In all the samples I've got, it's blank. A: Dunno, but Pete needs it and he's on secondment until next year." The algorithm for normalisation is well-known. You can find it in any textbook like AnIntroductionToDatabaseSystems.'' * YES! The work world is often murky! We often don't have perfect domain information by far. DilbertIsNoJoke and/or we are often planning during changes in the environment. That's where we have to weigh judgement calls based on forecasted cost, risk, probability, etc. * ''Of course. That's a primary justification for AgileMethodologies. It is not a justification for avoiding normalisation or deliberately creating poorly-normalised GodTable''''''s.'' * If poor normalization better prepares a system for the future, then it may be worth it. ItDepends. That's where I run a SimulationOfTheFuture to analyze the potential paths. As far as the value of AgileMethodologies, that's probably another topic. In this particular case, I would have been better off if I used a GodTable (or something closer to it). My instincts were leaning that way early in the design stage because the customer has something akin to A.D.D. and wouldn't commit to specifics. But I went against my instincts because I didn't wish to confuse other developers. Now they are going to be confused anyhow because of the work-arounds and structural duplication to add the additional features. * ''Can you show an example where poor normalization better prepares a system for the future based on a SimulationOfTheFuture, that doesn't result in update anomalies in the present? I disagree that you needed a GodTable, or at least a poorly-normalised GodTable. (I do not equate GodTable, as some do, with "lots of columns" for some undefined value of "lots".) It appeared you needed a type-subtype hierarchy.'' * My best guess as to what they needed did indeed appear to be a hierarchy (page -> container -> content-item). But I did "feel" that the customer was fuzzy enough that perhaps I needed a more dynamic approach and shouldn't commit to a hierarchy and fixed "types". '''A graph with run-time "morph-able" nodes''' is more flexible than the type/container hierarchy. But the reason we don't make EVERY data structure like that in order to get the flexibility is partly for performance reasons, and partly because it's more difficult to get the "structural integrity" of type-safety and inherent protections provided by hierarchies (it can't "leak" into graph-ness due to referential integrity etc.). Graphs are also harder to grok than hierarchies in general, both design-wise and data-study-wise. So one is weighing cheaper/easier "protection" and grokability against future flexibility. If my initial design was sufficiently "correct" over time, then in hindsight it would have been the "right choice". To make the right choice, one has to have the best possible SimulationOfTheFuture, which includes considering the amount of wildly unpredictable future change requests. * If one knows the future is likely to be wildly unpredictable, then betting on a graph with run-time morph-able nodes appears to be a rational choice. * ''I remember when that was an argument against the RelationalModel in favour of CODASYL. It's reappeared more recently as an argument against the RelationalModel in favour of XML.'' * Different parts of the application need different "flavors" of DB, and some apps may need to be more "graphy" and/or dynamic than others. Something like MultiParadigmDatabase and/or DynamicRelational would better serve apps that tend to straddle both in my opinion. We still have the "authors" tables and some other organization-related tables that appear fairly stable, although some of our permission needs are leaning toward the "graphy" approach. Tools like ParagraphWiki's and WCMS are probably better off with a dynamic approach (until the time somebody discovers a "stable" old-fashioned RDBMS schema for them). One size doesn't fit all. * I would note that predicate-centric querying doesn't seem to be a problem with the system so far, and thus graph-centric query constructs (such as node traversal) are not a big need so far. Most of the querying is specific to a view (output) that a given page or section "kind" needs to display. A set-based classification along with org structuring info is sufficient at this time. Example: "Display at spot X all content nodes belonging to category "news" or "urgent" and not "archived" for the current location (department); and if there is no graphic/icon in the node, use default file news_icon_4.jpg." The '''flexibility bottleneck is mostly on the "schema" side, NOT the query language''' side. (Although, some graph-oriented "functions" would help out the query side.) - t {Why do you keep conflating "defined" with "decidable"?} Decidable by subjective factors? That's not the ideal situation by far, especially when trying to form good design guidelines. I tend toward the largest table(s) row-wise and column-wise without creating data duplication, or at least balancing to get the "the least evil" of two or more duplication paths, and without excessive empty/null cells. Of course, "excessive" will probably be a point of contention. In a table like the above, there may be a lot of empty cells. However, if they don't form a consistent pattern, then it may be fine to leave them blank. A consistent pattern would be something like a group of columns that are always blank when a certain condition is true. If we can identify such, then we may have a candidate for an "entity split". For example, if there are say 5 columns devoted to images above, and these columns are always empty if there is no image for a web content object, then we can consider creating a separate entity "Images". But if usage of images always matches to the object rather than having some or all images independent of web content objects, then it may be fine to not split. I'd lean toward "don't split unless you see a likely need now or the near future" because it would probably lead to: * Fewer joins * Fewer tables to mentally grok and sift in listings * More efficient queries machine-wise An exception may be if images are rare compared to other content; then the carrying costs of all those empty cells is not worth it in terms of mind and machine space. {Your response doesn't appear to have much to do with what you responded to.} Ask better questions. ''That looked like a very good question. Why '''do''' you keep conflating "defined" with "decidable"?'' "Decidable" says nothing of the quality of the answer. Unless, you mean you "know" the answer is "determinable" in an objective way, but cannot put the determination technique down on paper (without vague steps/parts) for public inspection. In that case it's equivalent to "undefined" for practical purposes and is ArgumentFromAuthority. * {According to what you just said, "function that always returns 0" is "'undefined' for practical purposes". I find that odd to say the least.} ''That doesn't answer the question.'' Sorry, I tried my best. I guess I don't know you are getting at. You write and think oddly. It should be simple: if there are clear-cut objective rules then convert the rules to a computer-executable algorithm (or close to) and show us it else admit there is are not clear-cut objective rules endif ''You appear to erroneously equate "can be decided with human reason" with "can be identified using current computer technology".'' That sounds like a geeky attempt to legitimize ArgumentFromAuthority. "The proper algorithm is in my head, and it just happens you can't see/test most of it." * {No, it's the realization that there's a difference between "everything either meets or doesn't meet a definition" and "we can (given enough resources) always tell a given thing meets or doesn't meet a definition".} * The second is nearly impossible to independently verify as being consistent. IN PRACTICE it's ArgumentFromAuthority. Your statement is at best a UselessTruth. * {UselessTruth? It has rather wide reaching implications. Compiler writers run into it on a regular basis. It has consequences in NumberTheory. That truth is used on a regular basis, so how could it possibly be a UselessTruth? (BTW, your first sentence is incoherent, and there's no ArgumentFromAuthority involved since that requires an appeal to what an expert says. There's no such appeal being made.)} * Can you describe an issue a compiler writer may encounter that is sufficiently relevant and/or comparable to normalization "rules"? * {Why the extraneous conditions?} * Let me try to simplify this. If somebody claims a definition is "objectively clear", my test for that is to have it be put into a form that a computer can execute in a sufficient time-span for "lab" analysis. I don't give a shit about the "determinator" in your head. Whether I "should" accept your head alone as evidence, I'll leave to another topic. I want to see a computerizable algorithm before I'll accept your claim. (In my opinion, if you cannot turn it into an algorithm, you are probably either in the wrong business, or wrong about it's objectively.) * {My point is that when you do this, you are conflating "decidable" and "defined". Is there something unclear about "function that always returns 0"? There isn't an algorithm for it. (Actually, you've gotten even more restrictive since you require that it terminates in relatively short time span. According to this idea, "optimal solution to the KnapsackProblem" is also unclear.)} * In a sense, it ''is'' "unclear" because it's very difficult for one to practically verify a given solution candidate is "optimal". BUT, it is possible to define a criteria for comparing two candidate solutions and determining which is "better" using clear criteria that can be done without a Cray etc. * But, that kind of TravelingSalesmanProblem-like problem is in some ways not comparable to normalization. For example, we can make an automated system that can identify the fact that column-B is always the value of "red" when column-A is "7". (I once was contracted to analyze globs of data in such a way to help clean things up.) But in the real world you'll often get say a 97% match, and then have to ask questions such as whether the 3% are data entry errors, or something else in the domain is in play. Sometimes people don't know: "The guy who did it left and didn't tell us." And then there is planning for likely, possible, and rare future changes. Maybe what we need is the field of "'''probabilistic normalization'''", but something tells me it will resemble economics and investment analysis, bringing us right back to our old squabbling ground. -t * { What's wrong with a proof for that (optimal solution to the KnapsackProblem) particular case? How does having an algorithm help since it could give the wrong answer? } * I am not understanding the question. We are looking for useful tools and ways to communicate trade-offs with customers, not solve lab-toy problems just to say "we can". We want a clear, verifiable, and calculatable way to get the best table design possible with existing or realistic resources. * A possibly simpler step in the right direction would be a way to clearly say that Design A is better than Design B. Let's tackle that problem first. (We'll need such a tool anyhow to check a final "big" solution.) * { You said that the definition of "optimal solution to the KnapsackProblem" was unclear because you need to verify that a candidate solution is an optimal solution. I want to know why you think a proof for that particular case wouldn't be sufficient.} * First, such a definition is not very useful in practice, and second it's always good to have an empirical or secondary test. Sometimes proofs have errors, just like source code. Third, sometimes we do want to compare two different solutions that may not be optimal, but are the available candidates for reasons beyond our control. * {First, according to Stony Brook, solutions to the KnapsackProblem are the fourth most needed. It's obviously quite useful in practice. Second, since an algorithm, by itself, can't verify that a solution really is an optimal solution (you also need a proof that the algorithm is correct), you still have that problem and you've added the possibilities of errors in the implementation. Third, is completely irrelevant.} * That doesn't tell us what they are doing with the proof. They could be milking it for optimization implementation ideas/hints/clues, or using it for academic study only, for example. * {Huh? You're the one asking for a proof, so only you can tell us what you want with it.} * I didn't ask for KnapsackProblem proof. It's your analogy. We are getting away from practical issues anyhow. The real-world wants a "runnable" set of rules or processes. ''No, it's simply reasonable. I can trivially identify a cat when I see one. We can easily agree that we've both seen a cat. It's very, very difficult to get a computer program to reliably visually identify cats, and even more difficult if not impossible to define an algorithm to visually identify a cat.'' I'm sure there are many instances of possible cat-like animal configurations that would generate different answers among us. That something can be called a domain "requirement" may likely be at dispute, for example, in a chaotic office where different people are giving different implications or the lines of authority overlap and contradict. I will agree that if we had perfect knowledge of the domain, including likely change patterns, that the disagreements over the application of normalization would be far less common. ---------- See also: MultiParadigmDatabase, GodTable, FileSystemAlternatives ------- CategoryDatabase, CategoryWebDesign JulyThirteen