===============================================================================

First experimental codes ...

tools/import-cvs.tcl
tools/lib/rcsparser.tcl

No actual import, right now only working on getting csets right. The
code uses CVSROOT/history as foundation, and augments that with data
from the individual RCS files (commit messages).

Statistics of a run ...
        3516 csets.

        1545 breaks on user change
         558 breaks on file duplicate
          13 breaks on branch/trunk change
        1402 breaks on commit message change

Time statistics ...
        3297 were processed in <= 1 second (93.77%)
         217 were processed in between 2 seconds and 14 minutes.
           1 was processed in ~41 minutes
           1 was processed in ~22 hours

Time fuzz - Differences between csets range from 0 seconds to 66
days. Needs stats analysis to see if there is an obvious break. Even
so the times within csets and between csets overlap a great deal,
making time a bad criterion for cset separation, IMHO.

Leaving that topic, back to the current cset separator ...

It has a problem: The history file is not starting at the root!

Examples: The first three changesets are

        =============================/user
        M {Wed Nov 22 09:28:49 AM PST 2000} ericm 1.4 tcllib/modules/ftpd/ChangeLog
        M {Wed Nov 22 09:28:49 AM PST 2000} ericm 1.7 tcllib/modules/ftpd/ftpd.tcl
        files: 2
        delta: 0
        range: 0 seconds
        =============================/cmsg
        M {Wed Nov 29 02:14:33 PM PST 2000} ericm 1.3 tcllib/aclocal.m4
        files: 1
        delta:
        range: 0 seconds
        =============================/cmsg
        M {Sun Feb 04 12:28:35 AM PST 2001} ericm 1.9 tcllib/modules/mime/ChangeLog
        M {Sun Feb 04 12:28:35 AM PST 2001} ericm 1.12 tcllib/modules/mime/mime.tcl
        files: 2
        delta: 0
        range: 0 seconds

All csets modify files which already have several revisions. We have
no csets from before that in the history, but these csets are in the
RCS files.

I wonder, is SF maybe removing old entries from the history when it
grows too large ?

This also affects incremental import ...
I cannot assume that the history always grows. It may shrink ... I
cannot keep an offset, will have to record the time of the last
entry, or even the full entry processed last, to allow me to skip
ahead to anything not known yet.

I might have to try to implement the algorithm outlined below,
matching the revision trees of the individual RCS files to each other
to form the global tree of revisions. Maybe we can use the history to
help in the matchup, for the parts where we do have it.

Wait. This might be easier ... Take the delta information from the RCS
files and generate a fake history ... Actually, this might even allow
us to create a total history ... No, not quite, the merge entries the
actual history may contain will be missing. These we can mix in from
the actual history, as much as we have.

Still, let's try that, a fake history, and then run this script on it
to see if/where there are differences.

===============================================================================

Notes about CVS import, regarding CVS.

- Problem: CVS does not really track changesets, but only individual
  revisions of files. To recover changesets it is necessary to look at
  author, branch, timestamp information, and the commit messages. Even
  so this is only heuristic, not foolproof.

  Existing tool: cvsps.

  Processes the output of 'cvs log' to recover changesets. Problem:
  Sees only a linear list of revisions, does not see branchpoints,
  etc. Cannot use the tree structure to help in making the decisions.

- Problem: CVS does not track merge-points at all. Recovery through
  heuristics is brittle at best, looking for keywords in commit
  messages which might indicate that a branch was merged with some
  other.


Ideas regarding an algorithm to recover changesets.

Key feature: Uses the per-file revision trees to help in uncovering
the underlying changesets and global revision tree G.

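
To make "per-file revision tree" concrete: the parent links of such a
tree can be read off the RCS revision numbers alone. The sketch below
is in Python rather than the Tcl of the import tools, and the names
rev_key, rev_str and parent_map are hypothetical, not taken from
tools/lib/rcsparser.tcl. It assumes plain RCS numbering (two
components on the trunk, 1.2.2.1 as the first revision of a branch
rooted at 1.2). A real importer would more likely follow the
next/branches fields of the ,v file.

```python
from collections import defaultdict


def rev_key(rev):
    """Turn '1.2.2.1' into (1, 2, 2, 1) so that 1.10 sorts after 1.9."""
    return tuple(int(part) for part in rev.split("."))


def rev_str(key):
    """Inverse of rev_key."""
    return ".".join(str(part) for part in key)


def parent_map(revisions):
    """Map each revision number to its parent revision (None for the root)."""
    # Group revisions by line of development. All trunk revisions share
    # the line (); branch revision 1.2.2.1 lives on line (1, 2, 2).
    lines = defaultdict(list)
    for rev in revisions:
        key = rev_key(rev)
        line = key[:-1] if len(key) > 2 else ()
        lines[line].append(key)

    parents = {}
    for line, keys in lines.items():
        keys.sort()
        # Branch line (1, 2, 2) hangs off revision 1.2. The trunk has
        # no branch point, so its first revision is the root.
        prev = line[:-1] if line else None
        for key in keys:
            parents[rev_str(key)] = rev_str(prev) if prev else None
            prev = key
    return parents


tree = parent_map(["1.1", "1.2", "1.3", "1.2.2.1", "1.2.2.2"])
```

Within a line each revision is linked to the previous existing one, so
gaps left by deleted revisions do not break the chain.
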
The per-file revision tree for a file X is in essence the global
revision tree with all nodes not pertaining to X removed from it. In
the reverse this allows us to build up the global revision tree from
the per-file trees by matching nodes to each other and extending.

Start with the per-file revision tree of a single file as initial
approximation of the global tree. All nodes of this tree refer to the
revision of the file belonging to it, and through that the file
itself. At each step the global tree contains the nodes for a finite
set of files, and all nodes in the tree refer to revisions of all
files in the set, making the mapping total.

To add a file X to the tree take the per-file revision tree R and
perform the following actions:

- For each node N in R use the tuple (author, branch, commit message)
  to identify a set of nodes in G which may match N. Use the timestamp
  to locate the node nearest in time.

- This process will leave nodes in R unmapped. If there are unmapped
  nodes which have no neighbouring mapped nodes we have to abort.

  Otherwise take the nodes which have mapped neighbours. Trace the
  edges and see which of these nodes are connected in the local
  tree. Then look at the identified neighbours and trace their
  connections.

  If two global nodes have a direct connection, but a multi-edge
  connection in the local tree, insert global nodes mapping to the
  local nodes and map them together. This expands the global tree to
  hold the revisions added by the new file.

  Otherwise, if both sides have multi-edge connections, abort. This
  looks like a merge of two different branches, but there are no such
  in CVS ... Wait ... sort the nodes over time and fit the new nodes
  in between the other nodes, per the timestamps. We have overlapping
  / alternating changes to one file and others.

  A last possibility is that a node is only connected to a mapped
  parent. This may be a new branch, or again an alternating change on
  the given line. Symbols on the revisions will help to map this.

- We now have an extended global tree which incorporates the revisions
  of the new file. However new nodes will refer only to the new file,
  and old nodes may not refer to the new file. This has to be fixed,
  as all nodes have to refer to all files.

  Run over the tree and look at each parent/child pair. If a file is
  not referenced in the child, but is in the parent, then copy a
  reference to the file revision on the parent forward to the
  child. This signals that the file did not change in the given
  revision.

- After all files have been integrated in this manner we have a global
  revision tree capturing all changesets, including the unchanged
  files per changeset.


This algorithm has to be refined to also take Attic/ files into
account.

-------------------------------------------------------------------------

Two archive files mapping to the same user file. How are they
interleaved ?

(a) sqlite/src/os_unix.h,v
(b) sqlite/src/Attic/os_unix.h,v

Problem: Max version of (a) is 1.9
         Max version of (b) is 1.11
         cvs co 1.10 -> no longer in the repository.

This seems to indicate that the non-Attic file is relevant.

--------------------------------------------------------------------------

tcllib - more problems - tklib/pie.tcl,v -

invalid change text in
/home/aku/Projects/Tcl/Fossil/Devel/Examples/cvs-tcllib/tklib/modules/tkpiechart/pie.tcl,v

Possibly braces ?
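
--------------------------------------------------------------------------

Back to the propagation step of the algorithm above (copying a file
reference from the parent forward to a child which lacks it). A
minimal sketch in Python follows. The node layout is an assumption
made for illustration only: a dict of child lists plus a dict mapping
each node to its {file: revision} references. The name
propagate_unchanged is hypothetical as well.

```python
def propagate_unchanged(children, files_at, root):
    """Copy file references from each parent to children that lack them.

    children: dict node -> list of child nodes (a tree rooted at root)
    files_at: dict node -> dict {file path: revision}, modified in place
    """
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            for path, rev in files_at[parent].items():
                # A file missing from the child did not change there,
                # so the child inherits the revision seen at the parent.
                files_at[child].setdefault(path, rev)
            # The child is visited only after it has been filled in,
            # so inherited references cascade further down the tree.
            stack.append(child)
    return files_at


children = {"c1": ["c2"], "c2": ["c3", "b1"]}
files_at = {
    "c1": {"a.tcl": "1.1", "b.tcl": "1.1"},
    "c2": {"a.tcl": "1.2"},
    "c3": {"b.tcl": "1.2"},
    "b1": {"a.tcl": "1.2.2.1"},
}
propagate_unchanged(children, files_at, "c1")
```

The sketch copies references forward unconditionally, so it does not
yet know about removed files. That is part of the Attic/ refinement
noted above.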