Preserving node names in a digraph

I am constructing a large digraph with 10k to 100k nodes, in which I want to add, delete, and merge nodes. The nodes represent objects whose data is stored externally and indexed numerically, so the node IDs must be preserved in order to reference the related data correctly.
Is there a way of preserving node ids in a graph, other than giving the nodes string names?
In the following code
from_node=[1 1 2 3 4 4 5 6 7 3];
to_node= [3 2 5 7 6 5 7 7 2 4];
weights=rand(size(from_node));
g=digraph(from_node, to_node, weights); % digraph expects source nodes first, then target nodes
h=rmnode(g,2);
When you remove node 2, rmnode renumbers the remaining nodes, so a different node becomes node 2, unless you specify node names, which must be strings, like this:
from_node=[1 1 2 3 4 4 5 6 7 3];
to_node= [3 2 5 7 6 5 7 7 2 4];
weights=rand(size(from_node));
names = cellstr(string(1:7));
g=digraph(from_node, to_node, weights, names); % digraph expects source nodes first, then target nodes
h=rmnode(g,findnode(g,num2str(2)));
This is fine for small graphs, but for very large graphs that must be modified it is extremely memory-inefficient, since you are forced to store a giant table of strings that is redundant with the numeric node IDs.
Moreover, each lookup then needs a findnode search that involves converting the number to a string, which could also be costly if done many times.
Therefore, I am wondering if there is a more efficient way of preserving node ids upon insertion/deletion than using the names?
Thanks!!

Accepted Answer

Walter Roberson 2018-3-1
Convert the node numbers to base 2^16, apply char() to the result, and use those as the node names. For node numbers no larger than 100k, this takes two characters (4 bytes) per name, plus any overhead from the cell array.
2 Comments
Michael 2018-3-1
How would you suggest doing this efficiently? dec2base only goes up to base 36 and I don't want to impose a strong computational load on this if I have to convert thousands of indices at once.
Thanks!
Walter Roberson 2018-3-1
Labels = char(reshape(typecast(uint32(Indices),'uint16').',2,[]).');
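A minimal sketch of the same packing idea with a round trip back to numbers, in case it helps to check the conversion before using it on real data. The sample IDs and the variable names `words` and `decoded` are made up for illustration; the sketch only assumes IDs below 2^32 and does not build a graph.

```matlab
% Hypothetical sample of external node IDs, given as a column vector.
Indices = uint32([1; 2; 70000; 100000]);

% Encode: view each uint32 as two uint16 words, one row of two chars per ID.
words  = reshape(typecast(Indices, 'uint16'), 2, []).';   % N-by-2 uint16
Labels = char(words);                                      % N-by-2 char

% Decode: turn each row of two chars back into the original uint32.
decoded = typecast(reshape(uint16(Labels).', [], 1), 'uint32');

disp(isequal(decoded, Indices));
```

Because both directions go through typecast, the byte order of the machine does not matter as long as encoding and decoding happen on the same platform.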


More Answers (1)

Christine Tobler 2018-3-1
Edited: Christine Tobler 2018-3-1
Unfortunately, there is no direct way of doing this. The graph and digraph classes are designed to be fast when working on an existing graph, but this came at the cost of being relatively slow when adding and removing nodes one at a time.
To avoid having to convert the numbers to strings, you could construct and maintain two vectors that convert between the external indices and the graph indices. For example:
maxExtInd = 1e6;
s = [1234 6543 765];
t = [6543 765 1234];
% graph2ext(indexIntoGraph) returns externalIndex
graph2ext = unique([s(:); t(:)]);
% ext2graph(externalIndex) returns indexIntoGraph
% (or zero if externalIndex is not in the graph)
ext2graph = sparse(maxExtInd, 1);
ext2graph(graph2ext) = 1:numel(graph2ext);
% Construct the graph:
g = graph(full(ext2graph(s)), full(ext2graph(t)));
graph2ext(g.Edges.EndNodes)
plot(g, 'NodeLabel', graph2ext);
conversionTable = [find(ext2graph(:)), nonzeros(ext2graph)]
% Add a node:
newNode = 456;
assert(ext2graph(newNode) == 0); % Check the node ID is not already in the graph
g = addnode(g, 1);
graph2ext(end+1) = newNode;
ext2graph(newNode) = numnodes(g);
figure;
plot(g, 'NodeLabel', graph2ext);
conversionTable = [find(ext2graph(:)), nonzeros(ext2graph)]
% Remove a node:
nodeToRemove = 1234;
graphNodeToRemove = ext2graph(nodeToRemove);
g = rmnode(g, graphNodeToRemove);
graph2ext(graphNodeToRemove) = [];
ext2graph(nodeToRemove) = 0;
ext2graph(ext2graph > graphNodeToRemove) = ext2graph(ext2graph > graphNodeToRemove) - 1;
figure;
plot(g, 'NodeLabel', graph2ext);
conversionTable = [find(ext2graph(:)), nonzeros(ext2graph)]
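Since removing nodes one at a time is the slow path, the removal step above can also be applied to several external IDs in a single rmnode call. The following is only a sketch under that assumption: the sample edges and the names `extToRemove` and `graphToRemove` are hypothetical, and instead of decrementing the shifted entries it rebuilds ext2graph from the updated graph2ext.

```matlab
maxExtInd = 1e6;
s = [1234 6543 765 456];
t = [6543 765 1234 1234];

graph2ext = unique([s(:); t(:)]);          % graph index -> external ID
ext2graph = sparse(maxExtInd, 1);          % external ID -> graph index (0 = absent)
ext2graph(graph2ext) = 1:numel(graph2ext);
g = graph(full(ext2graph(s)), full(ext2graph(t)));

% Remove several external IDs with a single rmnode call.
extToRemove   = [1234; 456];
graphToRemove = full(ext2graph(extToRemove));
assert(all(graphToRemove > 0), 'Every ID to remove must be in the graph.');

g = rmnode(g, graphToRemove);
graph2ext(graphToRemove) = [];             % remaining nodes keep their relative order

% Rebuild the reverse map from the updated forward map.
ext2graph = sparse(graph2ext, 1, (1:numel(graph2ext)).', maxExtInd, 1);
```

Rebuilding the reverse map costs one pass over the remaining nodes, which may be cheaper than repeated single-node updates when many nodes are removed together.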
2 Comments
Michael 2018-3-2
Thank you! I think this is a very nice solution. I'm wondering your thoughts about the trade-off between speed and memory in this particular situation.
In this case, we have to maintain the graph plus a 1-by-num_nodes vector of 8-byte doubles and a 1-by-num_nodes sparse vector filled with 8-byte doubles, versus a 1-by-num_nodes table column of 4-byte 1x2 char arrays. For 1000 entries, factoring in the overhead of the table object, I think keeping the table of char arrays gives a ~6.5x memory saving. On the other hand, with the vectors we don't have to mess with conversions.
Do you think the lookup in the sparse/double array will be faster than calling findnode with the proper node name?
I was able to improve on the speed of the previous conversion suggestion, assuming fewer than 2^32 entries (and num given as a column vector), using
char([floor(num./65536) rem(num,65536)])
to convert and
sum(double(cell2mat(nodenames)).*[65536 1],2)
to reverse, but there is certainly overhead using the findnode() functions within the digraph object and conversion to cell arrays of chars needed to use the digraph object.
You mentioned that it is optimized to be fast for operations but slow for manipulation and I see this to be true. When testing out my code, the biggest overhead is in adding an edge which calls expandTable(), which is very costly.
What is it about the table object that makes it optimal to design the graph object using it rather than just defining the nodes as a sparse and the edges as either a binary sparse or double sparse in the case of a weighted digraph? I'm very interested in what data structures are best for what jobs.
Thanks so much!
Christine Tobler 2018-3-5
Hi Michael,
With the table char array, you should factor in not only the cost for each 4-byte char array, but also the additional mxArray header (which for each element of a cell array, specifies its datatype and additional information). Also, the sparse array indexing will do a binary search, which the graph object's findnode is not (currently) doing on its node names.
You're right about expandTable being the main overhead. If there are no node and edge properties (that is, if you store node names and edge weights separately during the loop), this overhead should decrease drastically.
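One way this could look in practice is sketched below: the graph holds structure only while it is being modified, the weights live in a separate sparse matrix, and the Weight column is attached once at the end. The random data, the sizes, and the names `W`, `sAll`, and `tAll` are assumptions made for illustration, not part of the original code.

```matlab
rng(0);                                  % reproducible hypothetical data
n = 50;
g = digraph();                           % no node or edge properties attached
g = addnode(g, n);
W = sparse(n, n);                        % weights stored outside the graph

for k = 1:200
    s = randi(n);
    t = randi(n);
    if s ~= t && findedge(g, s, t) == 0  % skip self-loops and duplicate edges
        g = addedge(g, s, t);            % structure only, no Weight column
        W(s, t) = rand();
    end
end

% Attach the weights once, in the order the graph stores its edges.
[sAll, tAll] = findedge(g);
g.Edges.Weight = full(W(sub2ind([n n], sAll, tAll)));
```

Looking the weights up through findedge matters because the graph keeps its edge list in its own order, which is generally not the order in which the edges were added.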
The table object is not used to represent the structure of the graph object internally; we use it only for the node and edge properties. There, it has the advantage of allowing properties of arbitrary datatypes to be stored in a simple manner. For cases where the graphs are modified many times, there is unfortunately a large overhead associated with the nodes and edges tables.
